Lucene comes with some "demo" applications that demonstrate how to use
it.  I translated one of them, IndexFiles, from Java into Jython.  It
still works pretty much the same way, but takes an extra ten seconds
to start up.  Here's the code, which is noticeably but not
dramatically shorter than the Java version.

I wrote this because, for the Nth time, I had written up a list of
desiderata for a full-text index of my mail, and I realized that
Lucene had pretty much every item on my list.  So maybe I'd be better
off biting the bullet and using Java and Lucene instead of writing
another text indexer from scratch.

Jython seems to make Java a *lot* easier to deal with.  Just being
able to interactively import a package or class, inspect its
attributes, instantiate it, and so on, makes a big difference in my
experience of using Java.  (I wish it included a way to interactively
inspect the signatures and doc comments of the things thus inspected.)
Java now works out of the box on Debian, thanks to `gij`, which is
another big plus.  Even Sun's Java is supposed to be free software
now, although I haven't looked lately to see whether they've finished
that process.  It's too bad my laptop is still too small and slow to
run Eclipse, and for some reason my `gcj-4.1` documentation is
missing.

#!/usr/bin/env jython

"""A Jython version of org.apache.lucene.demo.IndexFiles, the Lucene demo.

I haven't gotten this working in `jythonc` yet, because of what I
think is a classpath problem.
"""

# Because this is a modified version of IndexFiles.java from the
# Lucene distribution, it carries the same licensing:

# Copyright 2004 The Apache Software Foundation
# Copyright 2008 Kragen Javier Sitaker

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys

from java.util import Date
from java.io import IOException, File, FileNotFoundException

from org.apache.lucene.index import IndexWriter
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.demo import FileDocument

# Jython 2.1 implements Python 2.1, which predates True and False.
# Fortunately it does seem to map 1 to Java "true" when appropriate,
# which is kind of scary.
True = 1

def main(argv):
    usage = "jython %s <root_directory>" % argv[0]
    if len(argv) != 2:
        sys.stderr.write("Usage: %s\n" % usage)
        sys.exit(1)

    start = Date()

    try:
        writer = IndexWriter("index", StandardAnalyzer(), True)
        indexDocs(writer, File(argv[1]))

        writer.optimize()
        writer.close()

        print Date().getTime() - start.getTime(), "total milliseconds"

    except IOException, e:
        print " caught a", e.getClass()
        print " with message:", e.getMessage()

def indexDocs(writer, file):
    # do not try to index files that cannot be read
    if not file.canRead(): return
    if file.isDirectory():
        # "or []" because list() can return null after an IO error,
        # according to the comments in the Java original
        for ii in file.list() or []:
            indexDocs(writer, File(file, ii))
    else:
        print "adding", file
        try:
            writer.addDocument(FileDocument.Document(file))
        except FileNotFoundException, fnfe:
            # According to the Java original: at least on Windows,
            # some temporary files raise this exception with an
            # "access denied" message, and checking whether the file
            # can be read doesn't help.
            pass

if __name__ == '__main__': main(sys.argv)
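
The recursive walk in indexDocs can be tried on its own, without
Lucene or a JVM.  What follows is only a sketch in plain Python,
assuming CPython's `os` module stands in for `java.io.File`.  The name
`collect_files` is made up for this illustration and is not part of
the Lucene demo, and the sort is an addition to make the order
predictable, since `File.list()` promises no particular order.

```python
import os


def collect_files(path, found=None):
    """Return the readable regular files under path, depth-first.

    Follows the same steps as indexDocs, but appends each path to a
    list instead of handing it to a Lucene IndexWriter.
    """
    if found is None:
        found = []
    # Like file.canRead(): skip anything we are not allowed to read
    # (this also covers paths that do not exist).
    if not os.access(path, os.R_OK):
        return found
    if os.path.isdir(path):
        try:
            names = os.listdir(path)
        except OSError:
            # Same role as "or []": a failed listing counts as empty.
            names = []
        # Assumption for this sketch only: sort for a stable order.
        names.sort()
        for name in names:
            collect_files(os.path.join(path, name), found)
    else:
        found.append(path)
    return found
```

Calling `collect_files` on a directory returns the list of paths that
the indexer would print after "adding", which is a cheap way to check
what a run would pick up before paying the Jython startup time.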
