On Thu, 17 Jan 2008, Brian Merrell wrote:

The docs are pretty standard English language documents.  I can try to find
some decent spam (I hesitate to give out the actual documents for privacy
reasons) but I imagine anything you have would reproduce the problem.  I am
using the basic MeetLucene Indexer.py with the following Analyzer:

It would be helpful if all I had to do to reproduce the problem was to type a one-liner in a shell. With what you sent me, I still have to figure out how to fit your code into the Indexer.py program. It's not rocket science, but since you've already done it, it's less work for you to send it via email than for me to reconstruct it.

Maybe my assumptions are wrong and it's more work for you too?

Now, that being said, looking at your code, it's quite possible that the leak is with the BrianFilter instances. You can verify this by checking the number of PythonTokenFilter instances left in the env after each document is indexed. I'm assuming, possibly wrongly, that BrianAnalyzer's tokenStream() method is called once for every document. If that's indeed the case, you are going to leak BrianFilter instances unless you call finalize() on them manually.

 1. to verify how many PythonTokenFilter instances you have in env:
      print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)
    where env is the result of the call to initVM() and can also be obtained
    by calling getVMEnv()

 2. to call finalize() on your BrianFilter instances, you first need to keep
    track of them. Then, after each document is indexed, call finalize() on
    the accumulated objects. Below is a modification of BrianAnalyzer to
    support this (assuming a single thread of execution per analyzer):

      class BrianAnalyzer(PythonAnalyzer):

          def __init__(self):
              super(BrianAnalyzer, self).__init__()
              self._filters = []

          def tokenStream(self, fieldName, reader):
              # keep a reference to each filter so that it can be
              # finalized once the document is indexed
              filter = BrianFilter(
                  LowerCaseFilter(StandardFilter(StandardTokenizer(reader))))
              self._filters.append(filter)
              return filter

          def finalizeFilters(self):
              # explicitly free the extension objects accumulated so far
              for filter in self._filters:
                  filter.finalize()
              del self._filters[:]

    after each document is indexed, call brianAnalyzer.finalizeFilters();
    a sketch of the complete loop follows below
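
Putting the two together, the indexing loop in Indexer.py could look roughly like the sketch below. The module import, the IndexWriter arguments and the makeDocuments() helper are placeholders on my part, not something from your code; keep whatever Indexer.py already does and only add the finalizeFilters() call and, if you want to watch the count, the _dumpRefs() check:

    # sketch only: module name, IndexWriter arguments and the document
    # loop are assumptions, adapt them to what Indexer.py already does
    from lucene import initVM, CLASSPATH, IndexWriter

    env = initVM(CLASSPATH)

    analyzer = BrianAnalyzer()
    writer = IndexWriter('/tmp/index', analyzer, True)

    for doc in makeDocuments():      # however Indexer.py builds its Documents
        writer.addDocument(doc)
        analyzer.finalizeFilters()   # release the accumulated BrianFilter instances

        # optional sanity check: this count should now stay flat
        print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)

    writer.optimize()
    writer.close()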

Andi..
