On Thu, 17 Jan 2008, Brian Merrell wrote:

The docs are pretty standard English language documents.  I can try to find
some decent spam (I hesitate to give out the actual documents for privacy
reasons) but I imagine anything you have would reproduce the problem.  I am
using the basic MeetLucene Indexer.py with the following Analyzer:

It would be helpful if all I had to do to reproduce the problem was to type a one-liner in a shell. With what you sent me, I still have to figure out how to fit your code into the Indexer.py program. It's not rocket science, but since you've already done it, it's less work for you to send it via email than for me to reconstruct it.

Maybe my assumptions are wrong and it's more work for you too?

Now, that being said, looking at your code, it's quite possible that the leak is with the BrianFilter instances. You can verify this by checking the number of PythonTokenFilter instances left in the env after each document is indexed. I'm assuming, possibly wrongly, that BrianAnalyzer's tokenStream() method is called once for every document. If that's indeed the case, you are going to leak BrianFilter instances unless you call finalize() on them manually.

 1. to verify how many PythonTokenFilter instances you have in env:
      print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)
    where env is the result of the call to initVM() and can also be obtained
    by calling getVMEnv()

 2. to call finalize() on your BrianFilter instances, you first need to keep
    track of them. Then, after each document is indexed, call finalize() on
    the accumulated objects. Below is a modification of BrianAnalyzer to
    support this (assuming a single thread of execution per analyzer):

      class BrianAnalyzer(PythonAnalyzer):

          def __init__(self):
              super(BrianAnalyzer, self).__init__()
              self._filters = []

          def tokenStream(self, fieldName, reader):
              # keep a reference to each filter so that it can be
              # finalized once the document is indexed
              filter = BrianFilter(
                  LowerCaseFilter(StandardFilter(StandardTokenizer(reader))))
              self._filters.append(filter)
              return filter

          def finalizeFilters(self):
              # explicitly free the extension objects accumulated so far
              for filter in self._filters:
                  filter.finalize()
              del self._filters[:]

    after each document is indexed, call brianAnalyzer.finalizeFilters();
    a sketch of the complete loop follows below
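
Putting the two together, the indexing loop in Indexer.py could look roughly like the sketch below. The module import, the IndexWriter arguments and the makeDocuments() helper are placeholders on my part, not something from your code; keep whatever Indexer.py already does and only add the finalizeFilters() call and, if you want to watch the count, the _dumpRefs() check:

    # sketch only: module name, IndexWriter arguments and the document
    # loop are assumptions, adapt them to what Indexer.py already does
    from lucene import initVM, CLASSPATH, IndexWriter

    env = initVM(CLASSPATH)

    analyzer = BrianAnalyzer()
    writer = IndexWriter('/tmp/index', analyzer, True)

    for doc in makeDocuments():      # however Indexer.py builds its Documents
        writer.addDocument(doc)
        analyzer.finalizeFilters()   # release the accumulated BrianFilter instances

        # optional sanity check: this count should now stay flat
        print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)

    writer.optimize()
    writer.close()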

Andi..
