Sorry for the delay.  I have a concise test case now.  See below for inline
comments.  Code is at the bottom.

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Andi Vajda
> Sent: Monday, March 26, 2007 9:00 PM
> 
> On Mon, 26 Mar 2007, Ofer Nave wrote:
> > Ever since I started using a custom Analyzer and TokenFilter, my
> > index build script keeps crashing.  Usually it just freezes at a
> > random point, and won't even respond to ctrl-c (I have to use kill -9
> > in another terminal).  One time it ended with: 'Fatal Python error:
> > This thread state must be current when releasing'.  One time it
> > finished successfully (out of about 20 attempts).  This is from
> > repeated runs without changing any code.
> 
> If you submit a piece of code that reproduces the problem, I 
> can take a look at it (best would be something like a unit 
> test, see PyLucene/test).

Haven't had time to look at the unit testing framework, but the code is
simple and runs standalone.

> Also, what is your OS? Did you build PyLucene yourself? If so, which
> gcj? Does 'make test' pass? What is your version of Python?

Linux 2.6.9
Python 2.3.4
Lucene/PyLucene versions are included in the sample output below.

I believe the admin compiled PyLucene from source. The box has gcj version
3.4.5 20051201.

Sample code:

---
#!/usr/bin/python
import sys
import PyLucene

def main():
    print 'PyLucene', PyLucene.VERSION, 'Lucene', PyLucene.LUCENE_VERSION
    data = dict(album='Hail To The Thief', artist='Radiohead',
                ASIN='B000092ZYX')
    directory = '/tmp/crash'
    store = PyLucene.FSDirectory.getDirectory(directory, True)
#    store = PyLucene.RAMDirectory()
#    analyzer = PyLucene.StandardAnalyzer()
    analyzer = TermJoinAnalyzer()
    writer = PyLucene.IndexWriter(store, analyzer, True)
    docs = 0
    while True:
        doc = PyLucene.Document()
        doc.add(PyLucene.Field('album', data['album'],
                               PyLucene.Field.Store.YES,
                               PyLucene.Field.Index.TOKENIZED))
        doc.add(PyLucene.Field('artist', data['artist'],
                               PyLucene.Field.Store.YES,
                               PyLucene.Field.Index.TOKENIZED))
        doc.add(PyLucene.Field('ASIN', data['ASIN'],
                               PyLucene.Field.Store.YES,
                               PyLucene.Field.Index.UN_TOKENIZED))
        writer.addDocument(doc)
        docs += 1
        if docs % 5000 == 0:
            print docs

class TermJoinTokenFilter(object):
    TOKEN_TYPE_JOINED = "JOINED"
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream
        self.a = None
        self.b = None
    def __iter__(self):
        return self
    def next(self):
        if self.a:
            # emitted prev last time - need to set next, emit prev + next,
            # and reset prev to None
            self.b = self.tokenStream.next()
            if self.b is None:
                return None
            joined = PyLucene.Token(self.a.termText() + self.b.termText(),
                                    self.a.startOffset(), self.a.endOffset(),
                                    self.TOKEN_TYPE_JOINED)
            joined.setPositionIncrement(0)
            self.a = None
            return joined
        elif self.b:
            # emitted prev + next last time - need to emit next, set prev
            # to next, and reset next to None
            self.a = self.b
            self.b = None
            return self.a
        else:
            # first call ever - set prev to first token and emit the first
            # token
            self.a = self.tokenStream.next()
            return self.a

class TermJoinAnalyzer(object):
    def __init__(self, analyzer=PyLucene.StandardAnalyzer()):
        self.analyzer = analyzer
    def tokenStream(self, fieldName, reader):
        tokenStream = self.analyzer.tokenStream(fieldName, reader)
        filter = TermJoinTokenFilter(tokenStream)
        return tokenStream.tokenFilter(filter)

main()
---

It builds an index in /tmp/crash.  You can change the path, or to avoid
disk, switch which Directory instantiation line is commented out.

It uses my TermJoinAnalyzer class to demonstrate the crash.  To demonstrate
how the same code runs fine with StandardAnalyzer, switch which Analyzer
instantiation line is commented out.
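For clarity, here's the token sequence the filter is meant to produce,
sketched in plain Python over a list of term strings (no PyLucene needed;
the function name is just for this sketch):

```python
# Plain-Python sketch of the intended TermJoinTokenFilter output, using
# term strings instead of PyLucene Tokens.  Between each pair of adjacent
# terms it emits the two terms joined together (the real filter indexes
# the joined term at position increment 0).
def join_terms(terms):
    out = []
    prev = None
    for term in terms:
        if prev is not None:
            out.append(prev + term)  # joined pair
        out.append(term)
        prev = term
    return out

print(join_terms(['hail', 'to', 'the', 'thief']))
# ['hail', 'hailto', 'to', 'tothe', 'the', 'thethief', 'thief']
```

That's the whole point of the filter - the crash happens regardless of which
field it's applied to.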

I ran it with TermJoinAnalyzer three times, and all three times it crashed
within seconds - with three different errors, no less.  :)  When I ran it
with StandardAnalyzer, it worked fine for several minutes before I killed
it.

Here's the output from the three crashes:

---
[EMAIL PROTECTED] ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
15000
20000
25000
Fatal Python error: auto-releasing thread-state, but no thread-state for
this thread
Aborted
[EMAIL PROTECTED] ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
15000
20000
25000
30000
35000
Fatal Python error: This thread state must be current when releasing
Aborted
[EMAIL PROTECTED] ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
Traceback (most recent call last):
  File "bin/tmp.py", line 57, in ?
    main()
  File "bin/tmp.py", line 19, in main
    writer.addDocument(doc)
PyLucene.JavaError: java.lang.NullPointerException
---

-ofer

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
