Re: [pylucene-dev] HitCollector in PyLucene extremely slow

John Kleven Mon, 03 Sep 2007 22:33:26 -0700

Thanks Andi,

After looking at it for the last few hours, I'm
becoming pretty convinced that the crossover from java
to python is the problem in this case.

The collect() function is an inner loop of the Lucene
searching algorithms, so after instrumenting my code I
realized that for a typical *single* search, the
collect callback was executed 26,278 times.  
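
A minimal sketch of that kind of instrumentation, written as plain Python so it runs without PyLucene. The names CountingCollector and simulate_search are hypothetical, and the loop only stands in for Lucene's inner scoring loop; how the object is actually handed to the searcher depends on the PyLucene build and is not shown.

```python
class CountingCollector(object):
    """Hypothetical collector that only counts how often collect() is called."""

    def __init__(self):
        self.calls = 0

    def collect(self, doc_id, score):
        # Same shape as the callback discussed here: an int and a float in,
        # nothing returned.
        self.calls += 1


def simulate_search(collector, matching_docs):
    # Stand-in for the search loop: one callback per matching document.
    for doc_id, score in matching_docs:
        collector.collect(doc_id, score)


collector = CountingCollector()
simulate_search(collector, [(i, 1.0) for i in range(26278)])
print(collector.calls)
```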

I don't think this number of executions is abnormal.  My
documents are car names, and there are 540,000 documents
in the system.  If somebody searches for "1994 Audi
Station Wagon Quattro AWD" or the like, *every single*
document that had audi, or awd, ... or any single hit
would result in a callback execution.  I'm sure it
would be the same if not worse for the typical "web
search" use case (where the documents are a crawled
html page).  I could change the search from an OR to
an AND, but that creates problems w/ functionality ...
if somebody misspells a word, or enters too many
words, then an AND search mucks things up.
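
To illustrate why the OR form triggers so many more callbacks, here is a toy sketch using Python sets as made-up posting lists. The document ids and terms are invented for illustration and this is not how Lucene stores or scores postings; it only shows that OR collects the union of the matching documents while AND collects the intersection.

```python
# Hypothetical posting lists: term -> set of document ids containing it.
postings = {
    "audi": {1, 2, 3, 7},
    "awd": {2, 5, 7, 9},
    "quattro": {2, 7},
}
terms = ["audi", "awd", "quattro"]

# OR query: any document containing at least one term is a hit,
# so each of them would cost one collect() callback.
or_hits = set().union(*(postings[t] for t in terms))

# AND query: only documents containing every term are hits.
and_hits = set.intersection(*(postings[t] for t in terms))

print(len(or_hits), len(and_hits))
```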

You would know better, but it sounds like the GIL
acquisition could be the problem.  The collect API is
dirt simple, pass in an integer and a float, and
return nothing.  So I doubt marshalling could take
this long (although .... when you're talking 26,000
times per search, who knows).
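
A rough back-of-the-envelope for the per-callback cost, using the two average search times from the quoted message below and the 26,278 callbacks counted above. It assumes that single count is representative of the searches behind the averages, which was not measured, so treat the result as an order-of-magnitude estimate only.

```python
python_avg_s = 0.11      # avg search time with the Python collector (seconds)
java_avg_s = 0.0104      # avg search time with the Java-side collector (seconds)
callbacks = 26278        # collect() calls counted for one typical search

# Extra time attributable to going through Python, spread over the callbacks.
extra_s = python_avg_s - java_avg_s
per_call_us = extra_s / callbacks * 1e6

print("extra per search: %.4f s" % extra_s)
print("per callback: about %.1f microseconds" % per_call_us)
```

So the slowdown works out to a few microseconds per crossing, which adds up only because the callback sits in the inner loop.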

I have a sinking feeling that writing this in Java and
then writing the hooks for PyLucene is going to be my
only solution.  I have attempted just compiling
PyLucene from scratch in the past .... and have failed
miserably.  Any new docs/info/wiki entries on
compilation/hooks?

If anybody else is reading this and has utilized a
custom HitCollector via PyLucene extension classes,
did you experience a pretty dramatic slowdown?

Thanks again
J

--- Andi Vajda <[EMAIL PROTECTED]> wrote:

> 
> On Mon, 3 Sep 2007, John Kleven wrote:
> 
> > When using a HitCollector via PyLucene (i.e.,
> > overriding the collect() API) has anybody else noticed
> > a massive slowdown?
> >
> > Even if I set my collector to return immediately in
> > the collect(doc_id, score) callback, so not even
> > touching any of the ids or scores, on a collection of
> > 540,000 documents, I get an avg search time of .11
> > seconds.
> >
> > If I go through the standard IndexSearcher.search,
> > which still uses a hit collector on the Java backside
> > (TopDocCollector.java if interested), I get avg search
> > times of 0.0104 -- and it is actually doing something
> > (namely, tracking the highest scored docs in a
> > priority queue up to size 100).
> >
> > Is this order-of-magnitude slowdown something that I
> > can expect just because of the java->python callback
> > via the collect() function?
> >
> > To get this up to speed, is my only option to code my
> > collector in Java, add in the hooks, then compile a
> > custom (gulp) PyLucene version?
> 
> I don't know enough about what you're trying to do to have much of an
> opinion. It seems to me though, that you're comparing apples and
> oranges. In the python case you're using a HitCollector python
> customization that returns nothing and in the Java case you're using a
> TopDocCollector that actually does something.
> 
> If indeed it turns out that calling into Python from Java is the
> culprit, then your best bet is what you're suggesting.
> I doubt it, though. The only possibly expensive call apart from your
> python code is the acquiring of the python GIL (Global Interpreter
> Lock). If there is no contention for the GIL, it should be really fast
> acquiring it.
> 
> The rest of the Java->Python boundary crossing is the marshalling of
> Java objects into Python ones, the call to your method itself and the
> reverse marshalling of the return value.
> 
> Andi..
> _______________________________________________
> pylucene-dev mailing list
> [email protected]
> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
> 

