Thanks Andi,

After looking at it for the last few hours, I'm becoming pretty convinced that the crossover from Java to Python is the problem in this case.

The collect() function sits in the inner loop of the Lucene search algorithms. After instrumenting my code, I realized that for a typical *single* search the collect callback was executed 26,278 times. I don't think this number of executions is abnormal. My documents are car names, and there are 540,000 documents in the system. If somebody searches for "1994 Audi Station Wagon Quattro AWD" or the like, *every single* document that contains audi, or awd, ... or any other query term results in a callback execution. I'm sure it would be the same, if not worse, for the typical "web search" use case (where each document is a crawled HTML page).

I could change the search from an OR to an AND, but that creates problems with functionality: if somebody misspells a word, or enters too many words, then an AND search mucks things up.

You would know better, but it sounds like the GIL acquisition could be the problem. The collect API is dirt simple: pass in an integer and a float, and return nothing. So I doubt marshalling could take this long (although ... when you're talking 26,000 times per search, who knows).

I have a sinking feeling that writing this in Java and then writing the hooks for PyLucene is going to be my only solution. I have attempted compiling PyLucene from scratch in the past ... and have failed miserably. Are there any new docs/info/wiki entries on compilation/hooks?

If anybody else is reading this and has used a custom HitCollector via PyLucene extension classes, did you experience a pretty dramatic slowdown?

Thanks again
J

--- Andi Vajda <[EMAIL PROTECTED]> wrote:

>
> On Mon, 3 Sep 2007, John Kleven wrote:
>
> > When using a HitCollector via PyLucene (i.e.,
> > overriding the collect() API) has anybody else noticed
> > a massive slowdown?
> >
> > Even if I set my collector to return immediately in
> > the collect(doc_id, score) callback, so not even
> > touching any of the ids or scores, on a collection of
> > 540,000 documents, I get an avg search time of 0.11
> > seconds.
> >
> > If I go through the standard IndexSearcher.search,
> > which still uses a hit collector on the Java backside
> > (TopDocCollector.java if interested), I get avg search
> > times of 0.0104 -- and it is actually doing something
> > (namely, tracking the highest scored docs in a
> > priority queue up to size 100).
> >
> > Is this order-of-magnitude slowdown something that I
> > can expect just because of the java->python callback
> > via the collect() function?
> >
> > To get this up to speed, is my only option to code my
> > collector in Java, add in the hooks, then compile a
> > custom (gulp) PyLucene version?
>
> I don't know enough about what you're trying to do
> to have much of an opinion.
> It seems to me though, that you're comparing apples
> and oranges. In the python case you're using a
> HitCollector python customization that returns
> nothing and in the Java case you're using a
> TopDocCollector that actually does something.
>
> If indeed it turns out that calling into Python from
> Java is the culprit, then your best bet is what
> you're suggesting.
> I doubt it, though. The only possibly expensive call
> apart from your python code is the acquiring of the
> python GIL (Global Interpreter Lock). If there is
> no contention for the GIL, it should be really fast
> acquiring it.
>
> The rest of the Java->Python boundary crossing is
> the marshalling of Java objects into Python ones,
> the call to your method itself and the reverse
> marshalling of the return value.
>
> Andi..
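
For what it's worth, the numbers in this thread can be turned into a rough per-callback estimate. The sketch below is only back-of-the-envelope arithmetic plus a plain-Python timing loop; it does not touch PyLucene at all. The function names (per_call_overhead, noop_collect, time_noop_calls) are made up for illustration, and it assumes the 26,278-call count is representative of the searches behind the two averages, which may not hold exactly.

```python
import timeit

# Figures quoted in this thread.
PY_COLLECTOR_AVG_S = 0.11      # avg search time with a no-op Python collect()
JAVA_COLLECTOR_AVG_S = 0.0104  # avg search time via IndexSearcher.search
CALLBACKS_PER_SEARCH = 26278   # collect() calls counted for one typical search


def per_call_overhead(py_avg_s, java_avg_s, calls):
    """Extra seconds per collect() call implied by the two averages."""
    return (py_avg_s - java_avg_s) / calls


def noop_collect(doc_id, score):
    """Hypothetical stand-in for a collect() that returns immediately."""
    return None


def time_noop_calls(calls, repeat=5):
    """Best-of-N wall time for `calls` plain Python calls (no Java involved)."""
    def run():
        for doc_id in range(calls):
            noop_collect(doc_id, 1.0)
    return min(timeit.repeat(run, number=1, repeat=repeat))


if __name__ == "__main__":
    overhead = per_call_overhead(
        PY_COLLECTOR_AVG_S, JAVA_COLLECTOR_AVG_S, CALLBACKS_PER_SEARCH
    )
    print("implied overhead per callback: %.2f microseconds" % (overhead * 1e6))
    print("pure-Python no-op loop: %.4f s" % time_noop_calls(CALLBACKS_PER_SEARCH))
```

With the figures above, the implied cost comes out just under 4 microseconds per callback. Comparing that against the pure-Python loop on the same machine gives a rough idea of how much of the gap could be the Java->Python crossing (GIL acquisition, marshalling) rather than the Python function call itself.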
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
