On Mon, 3 Sep 2007, John Kleven wrote:
When using a HitCollector via PyLucene (i.e., overriding the collect() API), has anybody else noticed a massive slowdown? Even if I set my collector to return immediately in the collect(doc_id, score) callback, so not even touching any of the ids or scores, on a collection of 540,000 documents I get an average search time of 0.11 seconds. If I go through the standard IndexSearcher.search, which still uses a hit collector on the Java side (TopDocCollector.java, if interested), I get average search times of 0.0104 seconds -- and it is actually doing something (namely, tracking the highest-scored docs in a priority queue up to size 100). Is this order-of-magnitude slowdown something that I can expect just because of the Java->Python callback via the collect() function? To get this up to speed, is my only option to code my collector in Java, add in the hooks, then compile a custom (gulp) PyLucene version?
I don't know enough about what you're trying to do to have much of an opinion. It seems to me though, that you're comparing apples and oranges. In the python case you're using a HitCollector python customization that returns nothing and in the Java case you're using a TopDocCollector that actually does something.
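To make the comparison like-for-like, the Python collector would need to do roughly the same work as TopDocCollector. Below is a minimal pure-Python sketch of that idea. It does not import PyLucene; the class names NoOpCollector and TopNCollector and the top_docs() helper are hypothetical, and only the collect(doc_id, score) signature is taken from the question.

```python
import heapq


class NoOpCollector:
    """Hypothetical collector that ignores every hit (the case measured in the question)."""

    def collect(self, doc_id, score):
        return


class TopNCollector:
    """Hypothetical collector that keeps the n highest-scoring hits,
    roughly what the question says TopDocCollector does on the Java side."""

    def __init__(self, n=100):
        self.n = n
        self.total_hits = 0
        self._heap = []  # min-heap of (score, doc_id); the weakest kept hit is at index 0

    def collect(self, doc_id, score):
        self.total_hits += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (score, doc_id))
        elif score > self._heap[0][0]:
            # New hit beats the weakest kept hit, so swap it in.
            heapq.heapreplace(self._heap, (score, doc_id))

    def top_docs(self):
        # Best score first.
        return [(doc_id, score) for score, doc_id in sorted(self._heap, reverse=True)]
```

Timing a search with each of these in turn would show how much of the 0.11 seconds is the callback itself and how much is work done inside collect().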
If indeed it turns out that calling into Python from Java is the culprit, then your best bet is what you're suggesting. I doubt it, though. The only possibly expensive call apart from your Python code is the acquiring of the Python GIL (Global Interpreter Lock). If there is no contention for the GIL, acquiring it should be really fast.
The rest of the Java->Python boundary crossing is the marshalling of Java objects into Python ones, the call to your method itself and the reverse marshalling of the return value.
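One way to separate these costs is to first measure what a no-op collect() call costs inside Python alone, with no Java involved. The sketch below does only that, so it gives a floor and says nothing about GIL acquisition or marshalling. The names NoOpCollector and per_call_seconds are hypothetical, and the default of 540,000 calls is just the collection size from the question, since the actual number of hits per query is not stated.

```python
import timeit


class NoOpCollector:
    """Hypothetical collector that returns immediately."""

    def collect(self, doc_id, score):
        return


def per_call_seconds(calls=540000, repeats=5):
    """Return the best observed average time of one no-op collect() call."""
    collector = NoOpCollector()

    def run():
        collect = collector.collect
        for doc_id in range(calls):
            collect(doc_id, 1.0)

    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(run, number=1, repeat=repeats))
    return best / calls


if __name__ == "__main__":
    print("seconds per no-op collect() call: %.3g" % per_call_seconds())
```

Whatever remains after subtracting this from the per-hit time seen through PyLucene would be attributable to the boundary crossing itself.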
Andi..
