Doug Cutting wrote:

Andrzej Bialecki wrote:

Doug Cutting wrote:

The graph just shows that they differ, not how much better or worse they are, since the baseline is not perfect. When the top-10 is 50% different, are those 5 different hits markedly worse matches to your eye than the five they've displaced, or are they comparable? That's what really matters.


Hmm. I'm not sure I agree with this. Your reasoning would be true if we were changing the ranking formula. But the goal of these patches, IMHO, is to return equally complete results using the same ranking formula.


But we should not assume that the ranking formula is perfect. Imagine a case where the high-order bits of the score are correct and the low-order bits are random. Then an optimization which changes local orderings does not actually affect result quality.


Yes, that's true, I could accept that. In these tests the score delta was something like 20 between hit #1 and hit #100, and the scores dropped off rapidly after the first 10 or 20 results. The problem is that many results _within_ this range (i.e. still in the area with large score deltas) were missing, which suggests the differences were also in the high-order bits.

Please re-run the script on your index, using typical queries, and check the results. It's possible that I made a mistake somewhere; it would be good to confirm at least the trends in the raw results.


I specifically avoided using normalized scores, instead using the absolute scores in TopDocs. And the absolute scores in both cases are exactly the same, for those results that are present.

What is wrong is that some results that should be there (judging by the ranking) are simply missing. So it's a question of recall, and the baseline index gives the best estimate of it.


Yes, this optimization, by definition, hurts recall. The only question is whether it substantially hurts relevance at, e.g., 10 hits. If the top-10 are identical then the answer is easy: no, it does not. But if they differ, we can only answer this by looking at the results. Chances are they're worse, but how much? Radically? Slightly? Noticeably?


The paper by Suel et al. that you referred to claims top-100 overlap as high as 98% after optimizations. What I observed were values between 0 and 60%, and pushing above that level caused a heavy performance loss.
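For what it's worth, a minimal sketch of how such a top-N overlap against the baseline could be measured; this uses a recent Lucene API (not the version current when this thread was written), and the stored "url" field used as a document key is an assumption about the index:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;

    public class TopNOverlap {

        /** Fraction of the baseline top-N that also appears in the optimized top-N. */
        static double overlap(IndexSearcher baseline, IndexSearcher optimized, Query q, int n)
                throws IOException {
            Set<String> baseKeys = topKeys(baseline, q, n);
            Set<String> optKeys = topKeys(optimized, q, n);
            optKeys.retainAll(baseKeys);
            return baseKeys.isEmpty() ? 1.0 : (double) optKeys.size() / baseKeys.size();
        }

        /** Collects a stable per-document key for the top-N hits; a stored "url" field is assumed. */
        static Set<String> topKeys(IndexSearcher searcher, Query q, int n) throws IOException {
            Set<String> keys = new HashSet<>();
            for (ScoreDoc sd : searcher.search(q, n).scoreDocs) {
                keys.add(searcher.doc(sd.doc).get("url"));
            }
            return keys;
        }
    }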


What part of Nutch are you trying to avoid? Perhaps you could try measuring your Lucene-only benchmark against a Nutch-based one. If they don't differ markedly then you can simply use Nutch, which makes it a stronger benchmark. If they differ, then we should figure out why.


Again, I don't see it this way. Nutch results will always be worse than pure Lucene, because of the added layers. If I can't improve the performance in the Lucene code (which takes > 85% of the time for every query), then no matter how well optimized the Nutch code is, it won't get far.


But we're mostly modifying Nutch's use of Lucene, not modifying Lucene. So measuring Lucene alone won't tell you everything, and you'll keep having to port Nutch stuff. If you want to, e.g., replay a large query log to measure average performance, then you'll need things like auto-filterization, n-grams, query plugins, etc., no?


Perhaps we misunderstood each other - I'm using an index built by Nutch; there's no substitute for that, I agree. It was just more convenient for me to skip the Nutch classes for _querying_ alone, because it was easier to control the exact final form of the Lucene query - especially when you want to experiment quickly with a lot of variables that are not (yet) parametrized through the config files. Either way you end up with a plain Lucene query; the difference is that going through Nutch you don't know exactly how much time was spent on translating the query, building filters, etc. *shrug* You can do it either way, I agree.
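As a rough illustration of that workflow (not the actual script discussed here), a small driver that takes an already-translated Lucene query string on the command line and runs it directly against the index might look like this; it uses a recent Lucene API, and the "content" default field and stored "url" field are assumptions about the index schema:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class QueryDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = index directory, args[1] = already-translated Lucene query string
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query q = new QueryParser("content", new StandardAnalyzer()).parse(args[1]);
                long start = System.nanoTime();
                TopDocs hits = searcher.search(q, 100);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("total: " + hits.totalHits + " (" + elapsedMs + " ms)");
                // Print absolute (unnormalized) scores, as in the comparison described above.
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(sd.score + "\t" + searcher.doc(sd.doc).get("url"));
                }
            }
        }
    }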


In several installations I use smaller values of slop (around 20-40). But this is motivated by better quality matches, not by performance, so I didn't test for this...


But that's a great reason to test for it! If lower slop can improve result quality, then we should certainly see if it also makes optimizations easier.


I forgot to mention this - the tests I ran already used the smaller values: the slop was set to 20.


Are they different if the slop is Integer.MAX_VALUE? It would be really good to determine what causes results to diverge, whether it is multiple terms (probably not), phrases (probably), and/or slop (perhaps). Chances are that the divergence is bad, that results are adversely affected, and that we need to try to fix it. But to do so we'll need to understand it.


Agreed. I'll try to re-run the tests with queries that set a different slop value, or omit the phrases completely (and that's quite easy to do with my approach - just use a different translated query on the command line ;-) ).
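For concreteness, a small sketch of where the slop parameter enters when building such phrase queries by hand; this uses a recent Lucene API, and the field and terms are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;

    public class SlopVariants {

        /** Builds a phrase query over the given words with the given slop. */
        static Query phrase(String field, int slop, String... words) {
            PhraseQuery.Builder b = new PhraseQuery.Builder();
            for (String w : words) {
                b.add(new Term(field, w));
            }
            b.setSlop(slop); // 0 = exact phrase; Integer.MAX_VALUE effectively removes the proximity constraint
            return b.build();
        }

        public static void main(String[] args) {
            // The two variants under discussion: a tight slop vs. an effectively unbounded one.
            System.out.println(phrase("content", 20, "information", "retrieval"));
            System.out.println(phrase("content", Integer.MAX_VALUE, "information", "retrieval"));
        }
    }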


That's another advantage of using Lucene directly in this script - you can provide any query structure on the command-line without changing the code in Nutch.


But that just means that we should set the SLOP constant in BasicQueryFilter.java from a configuration property, and permit the setting of configuration properties from the command line, no?


Well, if you want to quickly experiment with radically different query translation, then no.
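To illustrate the shape of Doug's suggestion above - replacing the hard-coded SLOP constant with a configurable value that can also be overridden from the command line - here is a purely hypothetical sketch; the property name and the use of plain Java properties are stand-ins for whatever Nutch's actual configuration mechanism provides:

    import java.io.FileInputStream;
    import java.util.Properties;

    public class SlopConfig {
        // Hypothetical property name; in Nutch this would go through its own
        // configuration mechanism rather than a raw properties file.
        static final String SLOP_KEY = "query.phrase.slop";

        static int getSlop(Properties conf) {
            String value = conf.getProperty(SLOP_KEY);
            return value == null ? Integer.MAX_VALUE : Integer.parseInt(value);
        }

        public static void main(String[] args) throws Exception {
            Properties conf = new Properties();
            if (args.length > 0) {
                try (FileInputStream in = new FileInputStream(args[0])) {
                    conf.load(in);
                }
            }
            // Allow overriding from the command line, e.g. -Dquery.phrase.slop=20
            conf.putAll(System.getProperties());
            System.out.println("slop = " + getSlop(conf));
        }
    }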

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



