RE: best practice: 1.4 billions documents

Uwe Schindler Thu, 25 Nov 2010 01:24:55 -0800

You are in trouble if you use MultiTermQuery subclasses as a negative clause in a 
BooleanQuery, e.g. a range like "-[A TO B]", or even NumericRanges or Wildcards. 
The query will then return incorrect results.
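
Put differently, sorting by date only reorders the hits; it does not decide
which documents are hits in the first place. The following is a toy sketch in
plain Java, not Lucene code: the Doc class, the search() helper and the
excludeRange flag are all made up for illustration, and switching the flag off
merely simulates a negative clause that did not take effect. It is not a model
of what MultiSearcher does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy model only: plain Java, no Lucene classes. All names are hypothetical.
final class NegativeClauseDemo {

    static final class Doc {
        final String key; // field the "-[A TO B]" clause is applied to
        final int date;   // sort field, e.g. 20101125
        Doc(String key, int date) { this.key = key; this.date = date; }
    }

    // True if key lies in the inclusive range [lower, upper].
    static boolean inRange(String key, String lower, String upper) {
        return key.compareTo(lower) >= 0 && key.compareTo(upper) <= 0;
    }

    // Returns the keys of all docs, minus those in [lower, upper] when
    // excludeRange is true, sorted by date ascending.
    // excludeRange == false simulates a negative clause that was not applied.
    static List<String> search(List<Doc> docs, String lower, String upper,
                               boolean excludeRange) {
        List<Doc> hits = new ArrayList<Doc>();
        for (Doc d : docs) {
            if (excludeRange && inRange(d.key, lower, upper)) {
                continue; // dropped by the negative clause
            }
            hits.add(d);
        }
        // Sorting changes the order of the hits, never their membership.
        Collections.sort(hits, new Comparator<Doc>() {
            public int compare(Doc a, Doc b) {
                return Integer.compare(a.date, b.date);
            }
        });
        List<String> keys = new ArrayList<String>();
        for (Doc d : hits) {
            keys.add(d.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("C", 20101103),
            new Doc("A", 20101101),
            new Doc("B", 20101102));
        System.out.println(search(docs, "A", "B", true));  // [C]
        System.out.println(search(docs, "A", "B", false)); // [A, B, C]
    }
}
```

In the second call the documents that should have been excluded are still
returned, neatly ordered by date, which is why ignoring the score does not
make the problem go away.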


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Ganesh [mailto:emailg...@yahoo.co.in]
> Sent: Thursday, November 25, 2010 9:55 AM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> Thanks for the input.
> 
> My results are sorted by date and I am not much bothered about score. Will I
> still be in trouble?
> 
> Regards
> Ganesh
> 
> 
> ----- Original Message -----
> From: "Robert Muir" <rcm...@gmail.com>
> To: <java-user@lucene.apache.org>
> Sent: Thursday, November 25, 2010 1:45 PM
> Subject: Re: best practice: 1.4 billions documents
> 
> 
> On Thu, Nov 25, 2010 at 2:58 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > ParallelMultiSearcher, as a subclass of MultiSearcher, has the same problems.
> > These are not crashes; rather, some queries do not return correctly scored
> > results. This especially affects all MultiTermQueries (TermRange, Fuzzy,
> > NumericRange, Wildcard, Prefix) if they are used in a negative fashion (using
> > MUST_NOT, or "-" in QueryParser). For all of those queries except Fuzzy, you
> > are safe if you use CONSTANT_SCORE_REWRITE_METHOD (using setRewriteMethod).
> > The same problems apply to span queries. For *all* Fuzzy queries (negative or
> > not), the scores are simply wrong, so scoring is broken with
> > (Parallel)MultiSearcher; wrong results are only returned with negative
> > clauses!
> >
> 
> you can use constant score rewrite method with fuzzy, too. then it
> will work "correctly" (even negative) with multisearcher too. but it
> will be slow, with unbounded number of results, and the fuzziness will
> not affect the scoring. (this is what constant score rewrite implies)
> 
> the reason i say "correctly" is that for all of these queries,
> constant score rewrite is just a general workaround, and might still
> be incorrect. This is because many queries often have special cases
> where they rewrite to simpler things and in general the MultiSearcher
> combine() logic is broken here, so there might be more problems.
> 
> > A new class, ParallelIndexSearcher, could help with that by parallelizing
> > across multiple segments; it is still in the planning phase. The difference
> > from ParallelMultiSearcher would be that it takes a "single" IndexReader (e.g.
> > a MultiReader in your case) and parallelizes per segment or bunch of segments.
> >
> 
> Besides the inherited broken-ness from multisearcher,
> parallelmultisearcher is broken further because it requires you to
> organize your index structure in a special way to get concurrency.
> 
> This is all pretty silly though, since ParallelMultiSearcher on a
> single machine isn't going to increase QPS, so how useful really is it
> in general???
> 
> we should deprecate both the broken Multi & ParallelMulti Searchers
> and never look back.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


