> you think that something like this -
>
> TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, 200, cluSort);

This is a little bit faster, as it does not need to intersect the match-all query with the filter.

> Would be more performant than using MatchAllDocsQuery with Filters like this -
>
> TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200, cluSort);

Slower, as the iterator over all document IDs needs to be intersected with the filtered iterator. It's just one more step.

> Thanks!
>
> - Mike
> aka...@gmail.com
>
>
> On Mon, Nov 30, 2009 at 12:06 PM, Michel Nadeau <aka...@gmail.com> wrote:
>
> > I'm currently trying something like this -
> >
> > TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200, cluSort);
> >
> > cluCF = filters
> > cluSort = sorts
> >
> > Now I have another question... is there a way to specify a "start from" so
> > I could get page 2, 3, 4, etc.?
> >
> > - Mike
> > aka...@gmail.com
> >
> >
> > On Mon, Nov 30, 2009 at 12:00 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> >> > And sorting is done by the collector, Lucene has no idea how to sort.
> >>
> >> Sorting is done by the internal collector behind the
> >> Top(Field)Docs-returning method (your own collectors would have to do it
> >> themselves). If you call search(Query, n,... Sort), internally a collector
> >> is created that does the sorting for you and throws away all results that
> >> do not fall into the first 200 hits (if n=200).
> >>
> >> > If you use Sort, the returned TopDocs will be sorted.
> >> >
> >> > If you do not sort at all and do not score your results, TopDocs is not
> >> > very useful, because the first 200 hits cannot be ranked.
> >> >
> >> > -----
> >> > Uwe Schindler
> >> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: u...@thetaphi.de
> >> >
> >> > > -----Original Message-----
> >> > > From: Michel Nadeau [mailto:aka...@gmail.com]
> >> > > Sent: Monday, November 30, 2009 5:35 PM
> >> > > To: java-user@lucene.apache.org
> >> > > Subject: Re: Performance problems with Lucene 2.9
> >> > >
> >> > > I'll definitely switch to a Collector.
> >> > >
> >> > > It's just not clear to me if I should use BooleanQueries or
> >> > > MatchAllDocuments+Filters?
> >> > >
> >> > > And should I write my own collector, or is the TopDocs one perfect for me?
> >> > >
> >> > > - Mike
> >> > > aka...@gmail.com
> >> > >
> >> > >
> >> > > On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson
> >> > > <erickerick...@gmail.com> wrote:
> >> > >
> >> > > > The problem with Hits is that it re-executes the query
> >> > > > every N documents, where N is 100 (?).
> >> > > >
> >> > > > So, a loop like
> >> > > > for (int idx = 0; idx < hits.length(); idx++) {
> >> > > >     // do something....
> >> > > > }
> >> > > >
> >> > > > Assuming my memory is right and it's every 100, your query will
> >> > > > re-execute (length/100) times. Which is unfortunate.
> >> > > >
> >> > > > The very quick test to see where to concentrate first would be to take
> >> > > > a time stamp just before you hit your loop.....
> >> > > >
> >> > > > This will tell you whether this loop is the culprit, but it really doesn't
> >> > > > matter because you'll follow the advice from Uwe and Shai anyway <G>.
> >> > > >
> >> > > > Filtering and Sorting are applied to Collectors before you see them.....
> >> > > >
> >> > > > The other bit would be to investigate your sorting. Remember that the
> >> > > > first sort or two take quite a while since the relevant caches are
> >> > > > populated when first used, so second+ queries should be faster. The
> >> > > > Wiki has some timing/speedup advice.....
> >> > > >
> >> > > > Best
> >> > > > Erick
> >> > > >
> >> > > >
> >> > > > On Mon, Nov 30, 2009 at 11:10 AM, Michel Nadeau <aka...@gmail.com> wrote:
> >> > > >
> >> > > > > What is the main difference between Hits and Collectors?
> >> > > > >
> >> > > > > - Mike
> >> > > > > aka...@gmail.com
> >> > > > >
> >> > > > >
> >> > > > > On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> > > > >
> >> > > > > > And if you only have a filter and apply it to all documents, make a
> >> > > > > > ConstantScoreQuery on top of the filter:
> >> > > > > >
> >> > > > > > Query q=new ConstantScoreQuery(cluCF);
> >> > > > > >
> >> > > > > > Then remove the filter from your search method call and only execute
> >> > > > > > this query.
> >> > > > > >
> >> > > > > > And if you iterate over all results never-ever use Hits! (it's already
> >> > > > > > deprecated). Write a Collector instead (as you are not interested in
> >> > > > > > scoring).
> >> > > > > >
> >> > > > > > And: If you replace a relational database with Lucene, be sure not to
> >> > > > > > think in a relational sense with foreign keys / primary keys and so on.
> >> > > > > > In general you should flatten everything.
> >> > > > > >
> >> > > > > > Uwe
> >> > > > > >
> >> > > > > > -----
> >> > > > > > Uwe Schindler
> >> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > > > > > http://www.thetaphi.de
> >> > > > > > eMail: u...@thetaphi.de
> >> > > > > >
> >> > > > > >
> >> > > > > > > -----Original Message-----
> >> > > > > > > From: Shai Erera [mailto:ser...@gmail.com]
> >> > > > > > > Sent: Monday, November 30, 2009 4:56 PM
> >> > > > > > > To: java-user@lucene.apache.org
> >> > > > > > > Subject: Re: Performance problems with Lucene 2.9
> >> > > > > > >
> >> > > > > > > Hi
> >> > > > > > >
> >> > > > > > > First you can use MatchAllDocsQuery, which matches all documents. It
> >> > > > > > > will save a HUGE posting list (TAG:TAG), and performs much faster. For
> >> > > > > > > example TAG:TAG computes a score for each doc, even though you don't
> >> > > > > > > need it. MatchAllDocsQuery doesn't.
> >> > > > > > >
> >> > > > > > > Second, move away from Hits! :) Use Collectors instead.
> >> > > > > > >
> >> > > > > > > If I understand the chain of filters, do you think you can code them
> >> > > > > > > with a BooleanQuery to which you add BooleanClauses, each with its Term
> >> > > > > > > (field:value)? You can add clauses w/ OR, AND, NOT etc.
> >> > > > > > >
> >> > > > > > > Note that in Lucene 2.9, you can avoid scoring documents very easily,
> >> > > > > > > which is a performance win if you don't need scores (i.e. if you just
> >> > > > > > > want to match everything, not caring for scores).
> >> > > > > > >
> >> > > > > > > Shai
> >> > > > > > >
> >> > > > > > > On Mon, Nov 30, 2009 at 5:47 PM, Michel Nadeau <aka...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > we use Lucene to store around 300 million records.
> >> > > > > > > > We use the index both for conventional searching and for all the
> >> > > > > > > > system's data - we replaced MySQL with Lucene because it was simply
> >> > > > > > > > not working at all with MySQL due to the amount of records. Our
> >> > > > > > > > problem is that we have HUGE performance problems... whenever we
> >> > > > > > > > search, it takes forever to return results, and Java uses 100% CPU/RAM.
> >> > > > > > > >
> >> > > > > > > > Our index fields are like this:
> >> > > > > > > >
> >> > > > > > > > TYPE
> >> > > > > > > > PK
> >> > > > > > > > FOREIGN_PK
> >> > > > > > > > TAG
> >> > > > > > > > ...other information depending on type...
> >> > > > > > > >
> >> > > > > > > > * All fields are Field.Index.UN_TOKENIZED
> >> > > > > > > > * The field "TAG" always contains the value "TAG".
> >> > > > > > > >
> >> > > > > > > > Whenever we search in the index, our query is "TAG:TAG" to match all
> >> > > > > > > > documents, and we do the search like this:
> >> > > > > > > >
> >> > > > > > > > // Search
> >> > > > > > > > Hits h = searcher.search(q, cluCF, cluSort);
> >> > > > > > > >
> >> > > > > > > > cluCF is a ChainedFilter containing all the other filters (like
> >> > > > > > > > FOREIGN_PK=12345, TYPE=a, etc.).
> >> > > > > > > >
> >> > > > > > > > I know that the method is probably crazy because "TAG:TAG" is
> >> > > > > > > > matching all 300M documents and then it applies filters; so that's
> >> > > > > > > > probably why every little query is taking 100% CPU/RAM.... but I
> >> > > > > > > > don't know how to do it properly.
> >> > > > > > > >
> >> > > > > > > > Help! Any advice is welcome.
> >> > > > > > > >
> >> > > > > > > > - Mike
> >> > > > > > > > aka...@gmail.com
> >> > > > > >
> >> > > > > > ---------------------------------------------------------------------
> >> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
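
A note on the "start from" question quoted above, which gets no reply in the messages shown here: the search call used in this thread takes only a hit count n, so one common approach is to request enough hits to cover the wanted page and then skip the earlier ones. The sketch below only models that arithmetic in plain Java and uses no Lucene classes. PagingSketch, hitsToRequest, and page are hypothetical names, and the int array merely stands in for the sorted hits (for example tfd.scoreDocs) that a top-N search returns.

```java
import java.util.Arrays;

/** Plain-Java model of paging over a ranked hit list; no Lucene classes are used. */
final class PagingSketch {

    /** Number of top hits to request so that the given 1-based page is fully covered. */
    static int hitsToRequest(int pageNumber, int pageSize) {
        checkArgs(pageNumber, pageSize);
        return pageNumber * pageSize;
    }

    /**
     * Returns the doc ids that belong to the given 1-based page.
     * rankedDocIds is an assumption: it stands in for the already sorted hits
     * returned by a top-N search. Pages past the end come back empty.
     */
    static int[] page(int[] rankedDocIds, int pageNumber, int pageSize) {
        checkArgs(pageNumber, pageSize);
        int from = Math.min((pageNumber - 1) * pageSize, rankedDocIds.length);
        int to = Math.min(from + pageSize, rankedDocIds.length);
        return Arrays.copyOfRange(rankedDocIds, from, to);
    }

    private static void checkArgs(int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19, 3, 88}; // hypothetical doc ids, already sorted
        System.out.println(Arrays.toString(page(ranked, 2, 2))); // prints [19, 3]
    }
}
```

For page 3 at 200 hits per page, one would request hitsToRequest(3, 200) = 600 hits and keep only the last 200. Deep pages get more expensive under this approach, because every earlier hit is still collected and sorted.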