RE: Performance problems with Lucene 2.9

Uwe Schindler Mon, 30 Nov 2009 09:31:21 -0800

> you think that something like this -
> TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, 200, cluSort);


This is a little bit faster, as it does not need to intersect the iterator over
all documents with the filter's iterator.

> Would be more performant than using MatchAllDocsQuery with Filters like this -
> TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200, cluSort);

Slower, as the iterator on all document ids needs to be intersected with the
filtered iterator.

It's just one step more.
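
To illustrate that extra step, here is a rough plain-Java model of the two
cases. It is only a sketch and not Lucene's actual iterator code: the class
and method names are made up for this example, and the filter is modelled as
a sorted array of doc ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not part of Lucene.
class IntersectionSketch {

    // Case 1 (ConstantScoreQuery over the filter): walk the filter's
    // sorted doc ids directly; every id it yields is a hit.
    static List<Integer> filterOnly(int[] filterDocs) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc : filterDocs) {
            hits.add(doc);
        }
        return hits;
    }

    // Case 2 (MatchAllDocsQuery plus filter): a second iterator over
    // doc ids 0..maxDoc-1 has to be kept in step with the filter.
    static List<Integer> matchAllWithFilter(int maxDoc, int[] filterDocs) {
        List<Integer> hits = new ArrayList<Integer>();
        int all = 0; // current doc id of the match-all iterator
        int i = 0;   // current position in the filter's doc id array
        while (all < maxDoc && i < filterDocs.length) {
            int f = filterDocs[i];
            if (all < f) {
                all = f;   // the extra step: advance the match-all iterator
            } else if (all > f) {
                i++;       // advance the filter iterator
            } else {
                hits.add(all); // both iterators agree: this doc is a hit
                all++;
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] filterDocs = {3, 7, 8, 42};
        System.out.println(filterOnly(filterDocs));
        System.out.println(matchAllWithFilter(100, filterDocs));
    }
}
```

Both methods return the same hits; the second one just does additional
bookkeeping per hit to get there.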

> Thanks!
> 
> - Mike
> [email protected]
> 
> 
> On Mon, Nov 30, 2009 at 12:06 PM, Michel Nadeau <[email protected]> wrote:
> 
> > I'm currently trying something like this -
> >
> > TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200,
> > cluSort);
> >
> > cluCF = filters
> > cluSort = sorts
> >
> > Now I have another question... is there a way to specify a "start from" so
> > I could get page 2, 3, 4, etc.?
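
One common way to page with a top-N search is to ask for start + pageSize
hits and then skip the first "start" entries of the returned array (in
Lucene that array would be tfd.scoreDocs). A minimal plain-Java sketch of
the slicing, assuming an int array as a stand-in for the sorted hits; the
class and method names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
class PagingSketch {

    // topHits stands in for the sorted hit array of a top-N search that
    // was asked for at least start + pageSize hits.
    static int[] page(int[] topHits, int start, int pageSize) {
        if (start < 0 || pageSize < 0) {
            throw new IllegalArgumentException("start and pageSize must be >= 0");
        }
        // Clamp both ends so a page past the last hit is simply empty.
        int from = Math.min(start, topHits.length);
        int to = (int) Math.min((long) from + pageSize, (long) topHits.length);
        return Arrays.copyOfRange(topHits, from, to);
    }

    public static void main(String[] args) {
        int[] topHits = {10, 11, 12, 13, 14};
        // Second page with a page size of 2.
        System.out.println(Arrays.toString(page(topHits, 2, 2)));
    }
}
```

Note that deep pages get more expensive with this approach, because the
search still has to collect all start + pageSize hits each time.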
> >
> > - Mike
> > [email protected]
> >
> >
> >
> > On Mon, Nov 30, 2009 at 12:00 PM, Uwe Schindler <[email protected]> wrote:
> >
> >> > And sorting is done by the
> >> > collector, Lucene has no idea how to sort.
> >>
> >> Sorting is done by the internal collector behind the
> >> Top(Field)Docs-returning method (your own collectors would have to do it
> >> themselves). If you call search(Query, n,... Sort), internally a collector
> >> is created that does the sorting for you and throws away all results that
> >> do not fall into the first 200 hits (if n=200).
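
The "keep only the first n" idea can be sketched with a bounded priority
queue. This is a simplified plain-Java model and not Lucene's collector
code: it keeps only an int sort value per hit (no doc ids, no multi-field
sort), and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model of a top-n collector, not part of Lucene.
class TopNSketch {
    private final int n;
    // Max-heap: the worst (largest) of the kept values sits on top.
    private final PriorityQueue<Integer> worstFirst;

    TopNSketch(int n) {
        this.n = n;
        this.worstFirst = new PriorityQueue<Integer>(
                Math.max(1, n), Collections.<Integer>reverseOrder());
    }

    // Called once per matching document with its sort value.
    void collect(int sortValue) {
        if (n == 0) {
            return;
        }
        if (worstFirst.size() < n) {
            worstFirst.add(sortValue);
        } else if (sortValue < worstFirst.peek()) {
            worstFirst.poll();         // throw away the current worst hit
            worstFirst.add(sortValue);
        }
        // Otherwise the hit does not make the top n and is dropped.
    }

    // The kept hits in ascending sort order.
    List<Integer> topSorted() {
        List<Integer> out = new ArrayList<Integer>(worstFirst);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        TopNSketch top = new TopNSketch(3);
        for (int v : new int[] {5, 1, 9, 3, 7}) {
            top.collect(v);
        }
        System.out.println(top.topSorted());
    }
}
```

Memory stays bounded by n no matter how many documents match, which is why
asking for the top 200 is far cheaper than iterating over every hit.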
> >>
> >> > If you use Sort, the returned
> >> > TopDocs will be sorted.
> >> >
> >> > If you do not sort at all and do not score your results, TopDocs is not
> >> > very useful, because the first 200 hits cannot be ranked.
> >> >
> >> > -----
> >> > Uwe Schindler
> >> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: [email protected]
> >> >
> >> > > -----Original Message-----
> >> > > From: Michel Nadeau [mailto:[email protected]]
> >> > > Sent: Monday, November 30, 2009 5:35 PM
> >> > > To: [email protected]
> >> > > Subject: Re: Performance problems with Lucene 2.9
> >> > >
> >> > > I'll definitely switch to a Collector.
> >> > >
> >> > > It's just not clear to me if I should use BooleanQueries or
> >> > > MatchAllDocuments+Filters ?
> >> > >
> >> > > And should I write my own collector, or is the TopDocs one perfect for
> >> > > me?
> >> > >
> >> > > - Mike
> >> > > [email protected]
> >> > >
> >> > >
> >> > > On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson <[email protected]> wrote:
> >> > >
> >> > > > The problem with Hits is that it re-executes the query
> >> > > > every N documents where N is 100 (?).
> >> > > >
> >> > > > So, a loop like
> >> > > > for (int idx = 0; idx < hits.length(); idx++) {
> >> > > >   do something....
> >> > > > }
> >> > > >
> >> > > > Assuming my memory is right and it's every 100, your query will
> >> > > > re-execute (length/100) times. Which is unfortunate.
> >> > > >
> >> > > > The very quick test to see where to concentrate first would be to
> >> > > > take a time stamp just before you hit your loop.....
> >> > > >
> >> > > > This will tell you whether this loop is the culprit, but it really
> >> > > > doesn't matter because you'll follow the advice from Uwe and Shai
> >> > > > anyway <G>.
> >> > > >
> >> > > > Filtering and Sorting are applied to Collectors before you see
> >> > > > them.....
> >> > > >
> >> > > > The other bit would be to investigate your sorting. Remember that
> >> > > > the first sort or two take quite a while since the relevant caches
> >> > > > are populated when first used, so second+ queries should be faster.
> >> > > > The Wiki has some timing/speedup advice.....
> >> > > >
> >> > > > Best
> >> > > > Erick
> >> > > >
> >> > > >
> >> > > > On Mon, Nov 30, 2009 at 11:10 AM, Michel Nadeau <[email protected]> wrote:
> >> > > >
> >> > > > > What is the main difference between Hits and Collectors?
> >> > > > >
> >> > > > > - Mike
> >> > > > > [email protected]
> >> > > > >
> >> > > > >
> >> > > > > On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler <[email protected]> wrote:
> >> > > > >
> >> > > > > > And if you only have a filter and apply it to all documents,
> >> > > > > > make a ConstantScoreQuery on top of the filter:
> >> > > > > >
> >> > > > > > Query q=new ConstantScoreQuery(cluCF);
> >> > > > > >
> >> > > > > > Then remove the filter from your search method call and only
> >> > > > > > execute this query.
> >> > > > > >
> >> > > > > > And if you iterate over all results never-ever use Hits! (it's
> >> > > > > > already deprecated). Write a Collector instead (as you are not
> >> > > > > > interested in scoring).
> >> > > > > >
> >> > > > > > And: If you replace a relational database with Lucene, be sure
> >> > > > > > not to think in a relational sense with foreign keys / primary
> >> > > > > > keys and so on. In general you should flatten everything.
> >> > > > > >
> >> > > > > > Uwe
> >> > > > > >
> >> > > > > > -----
> >> > > > > > Uwe Schindler
> >> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > > > > > http://www.thetaphi.de
> >> > > > > > eMail: [email protected]
> >> > > > > >
> >> > > > > >
> >> > > > > > > -----Original Message-----
> >> > > > > > > From: Shai Erera [mailto:[email protected]]
> >> > > > > > > Sent: Monday, November 30, 2009 4:56 PM
> >> > > > > > > To: [email protected]
> >> > > > > > > Subject: Re: Performance problems with Lucene 2.9
> >> > > > > > >
> >> > > > > > > Hi
> >> > > > > > >
> >> > > > > > > First you can use MatchAllDocsQuery, which matches all
> >> > > > > > > documents. It will save a HUGE posting list (TAG:TAG), and
> >> > > > > > > performs much faster. For example TAG:TAG computes a score for
> >> > > > > > > each doc, even though you don't need it. MatchAllDocsQuery
> >> > > > > > > doesn't.
> >> > > > > > >
> >> > > > > > > Second, move away from Hits ! :) Use Collectors instead.
> >> > > > > > >
> >> > > > > > > If I understand the chain of filters, do you think you can
> >> > > > > > > code them with a BooleanQuery to which you add BooleanClauses,
> >> > > > > > > each with its Term (field:value)?
> >> > > > > > > You can add clauses w/ OR, AND, NOT etc.
> >> > > > > > >
> >> > > > > > > Note that in Lucene 2.9, you can avoid scoring documents very
> >> > > > > > > easily, which is a performance win if you don't need scores
> >> > > > > > > (i.e. if you just want to match everything, not caring for
> >> > > > > > > scores).
> >> > > > > > >
> >> > > > > > > Shai
> >> > > > > > >
> >> > > > > > > On Mon, Nov 30, 2009 at 5:47 PM, Michel Nadeau <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > we use Lucene to store around 300 million records. We use
> >> > > > > > > > the index both for conventional searching and for all the
> >> > > > > > > > system's data - we replaced MySQL with Lucene because it was
> >> > > > > > > > simply not working at all with MySQL due to the number of
> >> > > > > > > > records. Our problem is that we have HUGE performance
> >> > > > > > > > problems... whenever we search, it takes forever to return
> >> > > > > > > > results, and Java uses 100% CPU/RAM.
> >> > > > > > > >
> >> > > > > > > > Our index fields are like this:
> >> > > > > > > >
> >> > > > > > > > TYPE
> >> > > > > > > > PK
> >> > > > > > > > FOREIGN_PK
> >> > > > > > > > TAG
> >> > > > > > > > ...other information depending on type...
> >> > > > > > > >
> >> > > > > > > > * All fields are Field.Index.UN_TOKENIZED
> >> > > > > > > > * The field "TAG" always contains the value "TAG".
> >> > > > > > > >
> >> > > > > > > > Whenever we search in the index, our query is "TAG:TAG" to
> >> > > > > > > > match all documents, and we do the search like this:
> >> > > > > > > >
> >> > > > > > > >        // Search
> >> > > > > > > >        Hits h = searcher.search(q, cluCF, cluSort);
> >> > > > > > > >
> >> > > > > > > > cluCF is a ChainedFilter containing all the other filters
> >> > > > > > > > (like FOREIGN_PK=12345, TYPE=a, etc.).
> >> > > > > > > >
> >> > > > > > > > I know that the method is probably crazy because "TAG:TAG"
> >> > > > > > > > is matching all 300M documents and then it applies filters;
> >> > > > > > > > so that's probably why every little query is taking 100%
> >> > > > > > > > CPU/RAM.... but I don't know how to do it properly.
> >> > > > > > > >
> >> > > > > > > > Help ! Any advice is welcome.
> >> > > > > > > >
> >> > > > > > > > - Mike
> >> > > > > > > > [email protected]
> >> > > > > > > >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
