Re: Performance problems with Lucene 2.9

Michel Nadeau Mon, 30 Nov 2009 09:17:07 -0800

Uwe,

you think that something like this -
TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, 200,
cluSort);


Would be more performant than using MatchAllDocsQuery with Filters like this
-
TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200,
cluSort);

Thanks!

- Mike
[email protected]


On Mon, Nov 30, 2009 at 12:06 PM, Michel Nadeau <[email protected]> wrote:

> I'm currently trying something like this -
>
> TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200,
> cluSort);
>
> cluCF = filters
> cluSort = sorts
>
> Now I have another question... is there a way to specify a "start from" so
> I could get page 2, 3, 4, etc.. ?
>
> - Mike
> [email protected]
>
>
>
> On Mon, Nov 30, 2009 at 12:00 PM, Uwe Schindler <[email protected]> wrote:
>
>> > And sorting is done by the
>> > collector, Lucene has no idea how to sort.
>>
>> Sorting is done by the internal collector behind the
>> Top(Field)Docs-returning method (your own collectors would have to do it
>> themselves). If you call search(Query, n,... Sort), internally an
>> collector
>> is created that does the sorting for you and throws away all results that
>> do
>> not fall into the first 200 hits (if n=200).
>>
>> > If you use Sort, the returned
>> > TopDocs will be sorted.
>> >
>> > If you do not sort at all and do not score your results, TopDocs is not
>> > very
>> > useful, because the first 200 hits cannot be ranked.
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: [email protected]
>> >
>> > > -----Original Message-----
>> > > From: Michel Nadeau [mailto:[email protected]]
>> > > Sent: Monday, November 30, 2009 5:35 PM
>> > > To: [email protected]
>> > > Subject: Re: Performance problems with Lucene 2.9
>> > >
>> > > I'll definitely switch to a Collector.
>> > >
>> > > It's just not clear for me if I should use BooleanQueries or
>> > > MatchAllDocuments+Filters ?
>> > >
>> > > And should I write my own collector or the TopDocs one is perfect for
>> me
>> > ?
>> > >
>> > > - Mike
>> > > [email protected]
>> > >
>> > >
>> > > On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson
>> > > <[email protected]>wrote:
>> > >
>> > > > The problem with hits is that a it re-executes the query
>> > > > every N documents where N is 100 (?).
>> > > >
>> > > > So, a loop like
>> > > > for (int idx : hits.length) {
>> > > >   do something....
>> > > > }
>> > > >
>> > > > Assuming my memory is right and it's every 100, your query will
>> > > > re-execute (length/100) times. Which is unfortunate.
>> > > >
>> > > > The very quick test to see where to concentrate first would be to
>> take
>> > > > a time stamp just before you hit your loop.....
>> > > >
>> > > > This will tell you whether this loop is the culprit, but it really
>> > > doesn't
>> > > > matter because you'll follow the advice from Uwe and Shai anyway
>> <G>.
>> > > >
>> > > > Filtering and Sorting are applied to Collectors before you see
>> > them.....
>> > > >
>> > > > The other bit would be to investigate your sorting. Remember that
>> the
>> > > > first sort or two take quite a while since the relevant caches are
>> > > > populated with first used, so second+ queries should be faster. The
>> > > > Wiki has some timing/speedup advice.....
>> > > >
>> > > > Best
>> > > > Erick
>> > > >
>> > > >
>> > > > On Mon, Nov 30, 2009 at 11:10 AM, Michel Nadeau <[email protected]>
>> > > wrote:
>> > > >
>> > > > > What is the main difference between Hits and Collectors?
>> > > > >
>> > > > > - Mike
>> > > > > [email protected]
>> > > > >
>> > > > >
>> > > > > On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler <[email protected]>
>> > > wrote:
>> > > > >
>> > > > > > And if you only have a filter and apply it to all documents,
>> make
>> > a
>> > > > > > ConstantScoreQuery on top of the filter:
>> > > > > >
>> > > > > > Query q=new ConstantScoreQuery(cluCF);
>> > > > > >
>> > > > > > Then remove the filter from your search method call and only
>> > execute
>> > > > this
>> > > > > > query.
>> > > > > >
>> > > > > > And if you iterate over all results never-ever use Hits! (its
>> > > already
>> > > > > > deprecated). Write a Collector instead (as you are not
>> interested
>> > in
>> > > > > > scoring).
>> > > > > >
>> > > > > > And: If you replace a relational database with Lucene, be sure
>> not
>> > > to
>> > > > > think
>> > > > > > in a relational sense with foreign keys / primary keys and so
>> on.
>> > In
>> > > > > > general
>> > > > > > you should flatten everything.
>> > > > > >
>> > > > > > Uwe
>> > > > > >
>> > > > > > -----
>> > > > > > Uwe Schindler
>> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > > http://www.thetaphi.de
>> > > > > > eMail: [email protected]
>> > > > > >
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Shai Erera [mailto:[email protected]]
>> > > > > > > Sent: Monday, November 30, 2009 4:56 PM
>> > > > > > > To: [email protected]
>> > > > > > > Subject: Re: Performance problems with Lucene 2.9
>> > > > > > >
>> > > > > > > Hi
>> > > > > > >
>> > > > > > > First you can use MatchAllDocsQuery, which matches all
>> > documents.
>> > > It
>> > > > > will
>> > > > > > > save a HUGE posting list (TAG:TAG), and performs much faster.
>> > For
>> > > > > example
>> > > > > > > TAG:TAG computes a score for each doc, even though you don't
>> > need
>> > > it.
>> > > > > > > MatchAllDocsQuery doesn't.
>> > > > > > >
>> > > > > > > Second, move away from Hits ! :) Use Collectors instead.
>> > > > > > >
>> > > > > > > If I understand the chain of filters, do you think you can
>> code
>> > > them
>> > > > > with
>> > > > > > > a
>> > > > > > > BooleanQuery that is added BooleanClauses, each with is Term
>> > > > > > > (field:value)?
>> > > > > > > You can add clauses w/ OR, AND, NOT etc.
>> > > > > > >
>> > > > > > > Note that in Lucene 2.9, you can avoid scoring documents very
>> > > easily,
>> > > > > > > which
>> > > > > > > is a performance win if you don't need scores (i.e. if you
>> just
>> > > want
>> > > > to
>> > > > > > > match everything, not caring for scores).
>> > > > > > >
>> > > > > > > Shai
>> > > > > > >
>> > > > > > > On Mon, Nov 30, 2009 at 5:47 PM, Michel Nadeau
>> > <[email protected]>
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > we use Lucene to store around 300 millions of records. We
>> use
>> > > the
>> > > > > index
>> > > > > > > > both
>> > > > > > > > for conventional searching, but also for all the system's
>> data
>> > -
>> > > we
>> > > > > > > > replaced
>> > > > > > > > MySQL with Lucene because it was simply not working at all
>> > with
>> > > > MySQL
>> > > > > > > due
>> > > > > > > > to
>> > > > > > > > the amount or records. Our problem is that we have HUGE
>> > > performance
>> > > > > > > > problems... whenever we search, it takes forever to return
>> > > results,
>> > > > > and
>> > > > > > > > Java
>> > > > > > > > uses 100% CPU/RAM.
>> > > > > > > >
>> > > > > > > > Our index fields are like this:
>> > > > > > > >
>> > > > > > > > TYPE
>> > > > > > > > PK
>> > > > > > > > FOREIGN_PK
>> > > > > > > > TAG
>> > > > > > > > ...other information depending on type...
>> > > > > > > >
>> > > > > > > > * All fields are Field.Index.UN_TOKENIZED
>> > > > > > > > * The field "TAG" always contains the value "TAG".
>> > > > > > > >
>> > > > > > > > Whenever we search in the index, our query is "TAG:TAG" to
>> > match
>> > > > all
>> > > > > > > > documents, and we do the search like this:
>> > > > > > > >
>> > > > > > > >        // Search
>> > > > > > > >        Hits h = searcher.search(q, cluCF, cluSort);
>> > > > > > > >
>> > > > > > > > cluCF is a ChainedFilter containing all the other filters
>> > (like
>> > > > > > > > FOREIGN_PK=12345, TYPE=a, etc.).
>> > > > > > > >
>> > > > > > > > I know that the method is probably crazy because "TAG:TAG"
>> is
>> > > > > matching
>> > > > > > > all
>> > > > > > > > 300M documents and then it applies filters; so that's
>> probably
>> > > why
>> > > > > > every
>> > > > > > > > little query is taking 100% CPU/RAM.... but I don't know how
>> > to
>> > > do
>> > > > it
>> > > > > > > > properly.
>> > > > > > > >
>> > > > > > > > Help ! Any advice is welcome.
>> > > > > > > >
>> > > > > > > > - Mike
>> > > > > > > > [email protected]
>> > > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> ------------------------------------------------------------------
>> > --
>> > > -
>> > > > > > To unsubscribe, e-mail: [email protected]
>> > > > > > For additional commands, e-mail:
>> [email protected]
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>

Re: Performance problems with Lucene 2.9

Reply via email to