Re: "Natural sorting" of documents in a Lucene index - possible?

Michel Nadeau Wed, 18 Aug 2010 09:23:01 -0700

Can you guys tell me more about "warm up queries" strategies ?

I know that once you made one query, the second time is super quick because
it's in cache - but how can you do warm up queries when you don't know what
users are going to search ?


- Mike
aka...@gmail.com


On Wed, Aug 18, 2010 at 11:26 AM, Michel Nadeau <aka...@gmail.com> wrote:

> Thanks !
>
> - Mike
> aka...@gmail.com
>
>
> On Wed, Aug 18, 2010 at 10:37 AM, Ian Lea <ian....@gmail.com> wrote:
>
>> > But - to come back to my original question... is there any way to have a
>> > "natural order" of documents other that the DocId In Lucene?
>>
>> No.
>>
>>
>> --
>> Ian.
>>
>>
>> On Wed, Aug 18, 2010 at 3:21 PM, Michel Nadeau <aka...@gmail.com> wrote:
>> > Cool, so I'll try these things -
>> >
>> > * Replace timestamps with YYYYMMDD - will minimize unique terms count;
>> > * Use NumericField's for dates and numbers - will remove all string
>> sorting.
>> > Thanks guys!
>> >
>> > --
>> >
>> > But - to come back to my original question... is there any way to have a
>> > "natural order" of documents other that the DocId In Lucene? For
>> example, is
>> > there any way to have an index automatically sorted on a specific field,
>> > like :
>> >
>> > DocId     Count     Data
>> > -------------------------------------
>> >  5         1       First test
>> >  1         3       Otter
>> >  8         4       Test
>> >  2         8       Aloha
>> >  10        11       Zulu
>> >  9        17       Bingo
>> >  3        46       Alpha test
>> >  6       112       Tango
>> >  4       120       Charlie test
>> >  7       200       Kiwi
>> >
>> > Notice the DocId and Data random orders, but Count is sorted. That would
>> be
>> > the 'natural order' in the index, and searching for 'test' would return
>> (in
>> > that order) :
>> >
>> > DocId     Count     Data
>> > -------------------------------------
>> >  5         1       First test
>> >  3        46       Alpha test
>> >   4       120       Charlie test
>> >
>> > Already sorted on the Count.
>> >
>> > Thanks!
>> >
>> > - Mike
>> > aka...@gmail.com
>> >
>> >
>> > On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea <ian....@gmail.com> wrote:
>> >
>> >> Using NumericField for dates and other numbers is likely to help a
>> >> lot, and removes padding problems.  I'd try that first, or just sort
>> >> the top n hits yourself.
>> >>
>> >>
>> >> --
>> >> Ian.
>> >>
>> >>
>> >> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau <aka...@gmail.com>
>> wrote:
>> >> > I could at least drop hours/mins/sec, we don't need them, so my
>> timestamp
>> >> > could become 'YYYYMMDD', that would cut the number of unique terms at
>> >> least
>> >> > for dates.
>> >> >
>> >> > What about my other question about numbers : *" We do pad our numbers
>> >> with
>> >> > zeros though (for example: 10 becomes 00000010, etc.) because we had
>> >> trouble
>> >> > with sorting (100 was smaller than 2) ; is that considered as "string
>> >> > sorting" ? This might explain a part of the problem."* ? Thanks.
>> >> >
>> >> > - Mike
>> >> > aka...@gmail.com
>> >> >
>> >> >
>> >> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson <
>> erickerick...@gmail.com
>> >> >wrote:
>> >> >
>> >> >> Hmmm, I glossed over your comment about sorting the top 250. There's
>> >> >> no reason that wouldn't work.
>> >> >>
>> >> >> Well, one way for, say, dates is to store separate fields. YYYY, MM,
>> DD,
>> >> >> HH, MM, SS, MS. That gives you say, 100 year terms, + 12 month
>> >> >> +31 days + .... for a very small total. You pay the price though by
>> >> >> having to change your queries and sorts to respect all 6 fields...
>> >> >>
>> >> >> But I'd only really go there after seeing if other options don't
>> work.
>> >> >>
>> >> >>
>> >> >> Best
>> >> >> Erick
>> >> >>
>> >> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau <aka...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >> > Would our approach to limit the search top 250 documents (and then
>> >> sort
>> >> >> > these 250 documents) work fine ? Or even 250 unique terms with a
>> lot
>> >> of
>> >> >> > users is bad on memory when sorting ?
>> >> >> >
>> >> >> > We didn't look at trie fields - I will do though, thanks for the
>> tip !
>> >> >> >
>> >> >> > We do store the original 'Data' field (only the 'SearchableData'
>> field
>> >> is
>> >> >> > analyzed, all other fields are not analyzed), the users mainly
>> sort on
>> >> >> > numeric values; not a lot on string values (in fact I could
>> compltely
>> >> >> drop
>> >> >> > the sort by string feature). We do pad our numbers with zeros
>> though
>> >> (for
>> >> >> > example: 10 becomes 00000010, etc.) because we had trouble with
>> >> sorting
>> >> >> > (100
>> >> >> > was smaller than 2) ; is that considered as "string sorting" ?
>> This
>> >> might
>> >> >> > explain a part of the problem.
>> >> >> >
>> >> >> > Why/how would I reduce the count of unique terms?
>> >> >> >
>> >> >> >
>> >> >> > - Mike
>> >> >> > aka...@gmail.com
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <
>> >> erickerick...@gmail.com
>> >> >> > >wrote:
>> >> >> >
>> >> >> > > If you have tens of millions of documents, almost all with
>> unique
>> >> >> fields
>> >> >> > > that you're sorting on, you'll chew through memory like there's
>> no
>> >> >> > > tomorrow.
>> >> >> > >
>> >> >> > > Have you looked at trie fields? See:
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
>> >> >> > >
>> >> >> > > I'm a little concerned that the user can sort on Data. Any field
>> >> used
>> >> >> for
>> >> >> > > sorting
>> >> >> > > should NOT be analyzed, so unless you are indexing "Data"
>> >> unanalyzed,
>> >> >> > > that's
>> >> >> > > a problem. And if you are sorting on strings unique to each
>> >> document,
>> >> >> > > that's
>> >> >> > > also a memory hog. Not to mention whether capitalization counts.
>> >> >> > >
>> >> >> > > You might enumerate the terms in your index for each of the
>> sortable
>> >> >> > fields
>> >> >> > > to figure out what the total number of unique terms each is and
>> use
>> >> >> that
>> >> >> > as
>> >> >> > > a basis for reducing their count....
>> >> >> > >
>> >> >> > > HTH
>> >> >> > > Erick
>> >> >> > >
>> >> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <
>> aka...@gmail.com>
>> >> >> wrote:
>> >> >> > >
>> >> >> > > > Hi Erick,
>> >> >> > > >
>> >> >> > > > Here's some more details about our structure. First here's an
>> >> example
>> >> >> > of
>> >> >> > > > document in our index :
>> >> >> > > >
>> >> >> > > >     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
>> >> >> > > >     DocType           = X
>> >> >> > > >     Data              = This is the data
>> >> >> > > >     SearchableContent = This is the data
>> >> >> > > >     DateCreated       = <timestamp>
>> >> >> > > >     DateModified      = <timestamp>
>> >> >> > > >     Counter1          = 17
>> >> >> > > >     Counter2          = 3
>> >> >> > > >     Average           = 0.17
>> >> >> > > >     Cost              = 200
>> >> >> > > >
>> >> >> > > > The users are able to sort on almost all fields: Data,
>> >> DateCreated,
>> >> >> > > > DateModified, Counter1, Counter2, Average, Cost.
>> >> >> > > >
>> >> >> > > > When we search, we always search on the 'SearchableContent'
>> field
>> >> and
>> >> >> > we
>> >> >> > > > have at least one filter on the DocType (because we have many
>> >> >> document
>> >> >> > > > types
>> >> >> > > > in the same index). So a common search that would find the
>> >> document
>> >> >> > above
>> >> >> > > > is
>> >> >> > > > "data *AND DocType:X*" (we automatically add the "*AND
>> DocType:X*"
>> >> >> part
>> >> >> > > > using Lucene Filters.
>> >> >> > > >
>> >> >> > > > I would say that the number of unique terms in the field being
>> >> sorted
>> >> >> > on
>> >> >> > > is
>> >> >> > > > very big - for example timestamps, almost all unique,
>> counters,
>> >> >> > average,
>> >> >> > > > cost, data... so if a query finds 10M results, it's almost 10M
>> >> >> > different
>> >> >> > > > values to sort. About cache and warm-up queries : we don't use
>> >> >> warm-up
>> >> >> > > > queries -at all- because we have absolutely no idea of what
>> users
>> >> are
>> >> >> > > going
>> >> >> > > > to search for (they can search for absolutely anything). About
>> >> >> > "returning
>> >> >> > > > 10M" documents, right, we don't actually return the 10M
>> documents,
>> >> we
>> >> >> > use
>> >> >> > > > pagination to return documents X to Y of the 10M (and the 10M
>> was
>> >> >> only
>> >> >> > an
>> >> >> > > > example, it can be anywhere between 1K and 100M results). The
>> >> >> > pagination
>> >> >> > > > usually works fine and fast, our problem is really sorting.
>> >> >> > > >
>> >> >> > > > Our "Lucene Reader" process has 2GB of ram allowed, here's how
>> I
>> >> >> start
>> >> >> > it
>> >> >> > > -
>> >> >> > > >
>> >> >> > > >     java -Xmx2048m -jar LuceneReader.jar
>> >> >> > > >
>> >> >> > > > The problem really seems to be a ram problem, but I can't be
>> 100%
>> >> >> sure
>> >> >> > > (any
>> >> >> > > > help about how to be sure is welcome).
>> >> >> > > >
>> >> >> > > > Our current idea of a solution would be to get maximum 250
>> results
>> >> >> (the
>> >> >> > > > more
>> >> >> > > > relevant ones; more results than that is totally useless in
>> our
>> >> >> system)
>> >> >> > > so
>> >> >> > > > the sort should work fine on a small data set like that, but
>> we
>> >> want
>> >> >> to
>> >> >> > > > make
>> >> >> > > > sure we're doing everything right before doing that so we
>> don't
>> >> run
>> >> >> in
>> >> >> > > the
>> >> >> > > > same problems again.
>> >> >> > > >
>> >> >> > > > Thank you very much; let me know if you need any more details!
>> >> >> > > >
>> >> >> > > > - Mike
>> >> >> > > > aka...@gmail.com
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <
>> >> >> > erickerick...@gmail.com
>> >> >> > > > >wrote:
>> >> >> > > >
>> >> >> > > > > Let's back up a minute. The number of matched records is not
>> >> >> > > > > important when sorting, what's important is the number of
>> unique
>> >> >> > > > > terms in the field being sorted. Do you have any figures on
>> >> that?
>> >> >> One
>> >> >> > > > > very common sorting issue is sorting on very fine date time
>> >> >> > > resolutions,
>> >> >> > > > > although your examples don't include that...
>> >> >> > > > >
>> >> >> > > > > Now, cache loading is an issue. The very first time you sort
>> on
>> >> a
>> >> >> > > field,
>> >> >> > > > > all the values are read into a cache. Is it possible this is
>> the
>> >> >> > source
>> >> >> > > > > of your problems? You can cure this with warmup queries. The
>> >> >> > take-away
>> >> >> > > > > is that measuring the response time for the first sorted
>> query
>> >> is
>> >> >> > > > > very misleading.
>> >> >> > > > >
>> >> >> > > > > Although if you're sorting on many, many, many email
>> addresses,
>> >> >> > > > > that could be "interesting".
>> >> >> > > > >
>> >> >> > > > > The comment "returning 10,000,000 documents" is, I hope, a
>> >> >> > > > > misstatement. If you're trying to *return* 10M docs sorting
>> >> >> > > > > is irrelevant compared to assembling that many documents. If
>> >> >> > > > > you're trying to return the first 100 of 10M documents, it
>> >> should
>> >> >> > > > > work.
>> >> >> > > > >
>> >> >> > > > > Overall, we need more details on what you're sorting and
>> what
>> >> >> > > > > you're trying to return as well as how you're measuring
>> before
>> >> >> > > > > we can say much....
>> >> >> > > > >
>> >> >> > > > > Along with how much memory you're giving your JVM to work
>> with,
>> >> >> > > > > what "exploding" means. Are you CPU bound? IO bound?
>> Swapping?
>> >> >> > > > > You need to characterize what is going wrong before worrying
>> >> about
>> >> >> > > > > solutions......
>> >> >> > > > >
>> >> >> > > > > Best
>> >> >> > > > > Erick
>> >> >> > > > >
>> >> >> > > > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <
>> >> aka...@gmail.com>
>> >> >> > > wrote:
>> >> >> > > > >
>> >> >> > > > > > Hi,
>> >> >> > > > > >
>> >> >> > > > > > we are building an application using Lucene and we have
>> HUGE
>> >> data
>> >> >> > > sets
>> >> >> > > > > (our
>> >> >> > > > > > index contains millions and millions and millions of
>> >> documents),
>> >> >> > > which
>> >> >> > > > > > obviously cause us very important problems when sorting.
>> In
>> >> fact,
>> >> >> > we
>> >> >> > > > > > disabled sorting completely because the servers were just
>> >> >> exploding
>> >> >> > > > when
>> >> >> > > > > > trying to sort results in RAM. The users using the system
>> can
>> >> >> > search
>> >> >> > > > for
>> >> >> > > > > > whatever they want, so we never know how many results will
>> be
>> >> >> > > returned
>> >> >> > > > -
>> >> >> > > > > a
>> >> >> > > > > > search can return 10 documents (no problem with sorting)
>> or
>> >> >> > > 10,000,000
>> >> >> > > > > > (huge
>> >> >> > > > > > sorting problems).
>> >> >> > > > > >
>> >> >> > > > > > I woke up this morning and had a flash : is it possible
>> with
>> >> >> Lucene
>> >> >> > > to
>> >> >> > > > > have
>> >> >> > > > > > a "natural sorting" of documents? For example, let's say I
>> >> have 3
>> >> >> > > > columns
>> >> >> > > > > I
>> >> >> > > > > > want to be able to sort by : first name, last name, email;
>> I
>> >> >> would
>> >> >> > > have
>> >> >> > > > 3
>> >> >> > > > > > different indexes with the very same data but with a
>> different
>> >> >> > > primary
>> >> >> > > > > key
>> >> >> > > > > > for sorting. I know it's far fetched, and I have never
>> seen
>> >> >> > anything
>> >> >> > > > like
>> >> >> > > > > > that since I use Lucene, but we're just desperate... how
>> >> people
>> >> >> do
>> >> >> > to
>> >> >> > > > > have
>> >> >> > > > > > huge data sets, a lot of users, and sort!?
>> >> >> > > > > >
>> >> >> > > > > > Thanks,
>> >> >> > > > > >
>> >> >> > > > > > Mike
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

Re: "Natural sorting" of documents in a Lucene index - possible?

Reply via email to