Re: "Natural sorting" of documents in a Lucene index - possible?

Michel Nadeau Wed, 18 Aug 2010 08:27:58 -0700

Thanks !

- Mike
[email protected]



On Wed, Aug 18, 2010 at 10:37 AM, Ian Lea <[email protected]> wrote:

> > But - to come back to my original question... is there any way to have a
> > "natural order" of documents other that the DocId In Lucene?
>
> No.
>
>
> --
> Ian.
>
>
> On Wed, Aug 18, 2010 at 3:21 PM, Michel Nadeau <[email protected]> wrote:
> > Cool, so I'll try these things -
> >
> > * Replace timestamps with YYYYMMDD - will minimize unique terms count;
> > * Use NumericField's for dates and numbers - will remove all string
> sorting.
> > Thanks guys!
> >
> > --
> >
> > But - to come back to my original question... is there any way to have a
> > "natural order" of documents other that the DocId In Lucene? For example,
> is
> > there any way to have an index automatically sorted on a specific field,
> > like :
> >
> > DocId     Count     Data
> > -------------------------------------
> >  5         1       First test
> >  1         3       Otter
> >  8         4       Test
> >  2         8       Aloha
> >  10        11       Zulu
> >  9        17       Bingo
> >  3        46       Alpha test
> >  6       112       Tango
> >  4       120       Charlie test
> >  7       200       Kiwi
> >
> > Notice the DocId and Data random orders, but Count is sorted. That would
> be
> > the 'natural order' in the index, and searching for 'test' would return
> (in
> > that order) :
> >
> > DocId     Count     Data
> > -------------------------------------
> >  5         1       First test
> >  3        46       Alpha test
> >   4       120       Charlie test
> >
> > Already sorted on the Count.
> >
> > Thanks!
> >
> > - Mike
> > [email protected]
> >
> >
> > On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea <[email protected]> wrote:
> >
> >> Using NumericField for dates and other numbers is likely to help a
> >> lot, and removes padding problems.  I'd try that first, or just sort
> >> the top n hits yourself.
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau <[email protected]>
> wrote:
> >> > I could at least drop hours/mins/sec, we don't need them, so my
> timestamp
> >> > could become 'YYYYMMDD', that would cut the number of unique terms at
> >> least
> >> > for dates.
> >> >
> >> > What about my other question about numbers : *" We do pad our numbers
> >> with
> >> > zeros though (for example: 10 becomes 00000010, etc.) because we had
> >> trouble
> >> > with sorting (100 was smaller than 2) ; is that considered as "string
> >> > sorting" ? This might explain a part of the problem."* ? Thanks.
> >> >
> >> > - Mike
> >> > [email protected]
> >> >
> >> >
> >> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson <
> [email protected]
> >> >wrote:
> >> >
> >> >> Hmmm, I glossed over your comment about sorting the top 250. There's
> >> >> no reason that wouldn't work.
> >> >>
> >> >> Well, one way for, say, dates is to store separate fields. YYYY, MM,
> DD,
> >> >> HH, MM, SS, MS. That gives you say, 100 year terms, + 12 month
> >> >> +31 days + .... for a very small total. You pay the price though by
> >> >> having to change your queries and sorts to respect all 6 fields...
> >> >>
> >> >> But I'd only really go there after seeing if other options don't
> work.
> >> >>
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau <[email protected]>
> >> wrote:
> >> >>
> >> >> > Would our approach to limit the search top 250 documents (and then
> >> sort
> >> >> > these 250 documents) work fine ? Or even 250 unique terms with a
> lot
> >> of
> >> >> > users is bad on memory when sorting ?
> >> >> >
> >> >> > We didn't look at trie fields - I will do though, thanks for the
> tip !
> >> >> >
> >> >> > We do store the original 'Data' field (only the 'SearchableData'
> field
> >> is
> >> >> > analyzed, all other fields are not analyzed), the users mainly sort
> on
> >> >> > numeric values; not a lot on string values (in fact I could
> compltely
> >> >> drop
> >> >> > the sort by string feature). We do pad our numbers with zeros
> though
> >> (for
> >> >> > example: 10 becomes 00000010, etc.) because we had trouble with
> >> sorting
> >> >> > (100
> >> >> > was smaller than 2) ; is that considered as "string sorting" ? This
> >> might
> >> >> > explain a part of the problem.
> >> >> >
> >> >> > Why/how would I reduce the count of unique terms?
> >> >> >
> >> >> >
> >> >> > - Mike
> >> >> > [email protected]
> >> >> >
> >> >> >
> >> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <
> >> [email protected]
> >> >> > >wrote:
> >> >> >
> >> >> > > If you have tens of millions of documents, almost all with unique
> >> >> fields
> >> >> > > that you're sorting on, you'll chew through memory like there's
> no
> >> >> > > tomorrow.
> >> >> > >
> >> >> > > Have you looked at trie fields? See:
> >> >> > >
> >> >> > >
> >> >> >
> >> >>
> >>
> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
> >> >> > >
> >> >> > > I'm a little concerned that the user can sort on Data. Any field
> >> used
> >> >> for
> >> >> > > sorting
> >> >> > > should NOT be analyzed, so unless you are indexing "Data"
> >> unanalyzed,
> >> >> > > that's
> >> >> > > a problem. And if you are sorting on strings unique to each
> >> document,
> >> >> > > that's
> >> >> > > also a memory hog. Not to mention whether capitalization counts.
> >> >> > >
> >> >> > > You might enumerate the terms in your index for each of the
> sortable
> >> >> > fields
> >> >> > > to figure out what the total number of unique terms each is and
> use
> >> >> that
> >> >> > as
> >> >> > > a basis for reducing their count....
> >> >> > >
> >> >> > > HTH
> >> >> > > Erick
> >> >> > >
> >> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <[email protected]
> >
> >> >> wrote:
> >> >> > >
> >> >> > > > Hi Erick,
> >> >> > > >
> >> >> > > > Here's some more details about our structure. First here's an
> >> example
> >> >> > of
> >> >> > > > document in our index :
> >> >> > > >
> >> >> > > >     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
> >> >> > > >     DocType           = X
> >> >> > > >     Data              = This is the data
> >> >> > > >     SearchableContent = This is the data
> >> >> > > >     DateCreated       = <timestamp>
> >> >> > > >     DateModified      = <timestamp>
> >> >> > > >     Counter1          = 17
> >> >> > > >     Counter2          = 3
> >> >> > > >     Average           = 0.17
> >> >> > > >     Cost              = 200
> >> >> > > >
> >> >> > > > The users are able to sort on almost all fields: Data,
> >> DateCreated,
> >> >> > > > DateModified, Counter1, Counter2, Average, Cost.
> >> >> > > >
> >> >> > > > When we search, we always search on the 'SearchableContent'
> field
> >> and
> >> >> > we
> >> >> > > > have at least one filter on the DocType (because we have many
> >> >> document
> >> >> > > > types
> >> >> > > > in the same index). So a common search that would find the
> >> document
> >> >> > above
> >> >> > > > is
> >> >> > > > "data *AND DocType:X*" (we automatically add the "*AND
> DocType:X*"
> >> >> part
> >> >> > > > using Lucene Filters.
> >> >> > > >
> >> >> > > > I would say that the number of unique terms in the field being
> >> sorted
> >> >> > on
> >> >> > > is
> >> >> > > > very big - for example timestamps, almost all unique, counters,
> >> >> > average,
> >> >> > > > cost, data... so if a query finds 10M results, it's almost 10M
> >> >> > different
> >> >> > > > values to sort. About cache and warm-up queries : we don't use
> >> >> warm-up
> >> >> > > > queries -at all- because we have absolutely no idea of what
> users
> >> are
> >> >> > > going
> >> >> > > > to search for (they can search for absolutely anything). About
> >> >> > "returning
> >> >> > > > 10M" documents, right, we don't actually return the 10M
> documents,
> >> we
> >> >> > use
> >> >> > > > pagination to return documents X to Y of the 10M (and the 10M
> was
> >> >> only
> >> >> > an
> >> >> > > > example, it can be anywhere between 1K and 100M results). The
> >> >> > pagination
> >> >> > > > usually works fine and fast, our problem is really sorting.
> >> >> > > >
> >> >> > > > Our "Lucene Reader" process has 2GB of ram allowed, here's how
> I
> >> >> start
> >> >> > it
> >> >> > > -
> >> >> > > >
> >> >> > > >     java -Xmx2048m -jar LuceneReader.jar
> >> >> > > >
> >> >> > > > The problem really seems to be a ram problem, but I can't be
> 100%
> >> >> sure
> >> >> > > (any
> >> >> > > > help about how to be sure is welcome).
> >> >> > > >
> >> >> > > > Our current idea of a solution would be to get maximum 250
> results
> >> >> (the
> >> >> > > > more
> >> >> > > > relevant ones; more results than that is totally useless in our
> >> >> system)
> >> >> > > so
> >> >> > > > the sort should work fine on a small data set like that, but we
> >> want
> >> >> to
> >> >> > > > make
> >> >> > > > sure we're doing everything right before doing that so we don't
> >> run
> >> >> in
> >> >> > > the
> >> >> > > > same problems again.
> >> >> > > >
> >> >> > > > Thank you very much; let me know if you need any more details!
> >> >> > > >
> >> >> > > > - Mike
> >> >> > > > [email protected]
> >> >> > > >
> >> >> > > >
> >> >> > > > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <
> >> >> > [email protected]
> >> >> > > > >wrote:
> >> >> > > >
> >> >> > > > > Let's back up a minute. The number of matched records is not
> >> >> > > > > important when sorting, what's important is the number of
> unique
> >> >> > > > > terms in the field being sorted. Do you have any figures on
> >> that?
> >> >> One
> >> >> > > > > very common sorting issue is sorting on very fine date time
> >> >> > > resolutions,
> >> >> > > > > although your examples don't include that...
> >> >> > > > >
> >> >> > > > > Now, cache loading is an issue. The very first time you sort
> on
> >> a
> >> >> > > field,
> >> >> > > > > all the values are read into a cache. Is it possible this is
> the
> >> >> > source
> >> >> > > > > of your problems? You can cure this with warmup queries. The
> >> >> > take-away
> >> >> > > > > is that measuring the response time for the first sorted
> query
> >> is
> >> >> > > > > very misleading.
> >> >> > > > >
> >> >> > > > > Although if you're sorting on many, many, many email
> addresses,
> >> >> > > > > that could be "interesting".
> >> >> > > > >
> >> >> > > > > The comment "returning 10,000,000 documents" is, I hope, a
> >> >> > > > > misstatement. If you're trying to *return* 10M docs sorting
> >> >> > > > > is irrelevant compared to assembling that many documents. If
> >> >> > > > > you're trying to return the first 100 of 10M documents, it
> >> should
> >> >> > > > > work.
> >> >> > > > >
> >> >> > > > > Overall, we need more details on what you're sorting and what
> >> >> > > > > you're trying to return as well as how you're measuring
> before
> >> >> > > > > we can say much....
> >> >> > > > >
> >> >> > > > > Along with how much memory you're giving your JVM to work
> with,
> >> >> > > > > what "exploding" means. Are you CPU bound? IO bound?
> Swapping?
> >> >> > > > > You need to characterize what is going wrong before worrying
> >> about
> >> >> > > > > solutions......
> >> >> > > > >
> >> >> > > > > Best
> >> >> > > > > Erick
> >> >> > > > >
> >> >> > > > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <
> >> [email protected]>
> >> >> > > wrote:
> >> >> > > > >
> >> >> > > > > > Hi,
> >> >> > > > > >
> >> >> > > > > > we are building an application using Lucene and we have
> HUGE
> >> data
> >> >> > > sets
> >> >> > > > > (our
> >> >> > > > > > index contains millions and millions and millions of
> >> documents),
> >> >> > > which
> >> >> > > > > > obviously cause us very important problems when sorting. In
> >> fact,
> >> >> > we
> >> >> > > > > > disabled sorting completely because the servers were just
> >> >> exploding
> >> >> > > > when
> >> >> > > > > > trying to sort results in RAM. The users using the system
> can
> >> >> > search
> >> >> > > > for
> >> >> > > > > > whatever they want, so we never know how many results will
> be
> >> >> > > returned
> >> >> > > > -
> >> >> > > > > a
> >> >> > > > > > search can return 10 documents (no problem with sorting) or
> >> >> > > 10,000,000
> >> >> > > > > > (huge
> >> >> > > > > > sorting problems).
> >> >> > > > > >
> >> >> > > > > > I woke up this morning and had a flash : is it possible
> with
> >> >> Lucene
> >> >> > > to
> >> >> > > > > have
> >> >> > > > > > a "natural sorting" of documents? For example, let's say I
> >> have 3
> >> >> > > > columns
> >> >> > > > > I
> >> >> > > > > > want to be able to sort by : first name, last name, email;
> I
> >> >> would
> >> >> > > have
> >> >> > > > 3
> >> >> > > > > > different indexes with the very same data but with a
> different
> >> >> > > primary
> >> >> > > > > key
> >> >> > > > > > for sorting. I know it's far fetched, and I have never seen
> >> >> > anything
> >> >> > > > like
> >> >> > > > > > that since I use Lucene, but we're just desperate... how
> >> people
> >> >> do
> >> >> > to
> >> >> > > > > have
> >> >> > > > > > huge data sets, a lot of users, and sort!?
> >> >> > > > > >
> >> >> > > > > > Thanks,
> >> >> > > > > >
> >> >> > > > > > Mike
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: "Natural sorting" of documents in a Lucene index - possible?

Reply via email to