Re: "Natural sorting" of documents in a Lucene index - possible?

Michel Nadeau Tue, 17 Aug 2010 12:36:33 -0700

Would our approach to limit the search top 250 documents (and then sort
these 250 documents) work fine ? Or even 250 unique terms with a lot of
users is bad on memory when sorting ?


We didn't look at trie fields - I will do though, thanks for the tip !

We do store the original 'Data' field (only the 'SearchableData' field is
analyzed, all other fields are not analyzed), the users mainly sort on
numeric values; not a lot on string values (in fact I could compltely drop
the sort by string feature). We do pad our numbers with zeros though (for
example: 10 becomes 00000010, etc.) because we had trouble with sorting (100
was smaller than 2) ; is that considered as "string sorting" ? This might
explain a part of the problem.

Why/how would I reduce the count of unique terms?


- Mike
[email protected]


On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <[email protected]>wrote:

> If you have tens of millions of documents, almost all with unique fields
> that you're sorting on, you'll chew through memory like there's no
> tomorrow.
>
> Have you looked at trie fields? See:
>
> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
>
> I'm a little concerned that the user can sort on Data. Any field used for
> sorting
> should NOT be analyzed, so unless you are indexing "Data" unanalyzed,
> that's
> a problem. And if you are sorting on strings unique to each document,
> that's
> also a memory hog. Not to mention whether capitalization counts.
>
> You might enumerate the terms in your index for each of the sortable fields
> to figure out what the total number of unique terms each is and use that as
> a basis for reducing their count....
>
> HTH
> Erick
>
> On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <[email protected]> wrote:
>
> > Hi Erick,
> >
> > Here's some more details about our structure. First here's an example of
> > document in our index :
> >
> >     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
> >     DocType           = X
> >     Data              = This is the data
> >     SearchableContent = This is the data
> >     DateCreated       = <timestamp>
> >     DateModified      = <timestamp>
> >     Counter1          = 17
> >     Counter2          = 3
> >     Average           = 0.17
> >     Cost              = 200
> >
> > The users are able to sort on almost all fields: Data, DateCreated,
> > DateModified, Counter1, Counter2, Average, Cost.
> >
> > When we search, we always search on the 'SearchableContent' field and we
> > have at least one filter on the DocType (because we have many document
> > types
> > in the same index). So a common search that would find the document above
> > is
> > "data *AND DocType:X*" (we automatically add the "*AND DocType:X*" part
> > using Lucene Filters.
> >
> > I would say that the number of unique terms in the field being sorted on
> is
> > very big - for example timestamps, almost all unique, counters, average,
> > cost, data... so if a query finds 10M results, it's almost 10M different
> > values to sort. About cache and warm-up queries : we don't use warm-up
> > queries -at all- because we have absolutely no idea of what users are
> going
> > to search for (they can search for absolutely anything). About "returning
> > 10M" documents, right, we don't actually return the 10M documents, we use
> > pagination to return documents X to Y of the 10M (and the 10M was only an
> > example, it can be anywhere between 1K and 100M results). The pagination
> > usually works fine and fast, our problem is really sorting.
> >
> > Our "Lucene Reader" process has 2GB of ram allowed, here's how I start it
> -
> >
> >     java -Xmx2048m -jar LuceneReader.jar
> >
> > The problem really seems to be a ram problem, but I can't be 100% sure
> (any
> > help about how to be sure is welcome).
> >
> > Our current idea of a solution would be to get maximum 250 results (the
> > more
> > relevant ones; more results than that is totally useless in our system)
> so
> > the sort should work fine on a small data set like that, but we want to
> > make
> > sure we're doing everything right before doing that so we don't run in
> the
> > same problems again.
> >
> > Thank you very much; let me know if you need any more details!
> >
> > - Mike
> > [email protected]
> >
> >
> > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <[email protected]
> > >wrote:
> >
> > > Let's back up a minute. The number of matched records is not
> > > important when sorting, what's important is the number of unique
> > > terms in the field being sorted. Do you have any figures on that? One
> > > very common sorting issue is sorting on very fine date time
> resolutions,
> > > although your examples don't include that...
> > >
> > > Now, cache loading is an issue. The very first time you sort on a
> field,
> > > all the values are read into a cache. Is it possible this is the source
> > > of your problems? You can cure this with warmup queries. The take-away
> > > is that measuring the response time for the first sorted query is
> > > very misleading.
> > >
> > > Although if you're sorting on many, many, many email addresses,
> > > that could be "interesting".
> > >
> > > The comment "returning 10,000,000 documents" is, I hope, a
> > > misstatement. If you're trying to *return* 10M docs sorting
> > > is irrelevant compared to assembling that many documents. If
> > > you're trying to return the first 100 of 10M documents, it should
> > > work.
> > >
> > > Overall, we need more details on what you're sorting and what
> > > you're trying to return as well as how you're measuring before
> > > we can say much....
> > >
> > > Along with how much memory you're giving your JVM to work with,
> > > what "exploding" means. Are you CPU bound? IO bound? Swapping?
> > > You need to characterize what is going wrong before worrying about
> > > solutions......
> > >
> > > Best
> > > Erick
> > >
> > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <[email protected]>
> wrote:
> > >
> > > > Hi,
> > > >
> > > > we are building an application using Lucene and we have HUGE data
> sets
> > > (our
> > > > index contains millions and millions and millions of documents),
> which
> > > > obviously cause us very important problems when sorting. In fact, we
> > > > disabled sorting completely because the servers were just exploding
> > when
> > > > trying to sort results in RAM. The users using the system can search
> > for
> > > > whatever they want, so we never know how many results will be
> returned
> > -
> > > a
> > > > search can return 10 documents (no problem with sorting) or
> 10,000,000
> > > > (huge
> > > > sorting problems).
> > > >
> > > > I woke up this morning and had a flash : is it possible with Lucene
> to
> > > have
> > > > a "natural sorting" of documents? For example, let's say I have 3
> > columns
> > > I
> > > > want to be able to sort by : first name, last name, email; I would
> have
> > 3
> > > > different indexes with the very same data but with a different
> primary
> > > key
> > > > for sorting. I know it's far fetched, and I have never seen anything
> > like
> > > > that since I use Lucene, but we're just desperate... how people do to
> > > have
> > > > huge data sets, a lot of users, and sort!?
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > >
> >
>

Re: "Natural sorting" of documents in a Lucene index - possible?

Reply via email to