Can you guys tell me more about "warm up queries" strategies ? I know that once you made one query, the second time is super quick because it's in cache - but how can you do warm up queries when you don't know what users are going to search ?
- Mike aka...@gmail.com On Wed, Aug 18, 2010 at 11:26 AM, Michel Nadeau <aka...@gmail.com> wrote: > Thanks ! > > - Mike > aka...@gmail.com > > > On Wed, Aug 18, 2010 at 10:37 AM, Ian Lea <ian....@gmail.com> wrote: > >> > But - to come back to my original question... is there any way to have a >> > "natural order" of documents other that the DocId In Lucene? >> >> No. >> >> >> -- >> Ian. >> >> >> On Wed, Aug 18, 2010 at 3:21 PM, Michel Nadeau <aka...@gmail.com> wrote: >> > Cool, so I'll try these things - >> > >> > * Replace timestamps with YYYYMMDD - will minimize unique terms count; >> > * Use NumericField's for dates and numbers - will remove all string >> sorting. >> > Thanks guys! >> > >> > -- >> > >> > But - to come back to my original question... is there any way to have a >> > "natural order" of documents other that the DocId In Lucene? For >> example, is >> > there any way to have an index automatically sorted on a specific field, >> > like : >> > >> > DocId Count Data >> > ------------------------------------- >> > 5 1 First test >> > 1 3 Otter >> > 8 4 Test >> > 2 8 Aloha >> > 10 11 Zulu >> > 9 17 Bingo >> > 3 46 Alpha test >> > 6 112 Tango >> > 4 120 Charlie test >> > 7 200 Kiwi >> > >> > Notice the DocId and Data random orders, but Count is sorted. That would >> be >> > the 'natural order' in the index, and searching for 'test' would return >> (in >> > that order) : >> > >> > DocId Count Data >> > ------------------------------------- >> > 5 1 First test >> > 3 46 Alpha test >> > 4 120 Charlie test >> > >> > Already sorted on the Count. >> > >> > Thanks! >> > >> > - Mike >> > aka...@gmail.com >> > >> > >> > On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea <ian....@gmail.com> wrote: >> > >> >> Using NumericField for dates and other numbers is likely to help a >> >> lot, and removes padding problems. I'd try that first, or just sort >> >> the top n hits yourself. >> >> >> >> >> >> -- >> >> Ian. >> >> >> >> >> >> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau <aka...@gmail.com> >> wrote: >> >> > I could at least drop hours/mins/sec, we don't need them, so my >> timestamp >> >> > could become 'YYYYMMDD', that would cut the number of unique terms at >> >> least >> >> > for dates. >> >> > >> >> > What about my other question about numbers : *" We do pad our numbers >> >> with >> >> > zeros though (for example: 10 becomes 00000010, etc.) because we had >> >> trouble >> >> > with sorting (100 was smaller than 2) ; is that considered as "string >> >> > sorting" ? This might explain a part of the problem."* ? Thanks. >> >> > >> >> > - Mike >> >> > aka...@gmail.com >> >> > >> >> > >> >> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson < >> erickerick...@gmail.com >> >> >wrote: >> >> > >> >> >> Hmmm, I glossed over your comment about sorting the top 250. There's >> >> >> no reason that wouldn't work. >> >> >> >> >> >> Well, one way for, say, dates is to store separate fields. YYYY, MM, >> DD, >> >> >> HH, MM, SS, MS. That gives you say, 100 year terms, + 12 month >> >> >> +31 days + .... for a very small total. You pay the price though by >> >> >> having to change your queries and sorts to respect all 6 fields... >> >> >> >> >> >> But I'd only really go there after seeing if other options don't >> work. >> >> >> >> >> >> >> >> >> Best >> >> >> Erick >> >> >> >> >> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau <aka...@gmail.com> >> >> wrote: >> >> >> >> >> >> > Would our approach to limit the search top 250 documents (and then >> >> sort >> >> >> > these 250 documents) work fine ? Or even 250 unique terms with a >> lot >> >> of >> >> >> > users is bad on memory when sorting ? >> >> >> > >> >> >> > We didn't look at trie fields - I will do though, thanks for the >> tip ! >> >> >> > >> >> >> > We do store the original 'Data' field (only the 'SearchableData' >> field >> >> is >> >> >> > analyzed, all other fields are not analyzed), the users mainly >> sort on >> >> >> > numeric values; not a lot on string values (in fact I could >> compltely >> >> >> drop >> >> >> > the sort by string feature). We do pad our numbers with zeros >> though >> >> (for >> >> >> > example: 10 becomes 00000010, etc.) because we had trouble with >> >> sorting >> >> >> > (100 >> >> >> > was smaller than 2) ; is that considered as "string sorting" ? >> This >> >> might >> >> >> > explain a part of the problem. >> >> >> > >> >> >> > Why/how would I reduce the count of unique terms? >> >> >> > >> >> >> > >> >> >> > - Mike >> >> >> > aka...@gmail.com >> >> >> > >> >> >> > >> >> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson < >> >> erickerick...@gmail.com >> >> >> > >wrote: >> >> >> > >> >> >> > > If you have tens of millions of documents, almost all with >> unique >> >> >> fields >> >> >> > > that you're sorting on, you'll chew through memory like there's >> no >> >> >> > > tomorrow. >> >> >> > > >> >> >> > > Have you looked at trie fields? See: >> >> >> > > >> >> >> > > >> >> >> > >> >> >> >> >> >> http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/ >> >> >> > > >> >> >> > > I'm a little concerned that the user can sort on Data. Any field >> >> used >> >> >> for >> >> >> > > sorting >> >> >> > > should NOT be analyzed, so unless you are indexing "Data" >> >> unanalyzed, >> >> >> > > that's >> >> >> > > a problem. And if you are sorting on strings unique to each >> >> document, >> >> >> > > that's >> >> >> > > also a memory hog. Not to mention whether capitalization counts. >> >> >> > > >> >> >> > > You might enumerate the terms in your index for each of the >> sortable >> >> >> > fields >> >> >> > > to figure out what the total number of unique terms each is and >> use >> >> >> that >> >> >> > as >> >> >> > > a basis for reducing their count.... >> >> >> > > >> >> >> > > HTH >> >> >> > > Erick >> >> >> > > >> >> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau < >> aka...@gmail.com> >> >> >> wrote: >> >> >> > > >> >> >> > > > Hi Erick, >> >> >> > > > >> >> >> > > > Here's some more details about our structure. First here's an >> >> example >> >> >> > of >> >> >> > > > document in our index : >> >> >> > > > >> >> >> > > > PrimaryKey = SJAsfsf353JHGada66GH6 (it's a hash) >> >> >> > > > DocType = X >> >> >> > > > Data = This is the data >> >> >> > > > SearchableContent = This is the data >> >> >> > > > DateCreated = <timestamp> >> >> >> > > > DateModified = <timestamp> >> >> >> > > > Counter1 = 17 >> >> >> > > > Counter2 = 3 >> >> >> > > > Average = 0.17 >> >> >> > > > Cost = 200 >> >> >> > > > >> >> >> > > > The users are able to sort on almost all fields: Data, >> >> DateCreated, >> >> >> > > > DateModified, Counter1, Counter2, Average, Cost. >> >> >> > > > >> >> >> > > > When we search, we always search on the 'SearchableContent' >> field >> >> and >> >> >> > we >> >> >> > > > have at least one filter on the DocType (because we have many >> >> >> document >> >> >> > > > types >> >> >> > > > in the same index). So a common search that would find the >> >> document >> >> >> > above >> >> >> > > > is >> >> >> > > > "data *AND DocType:X*" (we automatically add the "*AND >> DocType:X*" >> >> >> part >> >> >> > > > using Lucene Filters. >> >> >> > > > >> >> >> > > > I would say that the number of unique terms in the field being >> >> sorted >> >> >> > on >> >> >> > > is >> >> >> > > > very big - for example timestamps, almost all unique, >> counters, >> >> >> > average, >> >> >> > > > cost, data... so if a query finds 10M results, it's almost 10M >> >> >> > different >> >> >> > > > values to sort. About cache and warm-up queries : we don't use >> >> >> warm-up >> >> >> > > > queries -at all- because we have absolutely no idea of what >> users >> >> are >> >> >> > > going >> >> >> > > > to search for (they can search for absolutely anything). About >> >> >> > "returning >> >> >> > > > 10M" documents, right, we don't actually return the 10M >> documents, >> >> we >> >> >> > use >> >> >> > > > pagination to return documents X to Y of the 10M (and the 10M >> was >> >> >> only >> >> >> > an >> >> >> > > > example, it can be anywhere between 1K and 100M results). The >> >> >> > pagination >> >> >> > > > usually works fine and fast, our problem is really sorting. >> >> >> > > > >> >> >> > > > Our "Lucene Reader" process has 2GB of ram allowed, here's how >> I >> >> >> start >> >> >> > it >> >> >> > > - >> >> >> > > > >> >> >> > > > java -Xmx2048m -jar LuceneReader.jar >> >> >> > > > >> >> >> > > > The problem really seems to be a ram problem, but I can't be >> 100% >> >> >> sure >> >> >> > > (any >> >> >> > > > help about how to be sure is welcome). >> >> >> > > > >> >> >> > > > Our current idea of a solution would be to get maximum 250 >> results >> >> >> (the >> >> >> > > > more >> >> >> > > > relevant ones; more results than that is totally useless in >> our >> >> >> system) >> >> >> > > so >> >> >> > > > the sort should work fine on a small data set like that, but >> we >> >> want >> >> >> to >> >> >> > > > make >> >> >> > > > sure we're doing everything right before doing that so we >> don't >> >> run >> >> >> in >> >> >> > > the >> >> >> > > > same problems again. >> >> >> > > > >> >> >> > > > Thank you very much; let me know if you need any more details! >> >> >> > > > >> >> >> > > > - Mike >> >> >> > > > aka...@gmail.com >> >> >> > > > >> >> >> > > > >> >> >> > > > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson < >> >> >> > erickerick...@gmail.com >> >> >> > > > >wrote: >> >> >> > > > >> >> >> > > > > Let's back up a minute. The number of matched records is not >> >> >> > > > > important when sorting, what's important is the number of >> unique >> >> >> > > > > terms in the field being sorted. Do you have any figures on >> >> that? >> >> >> One >> >> >> > > > > very common sorting issue is sorting on very fine date time >> >> >> > > resolutions, >> >> >> > > > > although your examples don't include that... >> >> >> > > > > >> >> >> > > > > Now, cache loading is an issue. The very first time you sort >> on >> >> a >> >> >> > > field, >> >> >> > > > > all the values are read into a cache. Is it possible this is >> the >> >> >> > source >> >> >> > > > > of your problems? You can cure this with warmup queries. The >> >> >> > take-away >> >> >> > > > > is that measuring the response time for the first sorted >> query >> >> is >> >> >> > > > > very misleading. >> >> >> > > > > >> >> >> > > > > Although if you're sorting on many, many, many email >> addresses, >> >> >> > > > > that could be "interesting". >> >> >> > > > > >> >> >> > > > > The comment "returning 10,000,000 documents" is, I hope, a >> >> >> > > > > misstatement. If you're trying to *return* 10M docs sorting >> >> >> > > > > is irrelevant compared to assembling that many documents. If >> >> >> > > > > you're trying to return the first 100 of 10M documents, it >> >> should >> >> >> > > > > work. >> >> >> > > > > >> >> >> > > > > Overall, we need more details on what you're sorting and >> what >> >> >> > > > > you're trying to return as well as how you're measuring >> before >> >> >> > > > > we can say much.... >> >> >> > > > > >> >> >> > > > > Along with how much memory you're giving your JVM to work >> with, >> >> >> > > > > what "exploding" means. Are you CPU bound? IO bound? >> Swapping? >> >> >> > > > > You need to characterize what is going wrong before worrying >> >> about >> >> >> > > > > solutions...... >> >> >> > > > > >> >> >> > > > > Best >> >> >> > > > > Erick >> >> >> > > > > >> >> >> > > > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau < >> >> aka...@gmail.com> >> >> >> > > wrote: >> >> >> > > > > >> >> >> > > > > > Hi, >> >> >> > > > > > >> >> >> > > > > > we are building an application using Lucene and we have >> HUGE >> >> data >> >> >> > > sets >> >> >> > > > > (our >> >> >> > > > > > index contains millions and millions and millions of >> >> documents), >> >> >> > > which >> >> >> > > > > > obviously cause us very important problems when sorting. >> In >> >> fact, >> >> >> > we >> >> >> > > > > > disabled sorting completely because the servers were just >> >> >> exploding >> >> >> > > > when >> >> >> > > > > > trying to sort results in RAM. The users using the system >> can >> >> >> > search >> >> >> > > > for >> >> >> > > > > > whatever they want, so we never know how many results will >> be >> >> >> > > returned >> >> >> > > > - >> >> >> > > > > a >> >> >> > > > > > search can return 10 documents (no problem with sorting) >> or >> >> >> > > 10,000,000 >> >> >> > > > > > (huge >> >> >> > > > > > sorting problems). >> >> >> > > > > > >> >> >> > > > > > I woke up this morning and had a flash : is it possible >> with >> >> >> Lucene >> >> >> > > to >> >> >> > > > > have >> >> >> > > > > > a "natural sorting" of documents? For example, let's say I >> >> have 3 >> >> >> > > > columns >> >> >> > > > > I >> >> >> > > > > > want to be able to sort by : first name, last name, email; >> I >> >> >> would >> >> >> > > have >> >> >> > > > 3 >> >> >> > > > > > different indexes with the very same data but with a >> different >> >> >> > > primary >> >> >> > > > > key >> >> >> > > > > > for sorting. I know it's far fetched, and I have never >> seen >> >> >> > anything >> >> >> > > > like >> >> >> > > > > > that since I use Lucene, but we're just desperate... how >> >> people >> >> >> do >> >> >> > to >> >> >> > > > > have >> >> >> > > > > > huge data sets, a lot of users, and sort!? >> >> >> > > > > > >> >> >> > > > > > Thanks, >> >> >> > > > > > >> >> >> > > > > > Mike >> >> >> > > > > > >> >> >> > > > > >> >> >> > > > >> >> >> > > >> >> >> > >> >> >> >> >> > >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >