Great explanation Erick. Thanks. I'll try that. On Wed, Apr 28, 2010 at 8:27 PM, Erick Erickson <erickerick...@gmail.com>wrote:
> Quick reply to question (4). Not quite right, you're not gaining anything, > you still have a yymmddHHMMSS field that will consume all the memory > when you sort by it. > > Remember that the controlling part here is the number of unique values. So > think about two fields, yymmdd and HHMMSS. Use the HHMMSS as the > secondary sort and yymmdd as the primary. This will sort correctly since > any time HHMMSS is used, the yymmdd is guaranteed be equal. > > And you can extend this ad nauseum. For instance, you could use 6 > fields, yy, mm, dd, HH, MM, SS and have a very small number of > unique values in each using really tiny amounts of memory to sort down > to the second in this case. > > Best > Erick > > On Wed, Apr 28, 2010 at 2:24 AM, Samarendra Pratap <samarz...@gmail.com > >wrote: > > > I have got a lot of valuable information in this thread so far. > > Thanks to all. > > > > In my last mail I mentioned only two fields because others' usage was > > negligible and I thought they are not important. But now after *Toke > > *explained > > the formulae, I think sorting on those fields would also be consuming a > > huge > > part of memory.There are 2 other sorting fields; one of which is used in > > both ascending/descending sorting. > > > > Within next couple of days (or may be a week) I'll be > > > > 1. profiling my application, > > 2. analyzing and tuning GC options > > > > > > > > However, I have a few more curiosities - > > > > 1. Tom wrote: > > > > *Have you checked that your machine is correctly identified as a server* > > *and has optimized GC settings?* > > > > *I did not understand the meaning of "correctly identified as a server" > Can > > you please help me understand?* > > * > > * > > 2. *Should I change the type of fields?* > > ** As I said in my first mail that I have 56 fields in my index, most of > > them contain a numeric value or one of system defined values (e.g. gender > > field can contain only "male", "female", or "unknown"). There are only 7 > > fields which are indexed with user defined values. > > All the fields are created with *Field* > > (String name, String value, Field.Store store, Field.Index index) > > It would be creating all the fields as normal string fields. Is it > > *always*a good idea to use specific classes (NumericField, DateTime > > etc.). We do not > > have space problem if that matters. > > > > 3. *Is there any advice on number of fields?* > > *Somewhere on the net I read that instead of keeping different type of > > values in different fields, (e.g. field1:value1, field2:value2,...) one > > should practice keeping different values in single field (e.g. > > field:field1_value1, > > field:field2_value2,...). But I could not confirm it from anywhere else. > > Any > > comments?* > > > > 4. Ian wrote: > > > > *Sorting by score down to the second will use a lot of memory. Can you* > > *make it less granular?* > > > > Is it less painful sorting on two fields; first on yymmdd and then on > > yymmddHHMMSS than sorting just on latter? (Naturally it should use second > > field, only where required but technically ...?) > > > > > > Thanks again for the invaluable support I am getting from here. > > > > - Samar > > > > On Wed, Apr 28, 2010 at 9:12 AM, Lance Norskog <goks...@gmail.com> > wrote: > > > > > Solr's timestamp representation (TrieDateField) is tuned for space and > > > speed. It has a compressed representation, and sorts with far less > > > space than Strings. > > > > > > Also you get something called a date facet, which lets you bucketize > > > facet searches by time block. > > > > > > On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen < > t...@statsbiblioteket.dk> > > > wrote: > > > > Samarendra Pratap [samarz...@gmail.com] wrote: > > > >> 1. Our default option is sort by score, however almost 8% of > searches > > > use > > > >> sorting on a field (yyyymmddHHMMSS). This field is indexed as string > > > (not as > > > >> NumericField or DateField). > > > > > > > > Guessing that the timestamp is practically unique for each document, > > > sorting by String takes up a bit more than > > > > 18M * (40 bytes + 2 * "yyyymmddHHMMSS".length() bytes) ~= 1.2 GB of > RAM > > > as the Strings are cached. Coupled with the normal overhead of just > > opening > > > an index of your size (500MB by your measurements?), I would have > guessed > > > that 3600MB would definitely be enough to open the index and do sorted > > > searches. > > > > > > > > I realize that fiddling with production servers is dangerous, but > > > connecting with JConsole and forcing a garbage collection might be > > > acceptable? That should enable you to determine whether you're leaking > > > memory or if it's just the JVM being greedy. I'd guess you leaking > > though, > > > as HotSpot does not normally allocate up to the limit if it does not > need > > > to. > > > > > > > > Anyway, changing to one of the optimized fields for sorting dates > > should > > > shave 1 GB off the memory requirement, so I'll recommend doing that no > > > matter what the main cause of your memory problems is. > > > > > > > > Regards, > > > > Toke Eskildsen > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > > > > > > > > > > -- > > > Lance Norskog > > > goks...@gmail.com > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > > -- > > Regards, > > Samar > > > -- Regards, Samar