On Fri, Jun 19, 2009 at 10:07 AM, Marcel Reutegger <marcel.reuteg...@gmx.net> wrote:
> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers <a.schrijv...@onehippo.com>
> wrote:
> > Much as I like this solution, it seems to me to be suitable only for dates,
> > right?
>
> yeah, it probably works best with fixed length values.

And for similar strings..so for dates, a common thing to sort on, you have
achieved a 50% memory reduction, which is really nice (as I think the other 50%
is retained by Lucene)

> see also the wiki page I created about this:
> http://wiki.apache.org/jackrabbit/ReduceMemOfSharedFieldCache
>
> > How do we know that we are sorting on a date...by checking
> > whether it has length 9..or that it starts with msq?
>
> as of IndexFormatVersion V3 (jackrabbit 1.5) the property type is
> stored as a payload on the indexed term.

Great, I did not know. I haven't had time to play with payloads yet. Do they
retain memory? Do you happen to know if you can store multiple payloads on a
term? If so, it might be possible to store, say, the short_title as a payload,
and we could choose to order by short_title (and perhaps sort on the entire
title only for the ambiguous documents whose short_title shares the same first
6 chars)...just thinking out loud, not sure if this is total nonsense, as I did
not look at any code.

> > retArray[termDocs.doc()] = new
> > String(term.text().substring(prefix.length()));
>
> hmm, you're right. it was actually my intention to reduce memory usage
> by only keeping the significant part of the term. we should fix that.

It is a nasty String gotcha :-). The saving is not huge, just a couple of
bytes per cached term.

Regards Ard

> regards
> marcel
>
> > It is a bit strange, but as for dates I think the prefix.length is
> > something like "lastModified" and a delimiter, suppose 13 chars..this
> > would bring the char array retained in memory back from 22 to
> > 9...(for dates)
> >
> > Furthermore, it follows that using short property names saves you
> > memory.
> > This could be avoided in the end if we index each property in
> > its own lucene field, instead of all in :_PROPERTIES and prefix the
> > value with the propertyname..this though requires quite some rewrite
> > for indexing i think.
> >
> > [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
> >
> > On Thu, Jun 18, 2009 at 1:25 PM, Marcel
> > Reutegger <marcel.reuteg...@day.com> wrote:
> > > On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers
> > > <a.schrijv...@onehippo.com> wrote:
> > >> If you happen to find the holy grail solution, I suppose you'll let us know
> > >> :-) Also if you would have some memory usage numbers with and without the
> > >> suggestion of mine regarding reducing the precision of your Date field, this
> > >> would be very valuable.
> > >
> > > hmm, I've been thinking about a solution that I would call
> > > flyweight-substring-collation-key. it assumes that there is usually a
> > > major overlap of substrings of the values to sort on, e.g. a
> > > lastModified value. so instead of always keeping the entire value we'd
> > > have a collation key that references multiple reusable substrings.
> > >
> > > assume we have the following values:
> > >
> > > - msqyw2shb
> > > - msqyw2t93
> > > - msqyw2u0v
> > > - msqyw2usn
> > > - msqyw2vkf
> > > - msqyw2wc7
> > > - msqyw2x3z
> > > - msqyw2xvr
> > > - msqyw2ynj
> > > - msqyw2zfb
> > >
> > > (those are date property values each 1 second after the previous one)
> > >
> > > we could create collation keys for use as comparable in the field
> > > cache like this:
> > >
> > > substring cache:
> > > [0] msq
> > > [1] shb
> > > [2] t93
> > > [3] u0v
> > > [4] usn
> > > [5] vkf
> > > [6] wc7
> > > [7] x3z
> > > [8] xvr
> > > [9] ynj
> > > [10] yw2
> > > [11] zfb
> > >
> > > and then the actual comparables that reference the substrings in the
> > > cache:
> > >
> > > - {0, 10, 1}
> > > - {0, 10, 2}
> > > - {0, 10, 3}
> > > - {0, 10, 4}
> > > - {0, 10, 5}
> > > - {0, 10, 6}
> > > - {0, 10, 7}
> > > - {0, 10, 8}
> > > - {0, 10, 9}
> > > - {0, 10, 11}
> > >
> > > this will result in a lower memory consumption and using the reference
> > > indexes could even speed up the comparison.
> > >
> > > a quick test with 1 million date values showed that the memory
> > > consumption drops to 50% with this approach.
> > >
> > > regards
> > > marcel
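
For illustration, here is a minimal Java sketch of the flyweight-substring-collation-key idea quoted above. It is not the Jackrabbit code: the class name FlyweightKeys, the fixed chunk length of 3 and all method names are assumptions made up for this example. It splits each value into chunks, keeps every distinct chunk once in a sorted cache, and represents each value as an array of indexes into that cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch only; names and chunk length are assumptions.
class FlyweightKeys {
    static final int CHUNK = 3; // assumed chunk length, matches the example values

    // Split a value into consecutive chunks of CHUNK chars (the last may be shorter).
    static List<String> chunks(String value) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            parts.add(value.substring(i, Math.min(value.length(), i + CHUNK)));
        }
        return parts;
    }

    // Build the substring cache: every distinct chunk, stored once, in sorted order.
    static String[] buildCache(List<String> values) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : values) {
            distinct.addAll(chunks(v));
        }
        return distinct.toArray(new String[0]);
    }

    // Encode a value as indexes into the sorted cache.
    // Every chunk of the value must already be present in the cache.
    static int[] encode(String value, String[] cache) {
        List<String> parts = chunks(value);
        int[] key = new int[parts.size()];
        for (int i = 0; i < key.length; i++) {
            key[i] = Arrays.binarySearch(cache, parts.get(i));
        }
        return key;
    }

    // Compare two keys index by index; because the cache is sorted,
    // a smaller index means a smaller chunk.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "msqyw2shb", "msqyw2t93", "msqyw2u0v", "msqyw2usn", "msqyw2vkf",
                "msqyw2wc7", "msqyw2x3z", "msqyw2xvr", "msqyw2ynj", "msqyw2zfb");
        String[] cache = buildCache(values);
        for (String v : values) {
            System.out.println(v + " -> " + Arrays.toString(encode(v, cache)));
        }
    }
}
```

With the ten values from the mail this yields a 12-entry cache and keys such as {0, 10, 1} for msqyw2shb. The cache has to be sorted before any key is built, otherwise comparing indexes would not agree with comparing the original strings; that is why the sketch needs two passes over the values.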