Re: Query that sorts a large result set.

Marcel Reutegger Fri, 19 Jun 2009 01:08:04 -0700

On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers <[email protected]> wrote:
> Although I like this solution, it seems to me to be suitable only for
> dates, right?


yeah, it probably works best with fixed length values.

see also the wiki page I created about this:
http://wiki.apache.org/jackrabbit/ReduceMemOfSharedFieldCache

> How do we know that we are sorting on a date...by checking
> whether it has length 9..or that it starts with msq?

as of IndexFormatVersion V3 (jackrabbit 1.5) the property type is
stored as a payload on the indexed term.

> Furthermore, I am
> quite curious how you implemented this below. If you just used
> substrings, we could gain quite a bit more, but I am not sure
> whether you already do this:
>
> Suppose
>
> String s = "msqyw2shb";
>
> If you are having
>
> String[0] = s.substring(0, 3);
>
> we reduce memory usage quite a bit more with
>
> String[0] = new String(s.substring(0, 3));
>
> Also see [1]. But perhaps you are already doing this.

yes, I already did. I've put the test code on the wiki:

> A small improvement we could make directly is replacing:
>
> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>
> with
>
> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));

hmm, you're right. it was actually my intention to reduce memory usage
by only keeping the significant part of the term. we should fix that.
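
A minimal sketch of that fix, for illustration only. The class name
CompactTerm, the helper significantPart and the ':' delimiter are
assumptions made for this example, not Jackrabbit code. On the JVMs
current at the time of this thread, String.substring shared the parent
string's char array, so the explicit copy is what releases the prefix.
Since Java 7u6 substring copies on its own and the extra new String(...)
is redundant but harmless.

```java
final class CompactTerm {

    // Hypothetical helper: drops the property-name prefix from an indexed
    // term and returns a string backed by a right-sized char array.
    static String significantPart(String termText, String prefix) {
        // new String(...) forces a copy of only the characters we keep,
        // so the (longer) term text is no longer retained on old JVMs.
        return new String(termText.substring(prefix.length()));
    }

    public static void main(String[] args) {
        // Assumed layout: 12-char property name plus a 1-char delimiter.
        String prefix = "lastModified:";
        String term = prefix + "msqyw2shb";
        System.out.println(significantPart(term, prefix));
    }
}
```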

regards
 marcel

> It is a bit strange, but for dates I think the prefix is
> something like "lastModified" plus a delimiter, so about 13 chars. This
> would bring the char array retained in memory down from 22 chars to
> 9 (for dates).
>
> Furthermore, it follows that using short property names saves you
> memory. This could be avoided in the end if we indexed each property in
> its own lucene field, instead of putting them all in :_PROPERTIES and
> prefixing the value with the property name. This would require quite
> some rewrite of the indexing, I think.
>
> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>
>
>
> On Thu, Jun 18, 2009 at 1:25 PM, Marcel
> Reutegger<[email protected]> wrote:
> > On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> 
> > wrote:
> >> If you happen to find the holy grail solution, I suppose you'll let us know
> >> :-) Also if you would have some memory usage numbers with and without the
> >> suggestion of mine regarding reducing the precision of your Date field, this
> >> would be very valuable.
> >
> > hmm, I've been thinking about a solution that I would call
> > flyweight-substring-collation-key. it assumes that there is usually a
> > major overlap of substrings of the values to sort on, e.g. for a
> > lastModified value. so instead of always keeping the entire value we'd
> > have a collation key that references multiple reusable substrings.
> >
> > assume we have the following values:
> >
> > - msqyw2shb
> > - msqyw2t93
> > - msqyw2u0v
> > - msqyw2usn
> > - msqyw2vkf
> > - msqyw2wc7
> > - msqyw2x3z
> > - msqyw2xvr
> > - msqyw2ynj
> > - msqyw2zfb
> >
> > (those are date property values each 1 second after the previous one)
> >
> > we could create collation keys for use as comparable in the field
> > cache like this:
> >
> > substring cache:
> > [0] msq
> > [1] shb
> > [2] t93
> > [3] u0v
> > [4] usn
> > [5] vkf
> > [6] wc7
> > [7] x3z
> > [8] xvr
> > [9] ynj
> > [10] yw2
> > [11] zfb
> >
> > and then the actual comparables that reference the substrings in the cache:
> >
> > - {0, 10, 1}
> > - {0, 10, 2}
> > - {0, 10, 3}
> > - {0, 10, 4}
> > - {0, 10, 5}
> > - {0, 10, 6}
> > - {0, 10, 7}
> > - {0, 10, 8}
> > - {0, 10, 9}
> > - {0, 10, 11}
> >
> > this will result in a lower memory consumption and using the reference
> > indexes could even speed up the comparison.
> >
> > a quick test with 1 million date values showed that the memory
> > consumption drops to 50% with this approach.
> >
> > regards
> >  marcel
> >
