From: Robert Muir [rcm...@gmail.com]:
> Right, JDK collation sucks, use the ICU for collation keys too:
> http://site.icu-project.org/charts/collation-icu4j-sun
> at 1.59 bytes/char, that's less than UTF-16

Ah... I should have seen that. It does not change the overall picture though:
Although the ICU collation keys are impressively small, they still take up
nearly as much space as the original Strings when they themselves are
represented as Strings. Thus the collation keys do not help memory usage
(much).

When they are stored as bytes, it helps significantly, but even then there's
still a huge difference between having them in memory and using an array of
positions. Even with optimal storage (the collator keys take up exactly the
number of bytes they contain), an index of 10M documents with 10M unique terms
of length 20 in a sort field would use about 300MB for a given locale vs. the
10M*log2(10M)/8 = 27MB for a compressed order array.
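
For anyone who wants to play with the numbers, the arithmetic behind those two
figures can be sketched roughly as below. The function and parameter names are
made up for illustration, the 1.59 bytes/char is the figure from the ICU chart
quoted above, and object and array overhead is ignored entirely (the "optimal
storage" case), so treat the results as lower bounds rather than measurements.

```python
import math

MIB = 1024 * 1024  # report sizes in binary megabytes


def estimate_bytes(num_terms, term_length, key_bytes_per_char):
    """Return (collation_key_bytes, order_array_bytes) for one sort field.

    Assumes every term has the same length and that keys are stored with
    no per-object overhead. Both are simplifying assumptions.
    """
    # Raw collation keys: one key per unique term.
    key_bytes = num_terms * term_length * key_bytes_per_char
    # Packed order array: one entry of log2(num_terms) bits per term.
    order_bytes = num_terms * math.log2(num_terms) / 8
    return key_bytes, order_bytes


keys, order = estimate_bytes(10_000_000, 20, 1.59)
print(f"collation keys: {keys / MIB:.0f} MiB")
print(f"order array:    {order / MIB:.0f} MiB")
```

Rounding the entry width up to a whole number of bits (24 for 10M terms) makes
the order array slightly larger, but the gap to the stored keys stays at about
an order of magnitude.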

Still, depending on how little space a byte-array will take in flex, using the 
indexed collator key approach might turn out to be the best choice in a lot of 
cases as it works really well for incremental updates.

Regards,
Toke Eskildsen 
