On Tue, Sep 11, 2012 at 10:43 AM, Toke Eskildsen <t...@statsbiblioteket.dk> 
wrote:
>
> ICU sort keys are always null (00) terminated and when two keys are
> compared, the comparison stops as soon as null is reached(?)
> http://userguide.icu-project.org/collation/architecture
>
> If we concatenate the keys with the original values:
> <key><00><original value><offset of original value>
> we get an entity where the ordering is still correct upon comparison and
> where the original value can be extracted by using the offset from the
> last int (or maybe short, to spare 2 bytes) in the BytesRef.
>

I think the idea is sound, but I don't think we need the offset. I'm
fairly positive ICU collation keys explicitly avoid 0 bytes except for
the null terminator, so the original value can be extracted after the
fact just by looking for the terminator. Such a thing could even be done
client-side, and I don't think we need the offset for speed either,
because it's something you would do before final display.
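
To make the terminator-scan idea concrete, here is a minimal sketch. It
assumes the sort key arrives as a byte[] that ends in a single 0 byte and
contains no other 0 bytes (the very property that still needs verifying).
The class and method names are hypothetical and not part of Lucene or ICU.
In ICU4J the key bytes would presumably come from something like
CollationKey.toByteArray(), but that is not exercised here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class KeyWithOriginal {

    // Builds <key><00><original value>. The 0 byte is the key's own terminator.
    static byte[] concat(byte[] sortKey, String original) {
        // Assumption: the key is null-terminated and has no interior 0 bytes.
        if (sortKey.length == 0 || sortKey[sortKey.length - 1] != 0) {
            throw new IllegalArgumentException("sort key is not null-terminated");
        }
        byte[] value = original.getBytes(StandardCharsets.UTF_8);
        byte[] combined = Arrays.copyOf(sortKey, sortKey.length + value.length);
        System.arraycopy(value, 0, combined, sortKey.length, value.length);
        return combined;
    }

    // Recovers the original value by scanning for the first 0 byte.
    // No stored offset is needed as long as the key itself avoids 0 bytes.
    static String extractOriginal(byte[] combined) {
        for (int i = 0; i < combined.length; i++) {
            if (combined[i] == 0) {
                return new String(combined, i + 1, combined.length - i - 1,
                        StandardCharsets.UTF_8);
            }
        }
        throw new IllegalArgumentException("no terminator found");
    }
}
```

The scan is linear in the key length, which should be acceptable if it
only runs on the handful of values that are actually displayed.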

We need to verify that what I'm saying about avoiding 0 bytes is true;
I'll look into it.

Of course, such an option is only useful for the new
ICUCollationAnalyzer (Solr's ICUCollationField uses that), because the
older deprecated filters are encoded in a different way. I think we
should leave those alone.

-- 
lucidworks.com

