On Tue, Sep 11, 2012 at 10:43 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>
> ICU sort keys are always null (00) terminated and when two keys are
> compared, the comparison stops as soon as null is reached(?)
> http://userguide.icu-project.org/collation/architecture
>
> If we concatenate the keys with the original values:
> <key><00><original value><offset of original value>
> we get an entity where the ordering is still correct upon comparison and
> where the original value can be extracted by using the offset from the
> last int (or maybe short, to spare 2 bytes) in the BytesRef.
>
I think the idea is sound, but I don't think we need the offset. I'm fairly sure ICU collation keys explicitly avoid 0 bytes except for the null terminator, so the original value can be extracted after the fact just by looking for the terminator. Such a thing could even be done client-side, and I don't think we need the offset for speed either, because it's something you would do before final display.

We need to verify that what I'm saying about avoiding 0 bytes is true; I'll look into it.

Of course, such an option is only useful for the new ICUCollationKeyAnalyzer (Solr's ICUCollationField uses that), because the older deprecated filters are encoded in a different way. I think we should leave those alone.

-- 
lucidworks.com
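
For reference, the extract-by-terminator idea can be sketched roughly as below. This is only an illustration in Python, not Lucene or ICU code. It assumes the still-unverified property that a sort key contains no 0 bytes before its terminator. The names pack, unpack and toy_sort_key are hypothetical, and toy_sort_key is a stand-in (case-folded UTF-8), not a real ICU collation key.

```python
from typing import Tuple


def toy_sort_key(text: str) -> bytes:
    # Hypothetical stand-in for a collation key, NOT real ICU output.
    # Case-folded UTF-8 has no 0 bytes as long as the text has no U+0000.
    return text.casefold().encode("utf-8")


def pack(sort_key: bytes, original: str) -> bytes:
    # Build <key><00><original value>, with no trailing offset.
    # sort_key is passed without its terminator; the 0 byte is added here.
    if b"\x00" in sort_key:
        # The assumption under discussion does not hold for this key.
        raise ValueError("sort key contains a 0 byte")
    return sort_key + b"\x00" + original.encode("utf-8")


def unpack(packed: bytes) -> Tuple[bytes, str]:
    # Split at the FIRST 0 byte: everything before it is the key,
    # everything after it is the original value (which may itself
    # contain 0 bytes without causing trouble).
    key, sep, original = packed.partition(b"\x00")
    if not sep:
        raise ValueError("no terminator found")
    return key, original.decode("utf-8")


if __name__ == "__main__":
    words = ["banana", "Apple", "cherry"]
    # Plain byte-wise comparison of the packed values orders by key first.
    packed = sorted(pack(toy_sort_key(w), w) for w in words)
    for p in packed:
        print(unpack(p)[1])
```

Since 0 is the smallest byte value and is assumed never to occur inside a key, a key that is a prefix of another still sorts first, and entries with equal keys fall back to the bytes of the original value.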