On Wed, 21 Mar 2012 08:11:55 -0700, Palani TT <[email protected]> wrote:
> Hi,
>
> I have a confusion on which collation to use while defining a range
> index for a String type. I understand that 'root collation' (the default
> collation for String), returns duplicate results. So, would 'Unicode
> Codepoint' collation alone would suffice for a String range index or
> should I have both the 'root collation' as well as the 'Unicode
> Codepoint' collation defined for String range indexes?
>
> Thanks,
> Palani

I don't quite know what you mean by "returns duplicate results".

The collation you should use depends on what values you want to consider
equivalent and what order you want things to appear in.

The Unicode codepoint collation will order all the uppercase values before
any of the lowercase values and will store values beginning with a letter
with a diacritic after all of those. It will store distinct entries for all
the variants. So "Resume", "resume", "résume", and "Résumé" are all
different entries in the range index. If you are doing a case/diacritic
insensitive match against that range index, we'll need to scan through all
the values starting with R then all the words starting with S, T,..., Z, a,
b, ..., and r to check.

The root collation will order all the words starting with 'a' before any
word starting with 'b', regardless of case or diacritics on the a. The
default strength on the root collation is S3, so case and diacritical
variants are still stored separately. The root collation will collapse
(treat as equivalent) normalization variants (e.g. "é" vs "e"+accent) but
for string range indexes this makes no practical difference as all the
strings are normalized to NFC before we put them in the index anyway.

If you use S1, then case and diacritic variants will be collapsed, so for
"Resume", "resume", "résume", and "Résumé" there will be only one entry in
the index. This can make case/diacritic insensitive matching much more
efficient (but case/diacritic *sensitive* matching impossible).

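The ordering and collapsing behaviour described above can be illustrated
with a small Python sketch. This is not MarkLogic code and does not use the
real collation implementation: plain string comparison stands in for the
codepoint collation, and the hypothetical helper primary_key() only
approximates an S1 key by stripping combining marks and case. The word list
is made up for the illustration.

```python
import unicodedata

# Sample values, normalized to NFC as the index would store them.
words = [unicodedata.normalize("NFC", w) for w in
         ["resume", "Résumé", "apple", "Resume", "Zebra", "résume", "élan"]]

# Codepoint order: Python compares str values code point by code point,
# so uppercase sorts first, then lowercase, then accented initial letters.
codepoint_order = sorted(words)

def primary_key(s):
    """Assumed approximation of a strength-1 key: ignore case and diacritics."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Ordering by the approximate key groups words by base letter,
# regardless of case or diacritics on the first letter.
base_letter_order = sorted(words, key=primary_key)

# Count distinct index entries for the four variants under each scheme.
variants = [w for w in words if primary_key(w) == "resume"]
distinct_codepoint = len(set(variants))                   # 4 separate entries
distinct_s1 = len({primary_key(v) for v in variants})     # 1 collapsed entry

# Normalization variants become identical once both are in NFC.
same_after_nfc = unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

print(codepoint_order)
print(base_letter_order)
print(distinct_codepoint, distinct_s1, same_after_nfc)
```

In the codepoint ordering, "Resume" and "resume" end up far apart, with
"Zebra" and "apple" between them, which is why an insensitive match has to
scan so much of the index. Under the collapsed key the four variants share
one entry.
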
If you are not collapsing values, the codepoint collation is generally
about 10% faster in its operations.

Note that when string ranges are used to optimize queries, the collation on
the range index has to match the query collation, so you are generally
better off picking a consistent collation that matches your appserver
default collation.

//Mary

[email protected]
Principal Engineer
MarkLogic Corporation

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
