[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651422#action_12651422 ]
Uwe Schindler commented on LUCENE-1470:
---------------------------------------

I do not know what you did in your code. Did you convert the byte[] arrays directly to a String, e.g. by using new String(byte[], charset), or without a charset (in which case Java uses the platform default, which may be UTF-8)? Then the byte 216 (0xd8) would be interpreted by new String() as part of a UTF-8/16 sequence or whatever, and would map to some unknown char. UnicodeUtils, on the other hand, encodes the Java chars to UTF-8 when storing them in the index, but it does not like chars >0xd800 and replaces them with the replacement char for unknown chars. This modifies the data, as a char >0xd800 is not valid (see the source of UnicodeUtils), and 0xd800 looks like 0xd8 (==216).

My code does not transform a byte[] directly into a String; it creates a standard 16-bit Java char from each byte, plus an offset. As the trie code only produces bytes between 0 and 255 and adds an offset of 0x30, the range of chars is 0x30..0x12f. This range is Unicode-safe and can easily be encoded to UTF-8. But I will write an explicit test case for that tomorrow!

Maybe your problem was the misuse I mentioned: generating Strings directly from byte arrays using an unknown or incorrect charset. Keep me informed!

By the way, Earwin Burrfoot: are you sure your 15-bit encoding works with the latest Lucene version? Maybe this affects you, too.
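To make the difference concrete, here is a minimal sketch of the two approaches discussed in the comment. It is not the TrieUtils source: the class and method names are hypothetical, and only the offset 0x30 and the byte values come from the comment. It maps each byte to one char in 0x30..0x12f and contrasts that with decoding the raw bytes as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene or the patch.
class ByteOffsetEncoding {
    // Offset mentioned in the comment; bytes 0..255 become chars 0x30..0x12f.
    static final int OFFSET = 0x30;

    // One char per byte: treat the byte as unsigned, then add the offset.
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            sb.append((char) ((b & 0xff) + OFFSET));
        }
        return sb.toString();
    }

    // Inverse mapping: subtract the offset and narrow back to a byte.
    static byte[] decode(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) (s.charAt(i) - OFFSET);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0, (byte) 0xd8, (byte) 0xff};

        // Offset encoding survives a UTF-8 round trip unchanged.
        String safe = encode(data);
        String back = new String(safe.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("offset round trip ok: " + Arrays.equals(data, decode(back)));

        // Decoding the raw bytes as UTF-8 hits malformed input (0xd8 without
        // a continuation byte), so the decoder substitutes U+FFFD.
        String naive = new String(data, StandardCharsets.UTF_8);
        System.out.println("naive contains U+FFFD: " + (naive.indexOf('\uFFFD') >= 0));
    }
}
```

Because every char produced by encode() lies far below the surrogate range that starts at 0xd800, a UTF-8 encoder has no reason to substitute anything, which is the property the comment relies on.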
> Add TrieRangeQuery to contrib
> -----------------------------
>
>                 Key: LUCENE-1470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch
>
>
> According to the thread in java-dev
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to
> include my fast numerical range query implementation in Lucene
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in the index in a special format.
> The idea behind this is to store the longs in different precisions in the
> index and to partition the query range in such a way that the outer
> boundaries are searched using terms from the highest precision, but the
> center of the search range with lower precision. The implementation stores
> the longs in 8 different precisions (using a class called TrieUtils). It
> also has support for doubles, using the IEEE 754 floating-point "double
> format" bit layout with some bit mappings to make them binary sortable.
> The approach is used in rather big indexes; query times are <<100 ms (!)
> even on low-performance desktop computers, for very big ranges on indexes
> with 500000 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known trie structures (it is not identical
> to real tries, but the algorithms are related).
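The issue description mentions "bit mappings" that make IEEE 754 doubles binary sortable but does not spell them out. The sketch below shows one widely used mapping and is only an assumption about the general technique, not the mapping actually implemented in TrieUtils; the class and method names are hypothetical. It converts a double to a long whose signed ordering matches the numeric ordering of the doubles.

```java
// Hypothetical illustration class; not taken from the LUCENE-1470 patch.
class SortableDoubleSketch {
    // Map a double to a long that sorts (as a signed long) in numeric order.
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set and sort "backwards" as raw
        // bits, so flip the 63 non-sign bits to reverse their order.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    // Inverse mapping: the same XOR undoes itself for negative values.
    static double fromSortableLong(long sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(sortable);
    }

    public static void main(String[] args) {
        double[] values = {-2.5, -1.0, 0.0, 1.0, 2.5};
        for (double v : values) {
            System.out.println(v + " -> " + Long.toHexString(toSortableLong(v)));
        }
    }
}
```

To compare such values as unsigned byte or char sequences, as an index of terms would, one would additionally flip the sign bit so that negative numbers sort before positive ones; that step is left out here.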