Chris Duncan created COLLECTIONS-577:
----------------------------------------
Summary: PatriciaTrie bugs when only a few bits change
Key: COLLECTIONS-577
URL: https://issues.apache.org/jira/browse/COLLECTIONS-577
Project: Commons Collections
Issue Type: Bug
Components: Map
Affects Versions: 4.0
Reporter: Chris Duncan
Priority: Critical
I have a bug report for you. It has to do with how you handle bits when they
are very close to each other. For example, some of your methods seem to think
that if the only difference between a prefix and a longer string, is a single
additional bit, then they are actually the same data. Or if the only
difference is some number of zero bits, then it also thinks they are the same
data. There are also MANY situations where the prefixMap does not return all
the strings that start with the prefix.
Take the following code for example, which shows the bugs happening in a very
simple context.
I create two strings. The first string is always a prefix of the second
string. Here are the characters I am using:
final char u0000 = Character.toChars(0)[0]; // U+0000 (0000000000000000)
final char u0001 = Character.toChars(1)[0]; // U+0001 (0000000000000001)
final char u8000 = Character.toChars(32768)[0]; // U+8000 (1000000000000000)
final char uffff = Character.toChars(65535)[0]; // U+FFFF (1111111111111111)
final char char_b = 'b'; // 1100010
final char char_c = 'c'; // 1100011
final PatriciaTrie<String> trie = new PatriciaTrie<>();
And here is a quick example of the trie working correctly, showing how it
should work:
final String prefixString = "" + char_b;
final String longerString = prefixString + char_c;
System.out.println(trie.prefixMap(prefixString)); // {b=prefixString,
bc=longerString} // correct!
In the first example, I show that a character who's bits are all zeros is
always mishandled:
final String prefixString = "" + char_b;
final String longerString = prefixString + u0000;
System.out.println(prefixString);
System.out.println(prefixString.length()); // 1
System.out.println(longerString);
System.out.println(longerString.length()); // 2
System.out.println(longerString.startsWith(prefixString)); // true
trie.put(prefixString, "prefixString");
System.out.println(trie); // prefixString is in
trie.put(longerString, "longerString");
System.out.println(trie); // only longerString shown... prefixString has
disappeared
System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but
should be 2
System.out.println(trie.prefixMap(prefixString)); // {b =longerString} //
prefixString should be here, but isn't
trie.put(prefixString, "prefixString");
System.out.println(trie); // prefixString is in again, but longerString has
now disappeared
System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but
should be 2
System.out.println(trie.prefixMap(prefixString)); // {b=prefixString} //
longerString should be here, but isn't
Next, I show that if the longer string is only 1 bit longer (ignoring zeros)
than the prefix string, then the PatriciaTree fails to include it in the prefix
map.
Here the string would look like: 0000000001100011 1000000000000000
final String prefixString = "" + char_c; // you can use any character for
the prefix, same results
final String longerString = prefixString + u8000;
System.out.println(prefixString);
System.out.println(prefixString.length()); // 1
System.out.println(longerString);
System.out.println(longerString.length()); // 2
System.out.println(longerString.startsWith(prefixString)); // true
trie.put(prefixString, "prefixString");
System.out.println(trie); // prefixString is in
trie.put(longerString, "longerString");
System.out.println(trie); // both are in
System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but
should be 2
System.out.println(trie.prefixMap(prefixString)); // {c=prefixString} //
longerString should be here, but isn't
And again, except flipping it so that the prefix ends with a 1, and the longer
string starts with 1's:
final String prefixString = "" + u0003;
final String longerString = prefixString + u8000; // can also use uffff
here, same result
System.out.println(prefixString);
System.out.println(prefixString.length()); // 1
System.out.println(longerString);
System.out.println(longerString.length()); // 2
System.out.println(longerString.startsWith(prefixString)); // true
trie.put(prefixString, "prefixString");
System.out.println(trie); // prefixString is in
trie.put(longerString, "longerString");
System.out.println(trie); // both are in
System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but
should be 2
System.out.println(trie.prefixMap(prefixString)); // {c=prefixString} //
longerString should be here, but isn't
The class is honestly pretty complex and I wasn't able to completely debug why
this is behaving badly, but I believe it is because of how you are handling the
"bitIndex" and "bitLength".
For example, in the method AbstractPatriciaTrie .subtree(), my comments in bold
(this definitely isn't the only place with weird handling of bitIndex and
lengthInBits):
TrieEntry<K, V> subtree(final K prefix, final int offsetInBits, final int
lengthInBits) {
TrieEntry<K, V> current = root.left;
TrieEntry<K, V> path = root;
while(true) {
// If our bit index has only increased by 1, then our bitIndex will never get
to be greater than lengthInBits, we could still start with the prefix, yet we
have no way to break out of here with current being set to the longer string.
if (current.bitIndex <= path.bitIndex || lengthInBits <
current.bitIndex) {
break;
}
path = current;
if (!isBitSet(prefix, offsetInBits + current.bitIndex, offsetInBits
+ lengthInBits)) {
current = current.left;
} else {
current = current.right;
}
}
...
Can you also make AbstractPatriciaTrie public, and your other package level
methods into protected level, that way I don't have to copy the entire class
and subclasse's code out into another class just to extend it?
thank you,
Chris Duncan (github user: VEQRYN)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)