Chris Duncan created COLLECTIONS-577:
----------------------------------------

             Summary: PatriciaTrie bugs when only a few bits change
                 Key: COLLECTIONS-577
                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-577
             Project: Commons Collections
          Issue Type: Bug
          Components: Map
    Affects Versions: 4.0
            Reporter: Chris Duncan
            Priority: Critical


I have a bug report for you.  It has to do with how you handle bits when they 
are very close to each other.  For example, some of your methods seem to think 
that if the only difference between a prefix and a longer string, is a single 
additional bit, then they are actually the same data.  Or if the only 
difference is some number of zero bits, then it also thinks they are the same 
data.  There are also MANY situations where the prefixMap does not return all 
the strings that start with the prefix.

Take the following code for example, which shows the bugs happening in a very 
simple context.
I create two strings.  The first string is always a prefix of the second 
string.  Here are the characters I am using:

    final char u0000 = Character.toChars(0)[0]; // U+0000 (0000000000000000)
    final char u0001 = Character.toChars(1)[0]; // U+0001 (0000000000000001)
    final char u8000 = Character.toChars(32768)[0]; // U+8000 (1000000000000000)
    final char uffff = Character.toChars(65535)[0]; // U+FFFF (1111111111111111)
    final char char_b = 'b'; // 1100010
    final char char_c = 'c'; // 1100011
    final PatriciaTrie<String> trie = new PatriciaTrie<>();

And here is a quick example of the trie working correctly, showing how it 
should work:

    final String prefixString = "" + char_b;
    final String longerString = prefixString + char_c;
    System.out.println(trie.prefixMap(prefixString)); // {b=prefixString, 
bc=longerString} // correct!


In the first example, I show that a character who's bits are all zeros is 
always mishandled:

    final String prefixString = "" + char_b;
    final String longerString = prefixString + u0000;

    System.out.println(prefixString);
    System.out.println(prefixString.length()); // 1

    System.out.println(longerString);
    System.out.println(longerString.length()); // 2

    System.out.println(longerString.startsWith(prefixString)); // true

    trie.put(prefixString, "prefixString");
    System.out.println(trie); // prefixString is in

    trie.put(longerString, "longerString");
    System.out.println(trie); // only longerString shown... prefixString has 
disappeared

    System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but 
should be 2
    System.out.println(trie.prefixMap(prefixString)); // {b =longerString} // 
prefixString should be here, but isn't

    trie.put(prefixString, "prefixString");
    System.out.println(trie); // prefixString is in again, but longerString has 
now disappeared

    System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but 
should be 2
    System.out.println(trie.prefixMap(prefixString)); // {b=prefixString} // 
longerString should be here, but isn't


Next, I show that if the longer string is only 1 bit longer (ignoring zeros) 
than the prefix string, then the PatriciaTree fails to include it in the prefix 
map.
Here the string would look like: 0000000001100011 1000000000000000

    final String prefixString = "" + char_c;  // you can use any character for 
the prefix, same results
    final String longerString = prefixString + u8000;

    System.out.println(prefixString);
    System.out.println(prefixString.length()); // 1

    System.out.println(longerString);
    System.out.println(longerString.length()); // 2

    System.out.println(longerString.startsWith(prefixString)); // true

    trie.put(prefixString, "prefixString");
    System.out.println(trie); // prefixString is in

    trie.put(longerString, "longerString");
    System.out.println(trie); // both are in

    System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but 
should be 2
    System.out.println(trie.prefixMap(prefixString)); // {c=prefixString} // 
longerString should be here, but isn't


And again, except flipping it so that the prefix ends with a 1, and the longer 
string starts with 1's:

    final String prefixString = "" + u0003;
    final String longerString = prefixString + u8000; // can also use uffff 
here, same result

    System.out.println(prefixString);
    System.out.println(prefixString.length()); // 1

    System.out.println(longerString);
    System.out.println(longerString.length()); // 2

    System.out.println(longerString.startsWith(prefixString)); // true

    trie.put(prefixString, "prefixString");
    System.out.println(trie); // prefixString is in

    trie.put(longerString, "longerString");
    System.out.println(trie); // both are in

    System.out.println(trie.prefixMap(prefixString).size()); // prints 1, but 
should be 2
    System.out.println(trie.prefixMap(prefixString)); // {c=prefixString} // 
longerString should be here, but isn't


The class is honestly pretty complex and I wasn't able to completely debug why 
this is behaving badly, but I believe it is because of how you are handling the 
"bitIndex" and "bitLength".
For example, in the method AbstractPatriciaTrie .subtree(), my comments in bold 
(this definitely isn't the only place with weird handling of bitIndex and 
lengthInBits):

    TrieEntry<K, V> subtree(final K prefix, final int offsetInBits, final int 
lengthInBits) {
        TrieEntry<K, V> current = root.left;
        TrieEntry<K, V> path = root;
        while(true) {
// If our bit index has only increased by 1, then our bitIndex will never get 
to be greater than lengthInBits, we could still start with the prefix, yet we 
have no way to break out of here with current being set to the longer string.
            if (current.bitIndex <= path.bitIndex || lengthInBits < 
current.bitIndex) {
                break;
            }
            path = current;
            if (!isBitSet(prefix, offsetInBits + current.bitIndex, offsetInBits 
+ lengthInBits)) {
                current = current.left;
            } else {
                current = current.right;
            }
        }
      ...


Can you also make AbstractPatriciaTrie public, and your other package level 
methods into protected level, that way I don't have to copy the entire class 
and subclasse's code out into another class just to extend it?

thank you,
Chris Duncan (github user: VEQRYN)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to