Cross-posting from my GSOC list as I think we had a lack of D rox posts lately :)

I rarely am surprised by generic code flexibility and power these days.
But once I fleshed out last bits of generalized Trie I was amazed.

In short: it allows something like "make a multistage integer lookup table using these x upper bits, these y bits in the middle and those last z bits as final offset" as one-liner. Better yet strings and other keys are just as easy, and even better it's up to user to allow custom ways of getting X bits for any key. And while I writing this bit-packing is on it's way to get integrated thus obtaining ultra-small tables (we can pack even index entries since we know how much bits they take exactly).

And last but no least - it's easy to strap Trie on top of some other container (say arrays, sets, hashtable etc.). (currently it doesn't play nice with GC-ed objects, I'll fix it later) Say book catalog that goes by first latter (or two) as Trie stage then as final slot uses sorted array. The advantages of such schemes is that re-sorting huge arrays can get costly, moreover insertion can reallocate and that might even run out of memory while duping it. On the other hand smaller array keep sorting bounded at acceptable level.

(In fact hashtable is a degenerate case of 1-level Trie on top of e.g. List, just not space-optimized and with a weird index transform).

To give you a taste of this power, consider a simple example - keyword recognizer. I'm showing two version first just checks if it's a keyword. Second one does catalogs it to specific kind.
(sample material taken from DCT project, thanks goes to Roman Boiko)

enum keywords = [
            "abstract",
            "alias",
            "align",
            //...  all of them, except @ ones
            "__traits"
    ];

//here is bare minumum set type usable for Trie
 struct SmallMap(size_t N, V, K)
    {
        void insert(Tuple!(V, K) t){ _set.insert(t); }

        V opBinaryRight(string op, T)(T key)
            if(op == "in")
        {
            auto idx = map!"a[1]"(_set.items[]).countUntil(key);
            return idx < 0 ? V.init : _set.items[idx][0];
        }
        private:
            SmallSet!(N, Tuple!(V, K)) _set;
    }
//define how we get bits for our index
   size_t useLength(T)(T[] arr)
    {
return arr.length > 63 ? 0 : arr.length; //need max length, 64 - 6bits
    }

    template useLastItem(T)
    {
        size_t entity(in T[] arr){ return arr[$-1]; }
        enum bitSize = 8*T.sizeof;
    }
    //... useItemAt is similar


//now the usage:
auto keyTrie = Trie!(SetAsSlot!(SmallSet!(2,string))
                         , string
                         , assumeSize!(6, useLength)
                         , useItemAt!(0, char)
                         , useLastItem!(char)
                        )(keywords);

foreach(key; keywords)
//opIndex gives us a slot, here slot is set so test it with 'in'
        assert( key in keyTrie[key]);


And that's it. Importantly there is not a single bit of hardcoding here:
- we choose to make Trie of sets by passing SetAsSlot for value type (simple Trie may just use say bool)
    - key type is string oviously
- assumeSize takes a user define functor and "trusts" user that it is result indeed fits into X bit wide integer - the the table as constructed goes by levels: by length -> by first char -> by last char -> <small set>

As you can observe "alias" and "align" do collide after 3 levels passed, thus we go for fixed 2-slot sets at the end to resolve it and a few others. Of course there are better ways to handle level that would prevent this "collision" and reduce size, it's just to show how Trie on top of Set works. (not to mention all keywords are ASCII alphas, hinting on better compression in say 6bits)

And here is how it can work with Maps as slot (the above was mapping to booleans in fact):

auto keywordsMap = [
            "abstract" : TokenKind.Abstract,
            "alias" : TokenKind.Alias,
            "align" : TokenKind.Align,
            //... all others, cataloged per their kind
            "__thread" : TokenKind._Thread,
            "__traits" : TokenKind._Traits,
    ];

//now the usage:
auto keyTrie = Trie!(SetAsMap!(SmallMap!(2, TokenKind, string), TokenKind, string) //duplication is because not every Map type Key-Value can be derived (we have no standard protocol for AAs)
                         , string
                         , assumeSize!(6, useLength)
                         , useItemAt!(0, char)
                         , useLastItem!(char)
                        )(keywordsMap);

 foreach(k,v; keywordsMap)
        assert((k in keyTrie2[k]) == v);

And the only big change is the data type :

struct SmallMap(size_t N, V, K)
    {
        void insert(Tuple!(V, K) t){ _set.insert(t); }

        V opBinaryRight(string op, T)(T key)
            if(op == "in")
        {
            auto idx = map!"a[1]"(_set.items[]).countUntil(key);
            return idx < 0 ? V.init : _set.items[idx][0];
        }
        private:
            SmallSet!(N, Tuple!(V, K)) _set;
    }


P.S. Sorry that the code for Trie itself isn't presented, but you can find the the whole thing in my Phobos fork:
https://github.com/blackwhale/phobos/blob/gsoc-uni/std/uni.d

--
Dmitry Olshansky

Reply via email to