On Monday, 4 June 2012 at 19:18:32 UTC, Dmitry Olshansky wrote:
On 04.06.2012 22:42, Roman D. Boiko wrote:
... it is possible to create persistent (immutable) tries
with efficient (O(log N)) inserting / deleting (this scenario is
very important for my DCT project). Immutable hash tables
would require O(N) copying for each insert / delete.
Aye, the good thing about them - the amount of data affected by
any change is localized to one "page". So you just copy it over
and remap indices/pointers.
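If I understand the path-copying idea correctly, a toy version of a
persistent trie insert would look roughly like this (my own sketch, not
DCT's and not Roslyn's code; keys restricted to 'a'..'z' for brevity):

// Persistent trie: insert copies only the nodes along the key's path
// (one "page" per level) and shares everything else with the old version.
final class Node
{
    bool terminal;   // a key ends at this node
    Node[26] next;   // children, indexed by letter

    this(bool terminal, Node[26] next)
    {
        this.terminal = terminal;
        this.next = next;
    }
}

// Returns the root of the new version; the old root stays valid and unchanged.
Node insert(Node root, string key)
{
    Node[26] children;              // static arrays copy by value: one "page"
    bool terminal = false;
    if (root !is null)
    {
        children = root.next;
        terminal = root.terminal;
    }
    if (key.length == 0)
        return new Node(true, children);
    immutable idx = key[0] - 'a';
    children[idx] = insert(children[idx], key[1 .. $]);
    return new Node(terminal, children);
}

Here insert(v1, "inout") returns a new root while v1 stays usable, so older
versions survive at the cost of one copied path per insert.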
Actually, Microsoft went this way (immutable hash tables) in
their Roslyn preview implementation. However, I still believe
that tries will work better here. Will check...
Would bulk pre-allocation of the memory used by the trie improve
locality? With some heuristics for copying when it goes out of
control.
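Roughly what I have in mind, with hypothetical names (not DCT code):

// Trie nodes are carved sequentially out of one contiguous slab and
// addressed by index, so a trie tends to sit in a few hot pages.
struct TrieNode
{
    bool terminal;
    uint[26] next;      // slab indices; 0 means "no child"
}

struct NodeSlab
{
    TrieNode[] nodes;
    size_t used = 1;    // slot 0 is reserved as the "no child" sentinel

    this(size_t capacity) { nodes = new TrieNode[capacity]; }

    // Hands out the next free slot; when the slab runs out, a real
    // implementation would copy everything into a bigger slab (that is
    // the "heuristic for copying" part).
    uint alloc()
    {
        assert(used < nodes.length, "slab exhausted, grow and copy here");
        return cast(uint) used++;
    }
}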
It is difficult to create a good API for fundamental data
structures, because various use cases would motivate different
trade-offs. The same is true for implementation. This is why I
like your decision to introduce policies for configuration.
Rationale and use cases should help to analyze the design of
your API and implementation, so you will get better community
feedback :)
Well I guess I'll talk in depth about them in the article, as
the material exceeds sane limits of a single NG post.
In brief:
- multiple levels are stored in one memory chunk one after
another, thus helping a bit with cache-locality (the first
level goes first)
- constructors do minimize the number of "pages" on each level
by constructing it outwards from the last level and checking
for duplicates (costs ~ O(N^2) though, IIRC)
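If I read the layout right, the lookup side would be something like the
following toy version (my own code, with an invented page size of 256):

struct TwoLevelTrie
{
    // In the real thing both levels would live back to back in one memory
    // chunk, first level first; two slices are kept here only for clarity.
    ubyte[]  level1;   // maps the high bits of a key to a page number
    ushort[] level2;   // 256-entry pages of final values, one after another

    // Branch-free lookup: two dependent loads, no comparisons.
    ushort lookup(ushort key) const
    {
        immutable page = level1[key >> 8];
        return level2[page * 256 + (key & 0xFF)];
    }
}

Deduplicating those 256-entry pages is, as far as I understand, what keeps
the last level small when many key ranges share the same values.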
So this price is paid only on construction, right? Are there
alternatives to choose from (when needed)? If yes, which?
- I learned the hard way not to introduce extra conditionals
anywhere, so there is no "out of range, max index, not
existent" crap; in all cases it's a clean-cut memory access.
Extra bits lost on having at least one "default" page per level
can be saved by going an extra level
Could you please elaborate? How do you handle situations where
"not existent", etc., is needed?
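My current guess, as a sketch with made-up code (please correct me): key
ranges that hold nothing are pre-filled with a sentinel value, identical
pages are detected during construction and shared, so the "all sentinel"
default page is stored only once and a miss comes out of an ordinary
memory access with no extra branch.

// Sentinel meaning "nothing here"; the caller interprets it, the trie
// itself never branches on it. Unused key ranges in `values` are assumed
// to be pre-filled with it, so the all-notFound page is shared like any
// other duplicate.
enum ushort notFound = 0;

// Cuts the value array into 256-entry pages, reusing a page whenever an
// identical one was already emitted. The linear scan over emitted pages
// per new page is where the ~O(N^2) construction cost shows up.
// Assumes at most 256 distinct pages and values.length % 256 == 0.
void build(const ushort[] values, out ubyte[] level1, out ushort[] level2)
{
    enum pageSize = 256;
    assert(values.length % pageSize == 0);
    level1 = new ubyte[values.length / pageSize];

    foreach (slot; 0 .. level1.length)
    {
        const page = values[slot * pageSize .. (slot + 1) * pageSize];
        size_t found = size_t.max;
        foreach (p; 0 .. level2.length / pageSize)      // scan emitted pages
            if (page == level2[p * pageSize .. (p + 1) * pageSize])
            {
                found = p;
                break;
            }
        if (found == size_t.max)
        {
            found = level2.length / pageSize;
            level2 ~= page;                             // emit a new page
        }
        level1[slot] = cast(ubyte) found;
    }
}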
Your examples deal with lookup by the whole word (first/last
characters and length are needed). Are your API and
implementation adaptable for character-by-character trie
lookup?
I would say that one by one won't help you much since the speed
is almost the same if not worse.
I guess in general your statement is true, especially because a
known length could improve speed significantly. I'm not sure
(but can easily believe) that it is true in my particular
situation. For example, one-by-one lookup would allow ignoring
the key encoding (and thus using multiple encodings
simultaneously just as easily as a single one).
The problematic thing with one by one - say you want to stop
early, right?
Why? I'd like to lex inout as TokenKind.InOut, not TokenKind.In
followed by TokenKind.Out. Did I misunderstand your question?
Now you have to check the *NOT FOUND* case, and that implies
extra branching (if(...)) on _each level_ and maybe reusing
certain valid values as "not found" marker (extra
configuration?).
This should be easy: if something is not a keyword, it is likely
an identifier. But I agree in general, and probably even in my
case.
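To make my use case concrete, here is roughly what I'd want
character-by-character lookup to do (a toy sketch restricted to lowercase
ASCII keywords; the names are invented, not DCT's current API). Slot 0 is a
dead node whose edges all point back to itself and whose kind is
Identifier, so walking off the keyword set needs no per-level "not found"
check:

enum TokenKind { Identifier, In, Out, InOut /* ... */ }

struct KeywordNode
{
    TokenKind kind = TokenKind.Identifier; // Identifier also means "no keyword ends here"
    uint[26] next;                         // 0 points back at the dead node
}

// word is the already-scanned identifier text (maximal munch happens before
// this call, which is why "inout" ends up as InOut and not In + Out).
TokenKind identifierOrKeyword(const KeywordNode[] trie, const char[] word)
{
    uint node = 1;                         // slot 1 is the root
    foreach (c; word)
        node = trie[node].next[c - 'a'];   // one load per character, no test
    return trie[node].kind;
}

Any lowercase word that isn't a keyword just drifts onto the dead node and
comes back as Identifier.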
Branching and testing are the things that kill the speed
advantage of Tries, the ones I overlooked in my previous
attempt, see std/internal/uni.d.
The other is keeping data and index in separate locations;
pointer-happy, disjoint node (page) locality is another form of
the same fault.
This concern has been disturbing me for some time already, and
is slightly demotivating, because implementing something that
way would likely lead to failure. I don't have enough experience
with the alternatives to know their advantages and trade-offs.
I'll check your code. I did plan to try table lookup instead of
branching. I guess making my own mistakes is necessary anyway.
Will compile-time generation of lookup code based on tries be
supported? The example currently in DCT (first implemented by
Brian Schott in his Dscanner project) uses switch statements
(which means lookup is linear in the number of possible
characters at each position).
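For reference, the generated code has roughly this shape (heavily cut down,
reusing the toy TokenKind from the sketch above; the real Dscanner/DCT code
covers every keyword and nests a switch per character position):

TokenKind keywordKind(const char[] word)
{
    // Dispatch on the first character, then compare the rest; the generated
    // code keeps switching per position instead of these comparisons.
    switch (word.length > 0 ? word[0] : '\0')
    {
    case 'i':
        if (word == "in")    return TokenKind.In;
        if (word == "inout") return TokenKind.InOut;
        return TokenKind.Identifier;
    case 'o':
        return word == "out" ? TokenKind.Out : TokenKind.Identifier;
    default:
        return TokenKind.Identifier;
    }
}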
Nope, the days of linear lookup for switch are over (were there
even such days?). The compiler always does binary search nowadays
if linear isn't more efficient, e.g. for a small number of
values. (It even weighs which is better and uses a combination
of them.)
I thought this *might* be the case, but didn't know and hadn't
checked anywhere. I also wanted to do linear search for some
empirically chosen small number of items.
However, you'd better check the asm code afterwards. The
compiler is like a nasty stepchild: it will give up on
generating good old jump tables for any reason it finds
justifiable. (But it may use a few small jump tables + binary
search, which could be fine... if not in a tight loop!)
Thanks.
A trivial improvement might be using if statements and binary
lookup. (E.g., if there are 26 possible characters used at some
position, use only 5 comparisons, not 26.)
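Concretely, the per-position dispatch I mean would be something like this
toy helper (invented name, not existing code):

// Finds where c would sit among the characters that can actually occur at
// this position: at most ceil(log2(26)) = 5 comparisons instead of up to
// 26 in a linear case chain. The caller still checks for equality.
size_t binaryFind(const char[] sortedChars, char c)
{
    size_t lo = 0, hi = sortedChars.length;
    while (lo < hi)
    {
        immutable mid = (lo + hi) / 2;
        if (sortedChars[mid] < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;   // index of the first character >= c
}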
Moreover, you'd be surprised, but such leap-frog binary search
loses by a big margin to a _multilayer_ lookup table. (I for one
was quite shocked back then.)
Thanks again :) Are any numbers or perf. tests available?
I wanted to analyse your regex implementation, but that's not an
easy task and requires a lot of effort...
Yeah, sorry for some encrypted Klingon here and there ;)
It looks like the most promising alternative to the binary trie
lookup which I described in the previous paragraph. Similarities
and differences with your regex design might also help us
understand tries better.
Well if you are in no rush you might as well just take my
latest development in the ways of Trie from my gsoc fork :)
Skipping some gory days of try and trial (fail) in the process
;)
OK