Re: fast MMD as Hamming distance over hypercube graph

Allison Randal Sun, 29 Nov 2009 06:36:10 -0800

Andrew Whitworth wrote:


The cost is in setting up the hypercube graph for the inheritance hierarchy
of a particular class. Need to dig into it some more to see how cheap that
could be. The graph could at least be cached as a set of integer "parent
ids" in the child class after it's constructed the first time.


This does sound like a pretty neat idea, but I share your worry that
generating the hypercube is going to be expensive. Combine that with
the fact that we would have to completely recompute the hypercube
every time we modified the hierarchy at all, and we couldn't lazily
reuse hypercubes computed for class B when we assign B as a parent of
class A, etc. Of course, creating new classes and modifying existing
hierarchies is likely an infrequent-enough operation for most cases
that the expense will be amortized by fast subsequent lookups.

Actually, I'm not so much concerned that generating the hypercube will
be expensive, more that it'll be a choice between "cheaper than what we
have now", "cheap", and "can we push it hard to make it really cheap?".
But, I haven't spent much time on it yet. A simple first take on how to
do it would be:


- Set the child ID to 0 (i.e. 0000000000)

- Iterate over the immediate parents.
  - Set each parent ID to one bit different from child
    (i.e. 0000000001, 0000000010, 0000000100, 0000001000)
  - Recursively iterate over parents of parents
    - Set parent-of-parent ID to one bit different from parent
      (i.e. 0000001001, 0000001010, 0000001100)
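
A minimal Python sketch of the steps above, for illustration only. It
assumes one fresh bit per newly reached ancestor (the example IDs above
reuse low bits across levels, which this does not), and it walks the
parents breadth-first so that a class reachable by two paths keeps the
shorter one. The names assign_parent_ids, hamming, and the sample
hierarchy are hypothetical, not anything in Parrot.

```python
from collections import deque

def assign_parent_ids(child, parents_of):
    """Give each ancestor of `child` an integer ID whose Hamming distance
    from the child's ID (0) equals its shortest inheritance depth."""
    ids = {child: 0}
    next_bit = 0
    queue = deque([child])
    while queue:
        cls = queue.popleft()
        for parent in parents_of.get(cls, ()):
            if parent in ids:
                continue  # already reached by a path at least as short
            # One bit different from the class we arrived from.
            ids[parent] = ids[cls] | (1 << next_bit)
            next_bit += 1
            queue.append(parent)
    return ids

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Hypothetical diamond: Child -> A, B -> Base
hierarchy = {"Child": ["A", "B"], "A": ["Base"], "B": ["Base"]}
ids = assign_parent_ids("Child", hierarchy)
```

Here hamming(ids["Child"], ids["A"]) is 1 and
hamming(ids["Child"], ids["Base"]) is 2, matching the inheritance depth.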

Could be simpler if you don't bother with unique IDs and just mark the
level (i.e. immediate parents are all 0000000001, next level is all
0000000011, etc). There are some trade-offs there. Using an integer
means we're limited to the size of the integer, 4,294,967,295 unique
parents or 32 levels in a 32-bit integer.
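
The level-marking variant can be sketched the same way. This is only an
illustration of the bit pattern described above; level_id and depth_of
are hypothetical helper names.

```python
def level_id(depth):
    """Mask with `depth` low bits set: 0b1 for immediate parents,
    0b11 for the next level up, and so on."""
    return (1 << depth) - 1

def depth_of(mask):
    """Recover the level from a mask; this is its Hamming distance
    from the child's ID of 0."""
    return bin(mask).count("1")

# A 32-bit integer runs out after 32 levels.
deepest_32bit = level_id(32)
```

The trade-off is visible here: depth_of(level_id(n)) returns n for any
level, but two different parents at the same level get the same value,
so the mask gives the distance and nothing else.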

We already invalidate some caches in the Class on modification, so that
part isn't difficult. Some choices would need to be made on how to store
the cache of IDs in the child Class for: the smallest possible storage,
the quickest possible lookup, and easily invalidated (so, likely all in
one place, perhaps a struct pointer in the PMC struct). And low-level
types can't be modified anyway, so one hypercube graph for all of them
could be generated at compile-time.
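
As a rough sketch of the "all in one place" idea: keep the whole ID
table in a single slot on the class, build it on first use, and drop it
when the class is modified. ClassInfo below is a hypothetical stand-in,
not Parrot's Class PMC, and it uses the simpler level-mask IDs.

```python
class ClassInfo:
    """Hypothetical class object with a lazily built parent-ID cache."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self._parent_ids = None  # the entire cache lives in this one slot

    def add_parent(self, parent):
        self.parents.append(parent)
        self._parent_ids = None  # modifying the class drops the cache

    def parent_ids(self):
        if self._parent_ids is None:
            self._parent_ids = self._build_ids()
        return self._parent_ids

    def _build_ids(self):
        # Walk the hierarchy level by level; each level gets one more bit.
        ids = {self.name: 0}
        level, depth = [self], 0
        while level:
            depth += 1
            next_level = []
            for cls in level:
                for parent in cls.parents:
                    if parent.name not in ids:
                        ids[parent.name] = (1 << depth) - 1
                        next_level.append(parent)
            level = next_level
        return ids

base = ClassInfo("Base")
a = ClassInfo("A", [base])
c = ClassInfo("C", [a])
first = c.parent_ids()           # built once, then reused
c.add_parent(ClassInfo("Mixin"))  # invalidates c's cache
second = c.parent_ids()          # rebuilt on the next lookup
```

This sketch only invalidates the class that was modified. A change to a
parent would also have to invalidate the caches of its descendants,
which is not shown here.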

How does Parrot currently calculate the Manhattan distance between a
child and parent type?

Slowly. The horror is mmd_distance in src/multidispatch.c (called by
Parrot_mmd_sort_candidates). That function and many of the functions it
calls are remnants from the old MMD system, left in place for backward
compatibility.

It iterates through integer type arrays of both the call and the
multisub (first creating the arrays if they don't exist), performs a
series of manual checks on the types (which it repeats every time a
dispatch is made on a particular type), and iterates through all the
parents of each call argument (which it repeats every time a dispatch is
made).

So, the step of "iterating over the parents" has to be done anyway. The
hypercube graph adds some bit math, plus cache storage and retrieval.
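
To make the "bit math plus cache retrieval" concrete, here is a sketch
of a dispatch-time distance computed from cached IDs. It is not how
mmd_distance works today; the type names and the id_cache layout (one ID
table per argument class, as built when that class was first set up) are
assumptions for illustration.

```python
def popcount(n):
    """Number of set bits in n."""
    return bin(n).count("1")

def cached_distance(arg_types, param_types, id_cache):
    """Sum the per-argument Hamming distances between each call argument
    and the candidate's parameter type. Returns None when a parameter
    type is not among the argument's ancestors."""
    total = 0
    for arg, param in zip(arg_types, param_types):
        ids = id_cache[arg]   # cache retrieval, no parent walk
        if param not in ids:
            return None       # candidate does not apply
        total += popcount(ids[arg] ^ ids[param])  # the bit math
    return total

# Hypothetical per-class ID tables.
id_cache = {
    "Integer": {"Integer": 0, "Scalar": 0b01, "Object": 0b11},
    "String": {"String": 0, "Object": 0b1},
}
near = cached_distance(["Integer", "String"], ["Scalar", "Object"], id_cache)
far = cached_distance(["Integer", "String"], ["Object", "Object"], id_cache)
none = cached_distance(["String"], ["Integer"], id_cache)
```

With these tables the first candidate scores 2, the second scores 3, and
the third is rejected, so sorting candidates reduces to table lookups
and XOR/popcount per argument.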

Bonus points if we can eliminate most of the manual checks (building
core types into the hypergraph) and eliminate creating fixed integer
arrays for the sub and the call.

Some relevant work, there's likely more to be found:
http://portal.acm.org/citation.cfm?id=301378


/me really needs to get a proper ACM subscription eventually to read
all the cool papers there.

Any chance I could get a copy of this one?

Yes, ACM has a sane policy about redistribution of papers. I'll email
you a copy. (Also to anyone else who wants it.)


Allison
_______________________________________________
http://lists.parrot.org/mailman/listinfo/parrot-dev
