@jackhftang - that link you shared is literally exactly `adix/ditab`. I didn't 
think it was similar to lrucache, but I just was not sure what @b3liever really 
wanted until he shared that entity component system (ECS) skypjack link; both 
seem to be about video games.

Direct indexing is neat and all, but honestly it can be less performant than 
hashing "in practice", even in the limited key space settings where it applies. 
Also, it is almost identical to compact-ordered hashing (like @LemonBoy's 
[compactdict](https://github.com/LemonBoy/compactdict) or the ordered mode of 
`adix/lptabz`), just going directly to a sparse slot index instead of going via 
a hash(). (This is why I think "direct indexing" is a better name than 
"sparse-dense".) I am pretty sure the fixed-cost "init/re-init" is even 
achievable with the compact-ordered style if it really matters, at the cost of 
some extra work on each random access. Hashing can be attacked, but that is not 
common in the video game space.
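
To make the comparison concrete, here is a minimal sparse-dense/direct index 
sketch (my own illustration, not the actual `adix/ditab` API): lookup goes 
key -> sparse slot array -> packed dense array, exactly where compact-ordered 
hashing would instead go via a hash() first.

```nim
# Minimal sparse-dense (direct index) table sketch for small non-negative
# integer keys; illustrative only, not the adix/ditab API.
type
  SparseDense[V] = object
    sparse: seq[int]        # key -> index into `dense`; -1 means absent
    dense: seq[(int, V)]    # packed (key, value) pairs, insertion ordered

proc initSparseDense[V](keyLimit: int): SparseDense[V] =
  result.sparse = newSeq[int](keyLimit)   # fixed-cost init over the key space
  for i in 0 ..< keyLimit: result.sparse[i] = -1

proc `[]=`[V](t: var SparseDense[V], key: int, val: V) =
  let i = t.sparse[key]
  if i >= 0:                              # key present: overwrite in place
    t.dense[i][1] = val
  else:                                   # new key: append & record its slot
    t.sparse[key] = t.dense.len
    t.dense.add((key, val))

proc contains[V](t: SparseDense[V], key: int): bool =
  t.sparse[key] >= 0

proc `[]`[V](t: SparseDense[V], key: int): V =
  t.dense[t.sparse[key]][1]               # 2 array hops; no hash() anywhere

when isMainModule:
  var t = initSparseDense[string](100)
  t[42] = "answer"
  echo t[42], " ", (7 in t)               # -> answer false
```

The dense array is insertion ordered just like a compact-ordered hash table's; 
only the way you find the slot differs.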

The one other advantage you might see for direct indexing is that its average 
and worst case per-element time costs are the same. That _sounds_ a lot better 
than a hash table's expected worst case of ~log(table size) per element. But 
for big tables of tiny objects you can fit many per cache line, and that cache 
load dominates lookup time. So, worst case random access for the hash table can 
be more like 2x the time cost of the average (especially if you have Robin Hood 
re-org activated), not a factor of log(N) more. What's more, in the non-compact 
case, hashing can achieve just 1 cache line fetch almost all the time, while 
direct indexing will usually take 2 cache loads. `adix/althashes.hashRoMu1()` 
is also so fast as to be almost free. So, depending upon scale/features, it is 
easy to imagine linearly probed hashing being up to 2x faster than direct 
indexing, only insignificantly more variable, and possibly much more memory 
efficient. I think this is a situation where naive "big O" analysis can give 
misleading expectations.
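
For the cache-line arithmetic behind that 1-fetch vs 2-fetch claim, a 
back-of-envelope sketch (the 64B line and 16B slot sizes are just illustrative 
assumptions, not measurements):

```nim
# Back-of-envelope numbers for why linear probing is cache friendly; the
# sizes below are assumptions for illustration only.
const
  cacheLine = 64                          # bytes per cache line on most CPUs
  slotSize  = 16                          # e.g. 8-byte key + 8-byte value
  slotsPerLine = cacheLine div slotSize   # 4 slots scanned per memory fetch

echo "slots per cache line: ", slotsPerLine
# A linear probe over up to `slotsPerLine` consecutive slots often costs the
# same single cache fetch as a direct hit, while direct indexing pays one
# fetch for the sparse index array plus another for the dense data array.
```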

Direct indexing/sparse-dense/etc. mostly seems useful when distrusting 
"probabilistic" aspects of hashing. (I put several mitigations for such 
probabilistic aspects in `lptabz`, but nothing is perfect.)

* * *

@b3liever. I watched that video. A regular Nim `Table` with integer keys has 
almost none of the drawbacks the speaker mentioned to motivate "slot maps" 
(similar to but distinct from direct indexing) over C++ STL hash tables (and 
`adix/lptabz` has even fewer). I am not 100% sure I get the ABA problem, but 
you could probably do that "generation counter in the key" stuff with a 
`distinct int` and a regular Nim `Table` and maybe a custom 
`proc hash`/`proc ==`. The value could just be a common index into N parallel 
arrays so that you can do SoA (Struct of Arrays) SIMD vectorization later. 
This at least seems important to the author of the entity-component-system 
(ECS) links you sent. And at least I _think_ that is what he means by "SoA" - 
I couldn't find him saying so explicitly anywhere.
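
In case it helps, here is roughly the shape I have in mind, as a hedged 
sketch; all names (`EntityKey`, `spawn`, `positions`, ...) are mine, not from 
the video or the ECS links:

```nim
# Sketch of "generation counter in the key" + SoA with a plain Nim Table.
# All names here are invented for illustration.
import std/tables, std/hashes

type
  EntityKey = distinct uint64   # low 32 bits: id, high 32 bits: generation

proc hash(k: EntityKey): Hash {.borrow.}      # reuse uint64's hash
proc `==`(a, b: EntityKey): bool {.borrow.}   # and its equality

proc entityKey(id, gen: uint32): EntityKey =
  EntityKey((uint64(gen) shl 32) or uint64(id))

var
  index = initTable[EntityKey, int]()  # key -> common row index
  positions: seq[float32]              # N parallel arrays (SoA) sharing that
  healths: seq[int32]                  # row index; SIMD-friendly later

proc spawn(id, gen: uint32; pos: float32; hp: int32) =
  index[entityKey(id, gen)] = positions.len
  positions.add pos
  healths.add hp

when isMainModule:
  spawn(1, 0, 3.5'f32, 100)
  spawn(2, 0, 7.0'f32, 50)
  let k = entityKey(1, 0)
  if k in index:          # a stale key (old generation) would simply miss
    echo positions[index[k]], " ", healths[index[k]]
```

The `{.borrow.}` lines just reuse `uint64`'s `hash`/`==`, so the `distinct` 
type costs nothing at run time.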

TL;DR: I'd bet you could probably get by with a regular Nim `Table` to start. 
Could be wrong. Just trying to help/maybe simplify things.
