Emma Worley (2025-04-22):
> Adds a generic hash table with the DXV encoder as an initial use case.
Hi.

This code is already really good. I have some local remarks that will not require a lot of work from you and will make it better, and I have some global remarks that would require a lot of work from you and might shave a few cycles. For the rice wine of clarity, I have decided to send them separately. This is the mail about the latter.

After writing most of what follows, I thought of checking how this data structure is used in practice. What I wrote is more relevant as the size of the entries grows, but in practice the entries are very small, which means the current version is probably close to the best. Still, I wrote it, I might as well send it.

As I said, the code is already really good. I do not consider it necessary to act on what I will say here, or even to read it. But it may contain ideas to do even better, for you or somebody else, for now or for later.

This is about hash tables and algorithmics. Vocabulary as I will use it here:

Hash bucket, or bucket: integer-indexed cell where the hash function directs us to start searching for a key.

Entry: key and its associated value, as wanted by the code that uses the hash table.

Entry slot, or slot: memory that can store an entry, especially relevant if pre-allocated.

We all know the core difficulty with hash tables: different keys can map to the same hash value and therefore the same hash bucket. There are two kinds of solutions to this collision issue:

- try to find another hash bucket with an available entry slot;

- make hash buckets capable of holding multiple entries, typically by means of some kind of linked list.

In the most C-like and simplistic implementation, hash buckets and entry slots are the same, and the code, common to add, get and del, just tries the next bucket until it finds the right key or a hole. In the most Java/GLib-like implementation, hash buckets are just pointers and each entry is a separately allocated slot. (Though I am sure the efficient Java hash tables are written much more in the C style.)

The issue with the second way is that it requires an extra number (pointer/index, whatever) per bucket and per entry slot.

The first way has two problems. Clustering ruins the efficiency of the table. And deletion becomes hard, because just leaving a hole would prevent finding an entry that had to be stored after the hole compared to its ideal place.

Clustering can be reduced with a secondary hash function (or something even more subtle): instead of looking at bucket h+1, look at bucket h+h'. Make sure h' and the size of the hash table do not have common factors. But that does not solve the issue of deletions.

I was not familiar with this “Robin Hood” algorithm. From what I understand, it is a solution to both clustering and deletion. Excellent. Except it costs a number per bucket/slot. If we consider spending extra memory, we have to consider whether the linked list solution might not be more efficient.

This hash table has constant size. That makes it easy to pre-allocate all slots in a big array. It can be the same array as the buckets or a separate one.

If we keep buckets = slots and if we accept to move an entry on occasion, then we can make it work with just one number per entry/bucket. When adding a new entry, if the bucket is not empty, check if it is a collision: same hash value? If it is a collision, then take any empty bucket/slot (keeping them chained is easy) for the new entry and link it to the existing one. If it is not a collision, then take the bucket/slot for the new entry after moving the entry that was there into an empty bucket.
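To make that last part concrete, here is a rough, untested sketch of what add() could look like with a single extra index per bucket/slot. All the names and types are invented for the example, keys and values are plain integers, the free slots are kept on a lazily-cleaned chain, and deletion is left out entirely:

    #include <stdint.h>

    #define NIL (-1)

    typedef struct Entry {
        uint32_t key;
        uint32_t value;
        int      next;    /* next entry in the chain, or next free slot */
        int      used;
    } Entry;

    typedef struct Table {
        Entry *slots;     /* buckets and slots are the same array */
        int    size;      /* constant, decided at allocation time */
        int    free_head; /* head of the free-slot chain */
    } Table;
    /* init: all slots unused, slot i has next = i + 1 (last one NIL),
     * free_head = 0 */

    /* The free chain may still contain slots that were later claimed as
     * home buckets; skip them lazily (each slot is skipped at most once
     * overall, since this sketch has no deletion). */
    static int claim_free_slot(Table *t)
    {
        while (t->free_head != NIL && t->slots[t->free_head].used)
            t->free_head = t->slots[t->free_head].next;
        int i = t->free_head;
        if (i != NIL)
            t->free_head = t->slots[i].next;
        return i;
    }

    static int table_add(Table *t, uint32_t key, uint32_t value)
    {
        int h = key % t->size;
        Entry *home = &t->slots[h];

        if (!home->used) {
            /* Home bucket is empty: claim it, chain of one. */
            home->key = key; home->value = value;
            home->next = NIL; home->used = 1;
            return 0;
        }

        int spare = claim_free_slot(t);
        if (spare == NIL)
            return -1;    /* table full */

        if (home->key % t->size == (uint32_t)h) {
            /* Real collision: the occupant belongs here, so chain the
             * new entry behind it in a spare slot. */
            t->slots[spare] = (Entry){ key, value, home->next, 1 };
            home->next = spare;
        } else {
            /* The occupant was displaced here by another chain: move it
             * to the spare slot, repoint its predecessor, then take the
             * home bucket for the new entry. */
            int prev = home->key % t->size;
            while (t->slots[prev].next != h)
                prev = t->slots[prev].next;
            t->slots[spare] = *home;
            t->slots[prev].next = spare;
            home->key = key; home->value = value;
            home->next = NIL; home->used = 1;
        }
        return 0;
    }

Lookup then walks the chain starting at key % size; since a chain rooted at a bucket only ever contains keys that hash there, finding a mismatching occupant in the home bucket means the key is absent.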
What I described above is not very nice, but it is the best I know of / can find that only costs one number per bucket/slot.

If we can afford to spend more memory on it, we can do better by virtue of breaking the buckets = slots equivalence. It is especially interesting if the entries are large, because this version does not require copying entries. Also, we can have more buckets than slots, reducing collisions.

All of this is very dependent on specifics. In particular, cache locality can make a solution that seems slightly worse actually better. It would require extensive benchmarking.

Regards,

-- 
Nicolas George
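PS: for concreteness, a rough, untested sketch of the buckets != slots layout I mean; every name and size here is invented for the example, and there is no deletion either:

    #include <stdint.h>

    typedef struct Slot {
        uint32_t key, value;
        int      next;  /* next slot in the same bucket's chain, or -1 */
    } Slot;

    typedef struct Table2 {
        int  *buckets;  /* nb_buckets cells, each -1 or a slot index */
        Slot *slots;    /* nb_slots pre-allocated entries, never moved */
        int   nb_buckets; /* can exceed nb_slots to reduce collisions */
        int   nb_slots, nb_used;
    } Table2;

    static int table2_add(Table2 *t, uint32_t key, uint32_t value)
    {
        if (t->nb_used >= t->nb_slots)
            return -1;                  /* all slots taken */
        int h = key % t->nb_buckets;
        int i = t->nb_used++;           /* entries are written once and stay put */
        t->slots[i] = (Slot){ key, value, t->buckets[h] };
        t->buckets[h] = i;              /* push on the front of the bucket chain */
        return 0;
    }

Lookup walks buckets[key % nb_buckets] through the next links; large entries are never copied, at the price of one index per bucket plus one per slot.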