On Aug 23, 2010, at 10:24, Patrick Roehl wrote:

> The entries are short (4 bytes) and the key is the entire entry. My
> understanding is that in this situation a hash table has no benefit. Is
> this correct?
>
> I should emphasize that the entries coming in are random in nature, and
> all that needs to be determined is whether a new 4-byte entry coming in
> is already present or is unique to the list. This is because processing
> is performed on new entries (the entries are the key for that
> processing), but that processing should not be duplicated. Once the
> input stream has been exhausted the list of entries can be discarded.
>
> In most situations the list size would be less than a few hundred
> entries. However, there is no specified upper limit (other than running
> out of memory) and the size could potentially be a million entries or
> more.

Of course the upper limit is set by the 4-byte key length. There cannot
be more than 2**32 unique entries.

Is there a hashing technique that works well without a priori knowledge
of the list size? Otherwise, a binary tree works well. If the entries are
truly random and presented in a random order, little is gained by
balancing the tree. Nodes near the root will likely all occupy a few
cache lines, and the first few levels of a probe will happen in cache.
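For concreteness, here is a minimal sketch of such an unbalanced tree in
C. The names and the malloc()-per-node storage are my own illustrative
choices, not anything Patrick specified:

#include <stdint.h>
#include <stdlib.h>

struct node {
    uint32_t     key;
    struct node *left, *right;
};

/* Return 1 if key is already in the tree; otherwise insert it and
   return 0, so the caller knows to process this entry. */
static int seen(struct node **root, uint32_t key)
{
    struct node **p = root;

    while (*p != NULL) {
        if (key == (*p)->key)
            return 1;                 /* duplicate: already processed */
        p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
    }
    *p = malloc(sizeof **p);          /* error handling omitted */
    (*p)->key  = key;
    (*p)->left = (*p)->right = NULL;
    return 0;
}

With truly random keys the expected probe length is roughly
1.39 * log2(n) comparisons, about 28 for a million entries, so the lack
of balancing costs little.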
And, earlier:

On Aug 23, 2010, at 08:45, Patrick Roehl wrote:

> The list is only used by a single process that handles data as it
> arrives. To process correctly, it must be able to determine immediately
> if the data being presented has already been processed. When all of the
> incoming data for that process has been handled the list is discarded.

"Immediately" is an unfortunate constraint. I assume it means the key
must be processed before the next key arrives, or before waiting for the
next key.

If it were otherwise (I want to explore this, regardless), the obvious
approach would be to read all the keys into an array, sort the array in
main memory, eliminate duplicates, and process each unique key. Until
this week I believed Shell sort was a good technique for this. But the
discussion of caching has put the fear of the LoR (locality of reference)
into me, and Shell sort has lousy LoR. So: divide the array into chunks,
each of which fits comfortably in cache; sort each chunk with Shell sort
or whatever; then merge the sorted chunks into a single array. This is
the process of a classic tape sort, using as much memory as can be cached
as "main memory" and the remainder of memory as a collection of
RAM-tapes. (A sketch follows at the end of this note.)

I believe it was mentioned a while ago in IBM-MAIN (Frank Y.?) that
DFSORT's algorithms are designed to avoid paging. I wonder whether they
are also designed to optimize the cache hit ratio? Are there instructions
which perform, for cache, operations analogous to what PGLOAD and PGOUT
do for main memory? Is anything to be gained by an anticipatory fetch
from a cache line, to cause it to be brought in from main storage? If the
result of the fetch is ignored, can it proceed concurrently, without
causing a pipeline stall?
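For what it's worth, here is the promised sketch of the chunked
sort-and-merge. It is only an illustration, not DFSORT's method: the
CHUNK size, the use of qsort() in place of Shell sort, and the linear
scan for the minimum (a heap would scale better with many chunks) are all
my own assumptions:

#include <stdint.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)  /* keys per chunk; tune so a chunk fits in cache */

static int cmp32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort a[0..n-1] chunk by chunk, then merge the sorted chunks,
   calling process() once per unique key. */
static void process_unique(uint32_t *a, size_t n, void (*process)(uint32_t))
{
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    size_t *pos = calloc(nchunks, sizeof *pos);   /* read cursor per chunk */
    uint32_t last = 0;
    int have_last = 0;

    for (size_t c = 0; c < nchunks; c++) {        /* in-cache sort pass */
        size_t base = c * CHUNK;
        size_t len  = (n - base < CHUNK) ? n - base : CHUNK;
        qsort(a + base, len, sizeof *a, cmp32);   /* Shell sort "or whatever" */
    }

    for (;;) {                                    /* merge the "RAM-tapes" */
        size_t best = nchunks;                    /* chunk holding the min */
        for (size_t c = 0; c < nchunks; c++) {
            size_t base = c * CHUNK;
            size_t len  = (n - base < CHUNK) ? n - base : CHUNK;
            if (pos[c] < len &&
                (best == nchunks ||
                 a[base + pos[c]] < a[best * CHUNK + pos[best]]))
                best = c;
        }
        if (best == nchunks)
            break;                                /* every chunk exhausted */
        uint32_t key = a[best * CHUNK + pos[best]++];
        if (!have_last || key != last) {          /* drop duplicates */
            process(key);
            last = key;
            have_last = 1;
        }
    }
    free(pos);
}

And as to the anticipatory fetch: on compilers that provide GCC's
__builtin_prefetch extension, the hint can at least be expressed in C;
whether it proceeds concurrently or stalls is exactly the question above.
The lookahead distance of 16 elements is a guess, not a measurement:

#include <stddef.h>
#include <stdint.h>

/* Process a[] sequentially while hinting the hardware to start
   fetching a line roughly 16 elements ahead. */
void process_all(uint32_t *a, size_t n, void (*process)(uint32_t))
{
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);
        process(a[i]);
    }
}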
-- gil