On Aug 23, 2010, at 10:24, Patrick Roehl wrote:

> The entries are short (4 bytes) and the key is the entire entry.  My
> understanding is that in this situation a hash table has no benefit.  Is
> this correct?
>
> I should emphasize that the entries coming in are random in nature and all
> that is needed to be determined is if a new 4-byte entry coming in is
> already present or is unique to the list.  This is because processing is
> performed on new entries (the entries are the key for that processing), but
> that processing should not be duplicated.  Once the input stream has been
> exhausted the list of entries can be discarded.
>
> In most situations the list size would be less than a few hundred entries.
> However, there is no specified upper limit (other than running out of
> memory) and the size could potentially be a million entries or more.
>
Of course the upper limit is set by the 4-byte key length.  There
cannot be more than 2**32 unique entries.

Is there a hashing technique that works well without a priori
knowledge of the list size?
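
One answer, I think, is open addressing with a table that simply
doubles whenever it reaches half full, so no size estimate is
needed up front.  A minimal, untested sketch in C (all of the
names are mine; the table capacity is assumed a power of two):

    #include <stdint.h>
    #include <stdlib.h>

    /* Open-addressed hash set of 4-byte keys.  Doubles when half
       full, so no a priori knowledge of the list size is needed. */
    typedef struct {
        uint32_t *keys;
        uint8_t  *used;   /* flag array, since 0 is a valid key */
        size_t    cap;    /* always a power of two */
        size_t    count;
    } set_t;

    static int set_insert(set_t *s, uint32_t key);

    static void set_init(set_t *s, size_t cap) {
        s->keys  = malloc(cap * sizeof *s->keys);
        s->used  = calloc(cap, 1);
        s->cap   = cap;
        s->count = 0;
    }

    /* Rehash every occupied slot into a table twice the size. */
    static void set_grow(set_t *s) {
        set_t big;
        set_init(&big, s->cap * 2);
        for (size_t i = 0; i < s->cap; i++)
            if (s->used[i])
                set_insert(&big, s->keys[i]);
        free(s->keys);
        free(s->used);
        *s = big;
    }

    /* Returns 1 if key was already present, 0 if it is new. */
    static int set_insert(set_t *s, uint32_t key) {
        if (s->count * 2 >= s->cap)
            set_grow(s);
        /* Knuth's multiplicative hash, then linear probing. */
        size_t i = (key * 2654435761u) & (s->cap - 1);
        while (s->used[i]) {
            if (s->keys[i] == key)
                return 1;                    /* duplicate */
            i = (i + 1) & (s->cap - 1);
        }
        s->used[i] = 1;
        s->keys[i] = key;
        s->count++;
        return 0;                            /* new entry */
    }

Start it with set_init(&s, 256) or any other power of two; the
doubling keeps the load factor below 1/2 and probe chains short.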

Otherwise, a binary tree works well.  If the entries are truly
random and presented in a random order, little is gained by
balancing the tree.  Nodes near the root will likely all occupy
a few cache lines and the first few levels in a probe will
happen in cache.
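
For concreteness, a minimal sketch of such an unbalanced tree in
C (names mine); insert reports whether the key has been seen:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node {
        uint32_t     key;
        struct node *left, *right;
    } node_t;

    /* Insert key into the (unbalanced) tree rooted at *root.
       Returns 1 if the key was already present, 0 if it is new.
       Random keys in random order keep the tree roughly balanced. */
    static int tree_insert(node_t **root, uint32_t key) {
        while (*root) {
            if (key == (*root)->key)
                return 1;                    /* already processed */
            root = key < (*root)->key ? &(*root)->left
                                      : &(*root)->right;
        }
        node_t *n = malloc(sizeof *n);
        n->key  = key;
        n->left = n->right = NULL;
        *root = n;
        return 0;                            /* new entry */
    }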

And, earlier:

On Aug 23, 2010, at 08:45, Patrick Roehl wrote:
>
> The list is only used by a single process that handles data as it
> arrives.  To process correctly, it must be able to determine immediately
> if the data being presented has already been processed.  When all of the
> incoming data for that process has been handled the list is discarded.
>
"immediately" is an unfortunate constraint.  I assume it means
the key must be processed before the next key arrives, or before
waiting for the next key.

If it were otherwise (I want to explore this, regardless),
the obvious approach would be to read all the keys into an
array, sort the array in main memory, eliminate duplicates,
and process each unique key.
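
Roughly, with the library qsort() standing in for whatever sort
one prefers, and process() hypothetical (the caller's real work
on each unique key):

    #include <stdint.h>
    #include <stdlib.h>

    extern void process(uint32_t key);   /* hypothetical */

    static int cmp_u32(const void *a, const void *b) {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);        /* avoids overflow of x - y */
    }

    /* keys[0..n-1] holds every key read from the input stream.
       Sort, then process each key that differs from its predecessor. */
    static void process_unique(uint32_t *keys, size_t n) {
        qsort(keys, n, sizeof *keys, cmp_u32);
        for (size_t i = 0; i < n; i++)
            if (i == 0 || keys[i] != keys[i - 1])
                process(keys[i]);
    }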

Until this week, I believed Shell sort was a good technique.
But the discussion of caching has put fear of the LoR (locality
of reference) into me, and Shell sort has lousy LoR.  So, divide
the array
into chunks, each of which fits comfortably in cache, sort
each chunk with Shell sort or whatever, then merge the
sorted chunks into a single array.  This is the scheme of
a classic tape sort, with the memory that fits in cache
playing the role of main memory and the rest of memory
serving as a collection of RAM-tapes.
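
In outline (CHUNK is a guess at what fits comfortably in cache,
and the merge is a naive linear scan over the chunk heads where
a real sort would use a tournament tree or a heap):

    #include <stdint.h>
    #include <stdlib.h>

    #define CHUNK 65536   /* keys per chunk: 256KiB, a guess */

    static int cmp_u32(const void *a, const void *b) {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* Sort keys[0..n-1] into out[0..n-1]: qsort each cache-sized
       chunk in place, then merge the sorted chunks -- a tape sort
       with the chunks playing the role of RAM-tapes. */
    static void chunk_sort(uint32_t *keys, uint32_t *out, size_t n) {
        size_t nchunks = (n + CHUNK - 1) / CHUNK;
        size_t *head = malloc(nchunks * sizeof *head);  /* next unread */
        size_t *end  = malloc(nchunks * sizeof *end);   /* one past last */

        for (size_t c = 0; c < nchunks; c++) {
            head[c] = c * CHUNK;
            end[c]  = head[c] + CHUNK < n ? head[c] + CHUNK : n;
            qsort(keys + head[c], end[c] - head[c],
                  sizeof *keys, cmp_u32);
        }
        for (size_t i = 0; i < n; i++) {     /* the merge pass */
            size_t best = nchunks;           /* "none chosen yet" */
            for (size_t c = 0; c < nchunks; c++)
                if (head[c] < end[c] &&
                    (best == nchunks ||
                     keys[head[c]] < keys[head[best]]))
                    best = c;
            out[i] = keys[head[best]++];
        }
        free(head);
        free(end);
    }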

I believe it was mentioned a while ago in IBM-MAIN (Frank
Y.?) that DFSORT's algorithms are designed to avoid paging.
I wonder if they're also designed to optimize cache hit
ratio?

Are there instructions which perform, for cache, operations
analogous to what PGLOAD and PGOUT do for main memory?
Is anything to be gained by an anticipatory fetch from a
cache line, to cause the line to be brought in from main
storage?  If the result of the fetch is ignored, can it
proceed concurrently, without causing a pipeline stall?

-- gil
