On 1/26/2016 2:51 PM, Scotty C wrote:
gneuner2 (george), you are over thinking this thing. my test data of 1 gb is
but a small sample file. i can't even hash that small 1 gb at the time of data
creation. the hashed data won't fit in ram. at the time i put the redundant
data on the hard drive, i do some constant time sorting so that the redundant
data on the hard drive is contained in roughly 200 usefully sorted files. some
of these files will be small and can be hashed with a single read, hash and
write. some will be massive (data won't fit in ram) and must be split further.
this produces another type of single read, hash and write. these split
files can now be fully hashed which means a second read, hash and write.
recombining the second level files is virtually instantaneous (copy-port)
relative to the effort spent to get to that point. all of these operations are
constant time. it would be nice to cut into that big fat hard drive induced C
but i can't do it with a single read and write on the larger files.
I reject the term "over thinking".
You may have created something that works, but that doesn't mean it
can't be improved.  Remember that programs running in just tens of
kilobytes routinely processed data files containing tens of megabytes.
Modern programmers have forgotten - or never knew - how to work with
very large data sets that don't necessarily fit into memory.
For example: it matters greatly what is being hashed and how. With
Racket hashes, using a string key retains the string for comparison - so
hashing a file containing a large percentage of unique strings produces
a very large hash ... in the worst case the hash may be bigger than the
data file. If the strings are suitably long, then, e.g.,
crypto-digesting them down to ~20..32 byte values and hashing digests
instead of strings might allow the hash to be memory resident or let you
process much larger files using the same amount of memory.
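
Here is a rough sketch of the digest-key idea (the names record-key and
filter-duplicates are mine, not anything from your code; it assumes one
record per line and whole-line comparison):

#lang racket
(require file/sha1)

;; key a record by its 20-byte SHA-1 digest instead of by the string itself,
;; so the resident table costs roughly 20 bytes plus overhead per unique
;; record no matter how long the records are
(define (record-key rec)                   ; rec : string
  (sha1-bytes (open-input-string rec)))

;; copy lines from `in` to `out`, dropping any line whose digest was seen before
(define (filter-duplicates in out)
  (define seen (make-hash))                ; digest bytes -> #t
  (for ([line (in-lines in)])
    (let ([k (record-key line)])
      (unless (hash-ref seen k #f)
        (hash-set! seen k #t)
        (displayln line out)))))

The chance of two distinct records sharing a SHA-1 digest is negligible at
these file sizes; if it still worries you, use a longer (e.g. 32-byte) digest.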
If the records are sorted on the field(s) that may be duplicated, then
by splitting the file and using multi-file processing techniques
duplicates can be filtered from all the sections without hashing and in
a single pass through the whole of the data - leaving a single sorted
file as output. If your input data is unordered, then as you split the
file you sort each section individually with an in-memory sort. The
re-merge step - which is where the filtering is performed - sorts the
output.
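
For the unordered case, the split step might look something like this (a
sketch only, again assuming one record per line and plain string ordering;
the "run-N.txt" naming is just for illustration):

#lang racket

;; read up to n lines from `in` (fewer at end of file)
(define (read-chunk in n)
  (let loop ([i 0] [acc '()])
    (if (= i n)
        (reverse acc)
        (let ([line (read-line in)])
          (if (eof-object? line)
              (reverse acc)
              (loop (add1 i) (cons line acc)))))))

;; split an unordered file into sorted runs of at most chunk-size lines,
;; sorting each run in memory; returns the list of run file names
(define (make-sorted-runs in-path chunk-size)
  (call-with-input-file in-path
    (lambda (in)
      (let loop ([runs '()] [n 0])
        (define chunk (read-chunk in chunk-size))
        (if (null? chunk)
            (reverse runs)
            (let ([run-path (format "run-~a.txt" n)])
              (call-with-output-file run-path
                #:exists 'replace
                (lambda (out)
                  (for ([line (in-list (sort chunk string<?))])
                    (displayln line out))))
              (loop (cons run-path runs) (add1 n))))))))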
This is very efficient and it works with stupendously huge data files
because the merge step can (more or less) simultaneously process as many
files (split sections) as data _records_ will fit into memory.
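
The merge/filter step, roughly (same assumptions as above: sorted runs, one
record per line, string<? ordering; a linear scan for the smallest pending
line is fine for your ~200 runs - you'd want a heap for thousands):

#lang racket

;; merge any number of sorted run files into one sorted, de-duplicated file
(define (merge-runs run-paths out-path)
  (define ports (list->vector (map open-input-file run-paths)))
  (define heads (vector-map read-line ports))   ; pending line from each run
  (call-with-output-file out-path
    #:exists 'replace
    (lambda (out)
      (let loop ([last #f])
        ;; index of the run whose pending line sorts first, #f when all done
        (define best
          (for/fold ([best #f]) ([i (in-range (vector-length heads))])
            (let ([line (vector-ref heads i)])
              (cond [(eof-object? line) best]
                    [(or (not best) (string<? line (vector-ref heads best))) i]
                    [else best]))))
        (when best
          (let ([line (vector-ref heads best)])
            (unless (equal? line last)            ; duplicates dropped here
              (displayln line out))
            (vector-set! heads best (read-line (vector-ref ports best)))
            (loop line))))))
  (for ([p (in-vector ports)]) (close-input-port p)))

Only the current line from each run needs to be in memory at once, which is
why this scales to files that dwarf RAM.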
And there are times when you should give up and embrace a DBMS, if only
for its proficiency at disk-based data handling.
YMMV,
George