On 1/26/2016 2:51 PM, Scotty C wrote:
gneuner2 (george), you are over thinking this thing. My test data of 1 GB is 
but a small sample file. I can't even hash that small 1 GB at the time of data 
creation; the hashed data won't fit in RAM. When I put the redundant data on 
the hard drive, I do some constant-time sorting so that the redundant data 
ends up in roughly 200 usefully sorted files. Some of these files will be 
small and can be hashed with a single read, hash and write. Some will be 
massive (the data won't fit in RAM) and must be split further; this produces 
another type of single read, hash and write. The split files can then be fully 
hashed, which means a second read, hash and write. Recombining the second-level 
files is virtually instantaneous (copy-port) relative to the effort spent to 
get to that point. All of these operations are constant time. It would be nice 
to cut into that big fat hard-drive-induced C, but I can't do it with a single 
read and write on the larger files.


I reject the term "over thinking".

You may have created something that works, but that doesn't mean it can't be improved. Remember that programs running in just tens of kilobytes of memory routinely processed data files containing tens of megabytes. Modern programmers have forgotten - or never knew - how to work with very large data sets that don't necessarily fit into memory.

For example: it matters greatly what is being hashed and how. With Racket hashes, using a string key retains the string for comparison - so hashing a file containing a large percentage of unique strings produces a very large hash ... in the worst case the hash may be bigger than the data file itself. If the strings are suitably long, then crypto-digesting them down to ~20..32-byte values (e.g. SHA-1 or SHA-256) and hashing the digests instead of the strings might allow the hash to be memory resident, or let you process much larger files using the same amount of memory.
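
For instance, a minimal sketch of that digest-keyed approach, assuming one record per line and SHA-1 from the file/sha1 library (the procedure names and the single-file layout here are just illustrative):

  #lang racket
  ;; Sketch: key the hash on a 20-byte SHA-1 digest of each record rather
  ;; than on the record itself, so the table holds small fixed-size keys.
  (require file/sha1)

  (define (digest-of str)
    ;; sha1-bytes reads from an input port and returns a 20-byte digest
    (sha1-bytes (open-input-bytes (string->bytes/utf-8 str))))

  (define (unique-records path)
    (define seen (make-hash))                ; digest -> #t
    (with-input-from-file path
      (lambda ()
        (for/list ([line (in-lines)]
                   #:unless (let ([d (digest-of line)])
                              (begin0 (hash-has-key? seen d)
                                      (hash-set! seen d #t))))
          line))))

The trade-off is the (astronomically small) chance of two distinct records sharing a digest; in exchange, the table's keys stay a fixed 20 bytes no matter how long the records are.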


If the records are sorted on the field(s) that may be duplicated, then by splitting the file and using multi-file merge techniques you can filter duplicates from all the sections without hashing, in a single pass through the whole of the data - leaving a single sorted file as output. If your input data is unordered, then as you split the file you sort each section individually with an in-memory sort. The re-merge step - which is where the filtering is performed - produces sorted output.

This is very efficient, and it works with stupendously huge data files because the merge step can (more or less) simultaneously process as many files (split sections) as data _records_ will fit into memory.
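
Roughly, the merge-with-filtering step could look like the sketch below. It is a sketch only - the names are made up, it assumes one record per line and chunk files already sorted with string<?, and it scans every chunk head linearly, which is fine for a couple hundred chunks (use a heap if you have thousands):

  #lang racket
  ;; Sketch: k-way merge of already-sorted chunk files, dropping duplicates
  ;; as they surface (equal records become adjacent in a sorted merge).
  (define (merge-dedupe chunk-paths out-path)
    (define ports (map open-input-file chunk-paths))
    ;; head: the next record of a chunk as (cons line port), or #f at eof
    (define (head p)
      (define l (read-line p))
      (and (not (eof-object? l)) (cons l p)))
    (define out (open-output-file out-path #:exists 'replace))
    (let loop ([heads (filter values (map head ports))] ; chunks with data left
               [last #f])                               ; last record written
      (cond
        [(null? heads)
         (close-output-port out)
         (for-each close-input-port ports)]
        [else
         (let* ([best (for/fold ([b (car heads)])       ; smallest current head
                                ([h (in-list (cdr heads))])
                        (if (string<? (car h) (car b)) h b))]
                [rest (remq best heads)]
                [next (head (cdr best))])
           (unless (equal? (car best) last)             ; duplicates dropped here
             (displayln (car best) out))
           (loop (if next (cons next rest) rest)
                 (car best)))])))

Because equal records come out of the merge next to one another, filtering is just a comparison against the last record written - no table of any kind is needed.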


And there are times when you should give up and embrace a DBMS, if only for its proficiency at disk-based data handling.
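
With Racket's db library and SQLite, for example, the whole filtering job can be pushed onto the database. A sketch only, assuming one record per line; the table and file names are made up:

  #lang racket
  ;; Sketch: let SQLite do the disk-based deduplication via a PRIMARY KEY.
  (require db)

  (define (dedupe-with-sqlite in-path db-path)
    (define conn (sqlite3-connect #:database db-path #:mode 'create))
    (query-exec conn "CREATE TABLE IF NOT EXISTS recs (rec TEXT PRIMARY KEY)")
    (call-with-transaction conn          ; one transaction => far fewer syncs
      (lambda ()
        (with-input-from-file in-path
          (lambda ()
            (for ([line (in-lines)])
              ;; OR IGNORE silently drops rows that repeat an existing key
              (query-exec conn "INSERT OR IGNORE INTO recs VALUES (?)" line))))))
    ;; the deduplicated records stay on disk in the database file
    (begin0 (query-value conn "SELECT COUNT(*) FROM recs")
            (disconnect conn)))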

YMMV,
George
