On 1/26/2016 2:51 PM, Scotty C wrote:
gneuner2 (george), you are over thinking this thing. My test data of 1 GB is 
but a small sample file. I can't even hash that small 1 GB at the time of data 
creation; the hashed data won't fit in RAM. When I put the redundant data on 
the hard drive, I do some constant-time sorting so that the redundant data 
ends up in roughly 200 usefully sorted files. Some of these files will be 
small and can be hashed with a single read, hash and write. Some will be 
massive (the data won't fit in RAM) and must be split further; this produces 
another type of single read, hash and write. The split files can then be fully 
hashed, which means a second read, hash and write. Recombining the second-level 
files is virtually instantaneous (copy-port) relative to the effort spent to 
get to that point. All of these operations are constant time. It would be nice 
to cut into that big fat hard-drive-induced C, but I can't do it with a single 
read and write on the larger files.


I reject the term "over thinking".

You may have created something that works, but that doesn't mean it can't be improved. Remember that programs running in just tens of kilobytes of memory routinely processed data files containing tens of megabytes. Modern programmers have forgotten - or never knew - how to work with very large data sets that don't necessarily fit into memory.

For example: it matters greatly what is being hashed and how. With Racket hashes, using a string key retains the string for comparison - so hashing a file containing a large percentage of unique strings produces a very large hash ... in the worst case the hash may be bigger than the data file itself. If the strings are suitably long, then crypto-digesting them down to ~20..32-byte values (e.g. SHA-1 or SHA-256) and hashing the digests instead of the strings might allow the hash to be memory resident, or let you process much larger files using the same amount of memory.
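
For instance, a minimal sketch of that digest-keyed approach, assuming one record per line and SHA-1 from the file/sha1 library (the procedure names and the single-file layout here are just illustrative):

  #lang racket
  ;; Sketch: key the hash on a 20-byte SHA-1 digest of each record rather
  ;; than on the record itself, so the table holds small fixed-size keys.
  (require file/sha1)

  (define (digest-of str)
    ;; sha1-bytes reads from an input port and returns a 20-byte digest
    (sha1-bytes (open-input-bytes (string->bytes/utf-8 str))))

  (define (unique-records path)
    (define seen (make-hash))                ; digest -> #t
    (with-input-from-file path
      (lambda ()
        (for/list ([line (in-lines)]
                   #:unless (let ([d (digest-of line)])
                              (begin0 (hash-has-key? seen d)
                                      (hash-set! seen d #t))))
          line))))

The trade-off is the (astronomically small) chance of two distinct records sharing a digest; in exchange, the table's keys stay a fixed 20 bytes no matter how long the records are.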


If the records are sorted on the field(s) that may be duplicated, then by splitting the file and using multi-file merge techniques you can filter duplicates from all the sections without hashing, in a single pass through the whole of the data - leaving a single sorted file as output. If your input data is unordered, then as you split the file you sort each section individually with an in-memory sort. The re-merge step - which is where the filtering is performed - produces sorted output.

This is very efficient, and it works with stupendously huge data files because the merge step can (more or less) simultaneously process as many files (split sections) as data _records_ will fit into memory.
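
Roughly, the merge-with-filtering step could look like the sketch below. It is a sketch only - the names are made up, it assumes one record per line and chunk files already sorted with string<?, and it scans every chunk head linearly, which is fine for a couple hundred chunks (use a heap if you have thousands):

  #lang racket
  ;; Sketch: k-way merge of already-sorted chunk files, dropping duplicates
  ;; as they surface (equal records become adjacent in a sorted merge).
  (define (merge-dedupe chunk-paths out-path)
    (define ports (map open-input-file chunk-paths))
    ;; head: the next record of a chunk as (cons line port), or #f at eof
    (define (head p)
      (define l (read-line p))
      (and (not (eof-object? l)) (cons l p)))
    (define out (open-output-file out-path #:exists 'replace))
    (let loop ([heads (filter values (map head ports))] ; chunks with data left
               [last #f])                               ; last record written
      (cond
        [(null? heads)
         (close-output-port out)
         (for-each close-input-port ports)]
        [else
         (let* ([best (for/fold ([b (car heads)])       ; smallest current head
                                ([h (in-list (cdr heads))])
                        (if (string<? (car h) (car b)) h b))]
                [rest (remq best heads)]
                [next (head (cdr best))])
           (unless (equal? (car best) last)             ; duplicates dropped here
             (displayln (car best) out))
           (loop (if next (cons next rest) rest)
                 (car best)))])))

Because equal records come out of the merge next to one another, filtering is just a comparison against the last record written - no table of any kind is needed.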


And there are times when you should give up and embrace a DBMS, if only for its proficiency at disk-based data handling.
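
With Racket's db library and SQLite, for example, the whole filtering job can be pushed onto the database. A sketch only, assuming one record per line; the table and file names are made up:

  #lang racket
  ;; Sketch: let SQLite do the disk-based deduplication via a PRIMARY KEY.
  (require db)

  (define (dedupe-with-sqlite in-path db-path)
    (define conn (sqlite3-connect #:database db-path #:mode 'create))
    (query-exec conn "CREATE TABLE IF NOT EXISTS recs (rec TEXT PRIMARY KEY)")
    (call-with-transaction conn          ; one transaction => far fewer syncs
      (lambda ()
        (with-input-from-file in-path
          (lambda ()
            (for ([line (in-lines)])
              ;; OR IGNORE silently drops rows that repeat an existing key
              (query-exec conn "INSERT OR IGNORE INTO recs VALUES (?)" line))))))
    ;; the deduplicated records stay on disk in the database file
    (begin0 (query-value conn "SELECT COUNT(*) FROM recs")
            (disconnect conn)))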

YMMV,
George
