You're right.  I assumed that since it was CSV it was ASCII, which
tends to be very compressible.  You might still give compression a try,
though.  How compressible it is will depend on the data; some encodings
tend to compress better than others.

Something like this would be a reasonably easy test (it compresses the
first 100 MB of the original file and throws the output away; --verbose
prints the compression ratio on stderr):

> head -c 104857600 file1 | gzip --verbose > /dev/null
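
If you'd rather see the ratio as a plain number instead of reading the
--verbose output, something like this should work too (untested; divide
the byte count it prints by 104857600):

> head -c 104857600 file1 | gzip | wc -c

One caveat since the data is UTF-8: gzip doesn't care about the encoding
(it only sees bytes), but sort's ordering does depend on the locale.  If
you just need duplicates removed and don't care about collation order,
forcing the C locale in the merge step is byte-exact and usually faster:

> zcat file1.gz file2.gz | LC_ALL=C sort -u | gzip > file3.gz

And if compression turns out to be unnecessary, GNU sort reads all of
its input before it writes when you use -o, so naming an input file as
the output gets you something close to the in-place merge you wanted
(it still needs temp space in $TMPDIR, though):

> LC_ALL=C sort -u -o file1 file1 file2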



Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/02/2014 10:21 AM, S. Dale Morrey wrote:
> I should have mentioned that the data is UTF-8, not strictly ASCII, and
> we need to preserve the encoding.
> 
> I could probably do this without it being "in-place"; let's consider
> that a "nice to have".
> 
> 
> On Thu, Jan 2, 2014 at 10:17 AM, Lloyd Brown <[email protected]> wrote:
> 
>> I'm not aware of anything that will do in-place merges, but have you
>> considered compressing them first, to save enough space that you could
>> manage it without being in-place?  I've seen ASCII compress at 6:1 or
>> more, but it will depend on the specific data.
>>
>> Basically something roughly like this (verify the syntax first):
>>
>>> # compress the individual files
>>> for i in file1 file2; do gzip --verbose "$i"; done
>>> # merge them into a new file, dropping duplicate lines
>>> zcat file1.gz file2.gz | sort -u | gzip > file3.gz
>>
>> Not sure whether that will save you enough space to get it done without
>> being in-place, but it's worth considering.
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>>
>> On 01/02/2014 10:12 AM, S. Dale Morrey wrote:
>>> I have a 15 GB file and a 2 GB file.  Both are sourced from somewhat
>>> similar data, so there could be quite a lot of overlap; both are in
>>> the same format, i.e. plaintext CSV, one entry per line.
>>>
>>> I'd like to read the 2 GB file and add any entries that are present
>>> in it but missing from the 15 GB file, basically merging the two
>>> files.  Space is at a premium, so I'd prefer it be an in-place merge
>>> onto the 15 GB file.  Even if a temp file is made, it would still be
>>> best if the end result was a single 17 GB file.
>>>
>>> I don't want to reinvent the wheel.  The time to perform the
>>> operation is irrelevant, but I'd greatly prefer there not be any
>>> dupes.  Is there a bash command that could facilitate the activity?
>>>
>>
> 

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
