You're right. I assumed that since it was CSV, it was ASCII, which tends to be very compressible. You might still give compressing it a try, though. How compressible it is will depend on the data, but some encodings tend to be more compressible than others.
Something like this would be a reasonably easy test (100MB of the original
file):

  head -c 104857600 file1 | gzip --verbose > /dev/null

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/02/2014 10:21 AM, S. Dale Morrey wrote:
> I should have mentioned that the data is UTF-8, not strictly ASCII, and
> we need to preserve the encoding.
>
> I could probably do this without it being "in place"; let's consider
> that a "nice to have".
>
> On Thu, Jan 2, 2014 at 10:17 AM, Lloyd Brown <[email protected]> wrote:
>
>> I'm not aware of anything that will do in-place merges, but have you
>> considered compressing them first, to save enough space that you could
>> manage it without being in-place? I've seen ASCII compress at 6:1 or
>> more, but it will depend on the specific data.
>>
>> Basically something roughly like this (verify the syntax first):
>>
>>> # compress individual files
>>> for i in file1 file2; do gzip --verbose $i; done
>>> # merge them to a new file
>>> zcat file1.gz file2.gz | sort -u | gzip > file3.gz
>>
>> Not sure if that will save you enough space to get it done without
>> being in-place or not. But worth considering.
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>>
>> On 01/02/2014 10:12 AM, S. Dale Morrey wrote:
>>> I have a 15 GB file and a 2 GB file. Both are sourced from somewhat
>>> similar data, so there could be quite a lot of overlap; both are in
>>> the same format, i.e. plaintext CSV, 1 entry per line.
>>>
>>> I'd like to read the 2 GB file and add any entries that are present
>>> in it but missing from the 15 GB file, basically merging the 2 files.
>>> Space is at a premium, so I'd prefer it be an in-place merge onto the
>>> 15 GB file. Even if a temp file is made, it would still be best if
>>> the end result was a single 17 GB file.
>>>
>>> I don't want to reinvent the wheel. The time to perform the operation
>>> is irrelevant, but I'd greatly prefer there not be any dupes. Is
>>> there a bash command that could facilitate the activity?

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
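For the merge step itself, a minimal sketch along the same lines, not from
the thread, assuming GNU coreutils (file1 and file2 are the example names
used above; /path/with/space is a placeholder for any volume with room):

  # byte-wise comparison: locale-independent and safe for UTF-8 data
  export LC_ALL=C
  # merge both files and drop duplicate lines; sort reads all input
  # before opening the output, so -o may safely name an input file
  sort -u -T /path/with/space -o file1 file1 file2

GNU sort spills to temporary files during an external sort, so -T lets you
point that scratch space at wherever you have room; when it finishes, the
deduplicated merge is left in file1 as a single file.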
