I might be missing a detail, but what about: $> cat A B | sort -u > C

(uniq on its own only drops adjacent duplicates, so the combined input has to be sorted anyway.) You will end up with "one single" file after you delete A/B, though that might be pushing the initial requirement ;) Anyhow, instead of lots of loops and greps and whatnot, this seems simple enough to be worth seeing whether it fits into RAM or onto a system with a larger FS.
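A minimal end-to-end sketch of that approach, assuming GNU coreutils: sort spills to temporary files, and -T lets you point those at whichever filesystem still has room; since the data is UTF-8, the active locale (LC_ALL) affects how sort compares lines.

  # merge both files and drop duplicate lines in one pass
  # -T points sort's temp files at a filesystem with free space (placeholder path)
  cat A B | sort -u -T /path/with/space > C
  # sanity-check line counts before reclaiming the space
  wc -l A B C
  rm A B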
best, O.

On Thu, Jan 2, 2014 at 6:21 PM, S. Dale Morrey <[email protected]> wrote:

> I should have mentioned that the data is UTF-8, not strictly ASCII, and
> we need to preserve the encoding.
>
> I could probably do this without it being "in place"; let's consider
> that a "nice to have".
>
> On Thu, Jan 2, 2014 at 10:17 AM, Lloyd Brown <[email protected]> wrote:
>
> > I'm not aware of anything that will do in-place merges, but have you
> > considered compressing them first, to save enough space that you could
> > manage it without being in-place? I've seen ASCII compress at 6:1 or
> > more, but it will depend on the specific data.
> >
> > Basically something roughly like this (verify the syntax first):
> >
> > > # compress the individual files
> > > for i in file1 file2; do gzip --verbose $i; done
> > > # merge them into a new file
> > > zcat file1.gz file2.gz | sort -u | gzip > file3.gz
> >
> > Not sure if that will save you enough space to get it done without
> > being in-place or not. But worth considering.
> >
> > Lloyd Brown
> > Systems Administrator
> > Fulton Supercomputing Lab
> > Brigham Young University
> > http://marylou.byu.edu
> >
> > On 01/02/2014 10:12 AM, S. Dale Morrey wrote:
> > > I have a 15 GB file and a 2 GB file. Both are sourced from somewhat
> > > similar data, so there could be quite a lot of overlap; both are in
> > > the same format, i.e. plaintext CSV, one entry per line.
> > >
> > > I'd like to read the 2 GB file and add any entries that are present
> > > in it but missing from the 15 GB file, basically merging the two
> > > files. Space is at a premium, so I'd prefer an in-place merge onto
> > > the 15 GB file. Even if a temp file is made, it would still be best
> > > if the end result were a single 17 GB file.
> > >
> > > I don't want to reinvent the wheel. The time to perform the
> > > operation is irrelevant, but I'd greatly prefer there not be any
> > > dupes. Is there a bash command that could facilitate the activity?

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
