I might be missing a detail, but what about:

$> cat A B | sort -u > C

(plain uniq only drops adjacent duplicates, so the data has to be sorted
anyway - sort -u does both in one pass). You'll end up with one single file
after you delete A/B, but that might be pushing the initial requirement ;) -
anyhow, instead of lots of loops and greps and whatnot, this seems simple
enough to see whether you can fit it into RAM or onto a system with a larger
FS.
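
A rough, untested sketch of the full sequence (assuming GNU sort; the
/bigdisk/tmp path is just a placeholder for wherever there's scratch space):

  # byte-wise comparison: fast, and it leaves the UTF-8 bytes untouched
  export LC_ALL=C
  # -u drops duplicate lines, -T points sort's temp files at the roomy disk
  sort -u -T /bigdisk/tmp A B > C && rm A B

sort wants roughly the input's size in temporary space, and -S can raise its
in-memory buffer if the box has the RAM to spare.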

best,
O.


On Thu, Jan 2, 2014 at 6:21 PM, S. Dale Morrey <[email protected]> wrote:

> I should have mentioned that the data is UTF-8, not strictly ASCII, and we
> need to preserve the encoding.
>
> I could probably do this without it being "in-place"; let's consider that a
> "nice to have".
>
>
> On Thu, Jan 2, 2014 at 10:17 AM, Lloyd Brown <[email protected]> wrote:
>
> > I'm not aware of anything that will do in-place merges, but have you
> > considered compressing them first, to save enough space, that you could
> > manage it without being in-place?  I've seen ASCII compress at 6:1 or
> > more, but it will depend on the specific data.
> >
> > Basically something roughly like this (verify the syntax first):
> >
> > > #compress individual files
> > > for i in file1 file2; do gzip --verbose $i; done
> > > #merge them to a new file
> > > zcat file1.gz file2.gz | sort -u | gzip > file3.gz
> >
> > Not sure if that will save you enough space to get it done without being
> > in-place or not.  But worth considering.
> >
> > Lloyd Brown
> > Systems Administrator
> > Fulton Supercomputing Lab
> > Brigham Young University
> > http://marylou.byu.edu
> >
> > On 01/02/2014 10:12 AM, S. Dale Morrey wrote:
> > > I have a 15 GB file and a 2 GB file.  Both are sourced from somewhat
> > > similar data so there could be quite a lot of overlap, and both are in
> > > the same format, i.e. plaintext CSV, one entry per line.
> > >
> > > I'd like to read the 2 GB file and add any entries that are present in
> > > it but missing in the 15 GB file, basically merging the two files.
> > > Space is at a premium, so I'd prefer it be an in-place merge onto the
> > > 15 GB file.  Even if a temp file is made, it would still be best if the
> > > end result was a single 17 GB file.
> > >
> > > I don't want to reinvent the wheel.  The time to perform the operation
> > > is irrelevant, but I'd greatly prefer there not be any dupes.  Is there
> > > a bash command that could facilitate the activity?

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
