On Thu, 2 Jan 2014 10:12:00 -0700
"S. Dale Morrey" <[email protected]> wrote:

> I have a 15 GB file and a 2 GB file.  Both are sourced from somewhat
> similar data, so there could be quite a lot of overlap; both are in
> the same format, i.e. plaintext CSV, 1 entry per line.


> I don't want to reinvent the wheel.  The time to perform the
> operation is irrelevant, but I'd greatly prefer there not be any
> dupes.  Is there a bash command that could facilitate the activity?

Something like:

cat filex filey | sort -u > filez

It will likely take a while, and you will want lots of disk space for
temporary files and the like, but it doesn't require re-inventing any
wheels.
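
If temporary space is tight, you can point sort at a partition with
room to spare and use plain byte-order collation, which is usually much
faster for a job like this.  A sketch, assuming the files really are
named filex and filey and that /bigdisk/tmp is a directory on a big
enough drive (that path is just an example); note that sort can read
both files directly, so the cat isn't strictly needed:

LC_ALL=C sort -u -T /bigdisk/tmp filex filey > filez

LC_ALL=C makes sort compare raw bytes instead of doing locale-aware
collation, and -T tells it where to write its temporary spill files.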

This will not only catch duplicates across the two files, but it will
also catch any duplicates within a file.
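
For example, with two toy files (not your CSVs):

printf 'a\nb\nb\n' > t1
printf 'b\nc\n' > t2
sort -u t1 t2

prints a, b, and c once each: the repeated b inside t1 and the b shared
between t1 and t2 are both collapsed.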

If the two files are already sorted, "sort -m" will merge them without
a full re-sort.
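
Combined with -u, and again assuming the inputs were sorted under
byte-order collation, that would look something like:

LC_ALL=C sort -m -u filex filey > filez

-m only merges correctly if both inputs were sorted under the same
collation, so keep LC_ALL consistent between the original sorts and the
merge.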


