On Thu, 2 Jan 2014 10:12:00 -0700 "S. Dale Morrey" <[email protected]> wrote:
> I have a 15 GB file and a 2 GB file.  Both are sourced from somewhat
> similar data so there could be quite a lot of overlap, both are in
> the same format, i.e. plaintext CSV, 1 entry per line.
>
> I don't want to reinvent the wheel.  The time to perform the
> operation is irrelevant, but I'd greatly prefer there not be any
> dupes.  Is there a bash command that could facilitate the activity?

Something like:

    cat filex filey | sort -u > filez

It will likely take a while, and you will want lots of disk space for
temporary files and the like, but it doesn't require re-inventing any
wheels.  This will not only catch duplicates across the two files, but
it will also catch any duplicates within a file.

If the two files are already sorted, "sort -m" will merge them without
re-sorting everything (keep -u to drop the duplicates).
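For inputs this size it can help to tell sort where to put its
temporary files and how much memory to use.  A minimal sketch, assuming
GNU coreutils sort; the temp directory, buffer size, and file names are
placeholders, and LC_ALL=C assumes plain byte-order collation is
acceptable for the data:

    # Byte-order collation: faster, and -u then drops byte-identical lines.
    export LC_ALL=C

    # Unsorted inputs: concatenate, sort, and drop duplicate lines.
    # -T points sort's temporary files at a filesystem with plenty of space;
    # -S caps the in-memory buffer before it spills to disk.
    cat filex filey | sort -u -T /big/scratch -S 1G > filez

    # If filex and filey are each already sorted, merge instead of re-sorting:
    sort -m -u filex filey > filez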
