That’s an interesting challenge; I had some fun playing with it.
Here’s a simple Python program that prints the hashes present in both lists.
It assumes two files, biglist and smalllist, with one hash per line, and it
reads the entire biglist into memory. In my testing the memory required is
about 2x the size of biglist. With a biglist of 32 million records and a
smalllist of 300,000 records, it runs in about 40 seconds.
#!/usr/bin/env python
# Load every hash from biglist into a set for O(1) membership tests.
bigset = set()
with open("biglist") as f:
    for line in f:
        bigset.add(line.strip())
# Print each hash from smalllist that also appears in biglist.
with open("smalllist") as f:
    for line in f:
        line = line.strip()
        if line in bigset:
            print(line)
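For comparison, the same intersection can be done with standard Unix tools
alone, which may suit the "standard/free tools" requirement. This is a minimal
sketch, assuming the same biglist and smalllist file names as above (one hash
per line); comm requires its inputs to be sorted first, and the .sorted file
names here are just illustrative.

```shell
# comm needs sorted input, so sort both lists first.
# (These intermediate .sorted names are arbitrary.)
sort biglist -o biglist.sorted
sort smalllist -o smalllist.sorted

# comm -12 suppresses lines unique to file 1 and file 2,
# leaving only the lines common to both.
comm -12 biglist.sorted smalllist.sorted
```

Sorting a 30 M-line file is the expensive step, but GNU sort handles files
larger than memory by spilling to temporary files, so this avoids holding the
big list in RAM.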
--
Edward
From: [email protected] [mailto:[email protected]] On
Behalf Of Richard Stovall
Sent: Tuesday, June 28, 2016 11:03 AM
To: [email protected]
Subject: [NTSysADM] Compare two large lists
Not necessarily Windows-related.
I need to compare a list of about 300,000 file hashes against a larger list of
~30,000,000 and find ones that are represented in both data sets.
I'm not a database guy, nor have I ever played one on TeeVee.
Any ideas about how to go about this with standard/free tools in Windows or
Linux?
TIA,
RS