Setting this up took about 30 minutes total, including downloads and
updating SQL Express from Windows Update after installation.  The query
took 7 seconds with the distinct operator and 6 without.  (On my 8.1
workstation with an SSD and 16GB RAM).  There should have been no
duplicates, and in the end there weren't.  The .mdf is 1.3GB.  I still have
to add about 1.5 million rows to the big list, but that won't add too much
to the file size.

Thank you very, very much for suggestion.


On Tue, Jun 28, 2016 at 6:37 PM, Richard Stovall <[email protected]> wrote:

> Shazam!  What a great post, particularly on the fly.  Thank you very much.
>
> Alas, this will ultimately be for production use (even though the current
> data is just test), so SQL Developer is out.  SQL Express, however, might
> do the trick nicely.
>
> I like the idea of a database solution as the lists will change over time
> and having the infrastructure in place to add to and compare with old lists
> is quite appealing.
>
> Again, many thanks for the detailed post.
>
>
> On Tue, Jun 28, 2016 at 3:02 PM, Frank Ress <[email protected]>
> wrote:
>
>> I’m a SQL guy (SQL Server of late).  You can download and install the
>> Developer Edition – it’s free.  You really don’t need any of the add-on
>> services (Analysis Services, Reporting Services, Integration Services…).
>> Just install the database itself and the management tools.  Bing/Microsoft
>> will help answer any questions you might have regarding installation.
>>
>>
>>
>> Once it’s running, open Management Studio.  Connect to your new instance
>> and expand the navigator pane to see the databases.  You’ll have 4 system
>> databases by default – SYSTEM, MODEL, MSDB, and TEMP.
>>
>>
>>
>> Create a new database (right-click the ‘Databases’ node, New Database,
>> etc. – defaults for file names and locations will be fine).  Name it
>> whatever you’d like.  Once the database is created, right-click that
>> database in the navigator pane and pick ‘Tasks’/Import Data…  Assuming you
>> have the hashes in a spreadsheet or whatever, just import both lists each
>> into its own table using the import wizard.  Name the tables whatever you’d
>> like, e.g. BigList and SmallList.  You’ll also give the columns with the
>> data a name (let’s assume you have no other columns of info for each table,
>> just the hashes).  You can give them the same name, but it’s easier if
>> they’re unique.  Call them BigListHash and SmallListHash, for example.
>>
>>
>>
>> Once the tables are created and populated, right-click your database in
>> the navigator again and select ‘New Query’.  A new editing pane will open
>> to the right of the navigator.  In the query pane, enter:
>>
>>
>>
>> SELECT DISTINCT SmallListHash
>>
>>   FROM SmallList
>>
>> INNER JOIN BigList ON SmallListHash = BigListHash
>>
>>
>>
>> You don’t need the ‘DISTINCT’ operator if there are no duplicate hash
>> values in your lists.  The query would perform better without it, but using
>> it will eliminate any dups that exist in the data.  Other than speed, can’t
>> hurt to have it.
>>
>>
>>
>> There are buttons on the toolbar that will let you export the results to
>> a text file, if you’d like.
>>
>>
>>
>> HTH
>>
>>
>>
>> Frank Ress
>>
>> Gas Technology Institute
>>
>>
>>
>> *From:* [email protected] [mailto:
>> [email protected]] *On Behalf Of *Richard Stovall
>> *Sent:* Tuesday, June 28, 2016 1:03 PM
>> *To:* [email protected]
>> *Subject:* [NTSysADM] Compare two large lists
>>
>>
>>
>> Not necessarily Windows-related.
>>
>>
>>
>> I need to compare a list of about 300,000 file hashes against a larger
>> list of ~30,000,000 and find ones that are represented in both data sets.
>>
>>
>>
>> I'm not a database guy, nor have I ever played one on TeeVee.
>>
>>
>>
>> Any ideas about how to go about this with standard/free tools in Windows
>> or Linux?
>>
>>
>>
>> TIA,
>>
>> RS
>>
>> ------------------------------
>>
>> This communication is for the use of the intended recipient only. It may
>> contain information that is privileged and confidential. If you are not the
>> intended recipient of this communication, the disclosure, copying,
>> distribution or use hereof is prohibited. If you have received this
>> communication in error, please advise me by return e-mail or by telephone
>> and then delete it immediately.
>>
>
>

Reply via email to