OK, time to pull out the black magic! PowerShell v4 added the magic .Where() and .ForEach() methods, which are much faster than piping to Where-Object and ForEach-Object. This is a simple way to make your loops run several times faster. Syntax details: http://www.powershellmagazine.com/2014/10/22/foreach-and-where-magic-methods/
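For example, a rough sketch (untested here, reusing the file path and IP column from Geoff's post):

    $CSV = Import-Csv C:\DB_Export.csv    # the 260k-row export

    # Old way: pipe through Where-Object
    $hits = $CSV | Where-Object { $_.IP -eq '192.168.1.1' }

    # Magic-method way: same result, without the pipeline overhead
    $hits = $CSV.Where({ $_.IP -eq '192.168.1.1' })

    # .ForEach() works the same way
    $allIPs = $CSV.ForEach({ $_.IP })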
Next idea: BINARY SEARCH! If you sort your 260k CSV file by IP address, why should you have to search through the whole thing? Checking the middle element lets you throw away half of the remaining array on every cycle. Build a plain array to search against, then for each IP in your 9k list do a binary search into the 260k list. This works best combined with the IP-to-decimal conversion already mentioned (check out ([ipaddress]"192.168.1.1").Address in PowerShell for the numeric Address property you can add to each row). BONUS POINTS: use the magic .ForEach instead of a plain old foreach. (Note: there is some more trickery available if the 260k CSV contains duplicate IPs and you need ALL matching lines, since BinarySearch only returns one index, but you still never have to walk the whole list.) Basically:

    $MatchedIPAddresses = New-Object System.Collections.ArrayList($null)
    $BigAList = $BigAList | sort IPAddressDecimal
    # Pull the property out once so the binary search doesn't have to unroll
    # IPAddressDecimal from the list on every one of the 9k searches:
    $array = $BigAList.IPAddressDecimal
    foreach ($IP in $LittleAList.IPAddressDecimal) {
        $index = [array]::BinarySearch($array, $IP)
        if ($index -ge 0) {
            $Null = $MatchedIPAddresses.Add($BigAList[$index])
        }
    }

But be warned: this is just theory. I haven't tried it to see whether it works. Also, I have often used

    $MatchedIPAddresses += $BigAList[$index]

instead of the $MatchedIPAddresses.Add() method. That is a very easy way to add objects to an array of objects, but every time you do it PowerShell creates a new array of length (current length + 1), copies everything from the old array into the new one, and then adds your match to the end. So as your array of objects grows, each += takes longer. You get much better performance from an ArrayList.

OK, next idea: MULTITHREADING! If you have 9k independent IP addresses, why wait for each one to complete before you start the next? Check out: https://github.com/RamblingCookieMonster/Invoke-Parallel If that is too much, consider simply importing 1/5th of your 9k addresses into each of 5 different PowerShell sessions. Since the 5 sessions together cover all the IP addresses, have each one write its own output file and combine the 5 files at the very end. You'll love it! If you could get a 9k-core computer, you could do the whole thing in the time it currently takes to do one address. :)

I think the binary search will give you the greatest performance boost. Multithreading (even the very manual way) will also drastically reduce your processing time. Lastly, the magic .ForEach and .Where will make each pass through the entries much quicker. Play with it a bit and PLEASE PLEASE PLEASE take a moment to update us with how it went. You could turn these into functions so that loading the data is quick, open each PowerShell window, and just call something like Calculate-IPaddressMatches to do the job. Then you can even use something like this to tell us how long it took:

    ((history)[-1].EndExecutionTime - (history)[-1].StartExecutionTime).TotalSeconds

Thanks,
Devin Rich
Systems Administrator

On Wed, May 25, 2016 at 12:00 PM, its.mike <its.m...@shaw.ca> wrote:

> 2 ¢:
>
> I'd try a hash table, created from the 9k DNS entries.
>
> Then spin the 260k file checking its IP addresses against the hash table keys.
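To make that 2¢ concrete, here is a minimal hash-table sketch (untested, and assuming both files have an IP column as in Geoff's example; the output path is made up):

    $List = Import-Csv C:\IP_Addresses.csv    # 9k IPs
    $CSV  = Import-Csv C:\DB_Export.csv       # 260k rows

    # Build a hash table keyed on the 9k IPs (the value doesn't matter here)
    $lookup = @{}
    foreach ($row in $List) { $lookup[$row.IP] = $true }

    # Single pass over the 260k rows; each ContainsKey check is effectively O(1)
    $matched = foreach ($row in $CSV) {
        if ($lookup.ContainsKey($row.IP)) { $row }
    }

    # Example output file name -- an assumption, not from the thread
    $matched | Export-Csv C:\Matched_IPs.csv -NoTypeInformation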
> From: listsad...@lists.myitforum.com [mailto:listsad...@lists.myitforum.com] On Behalf Of Michael B. Smith
> Sent: Tuesday, May 24, 2016 4:03 PM
> To: powershell@lists.myitforum.com
> Subject: [powershell] RE: CSV Comparisons:
>
> Add an additional field in both files that is the uint32 version of the IP. Sort both files ascending by the new field.
>
> Since the files are already sorted, compare-object will be very fast. If you want it even faster, you can create a couple of while() loops yourself to do the comparisons.
>
> There are a variety of mechanisms for converting the text IP to a uint32 IP. The easiest is not the fastest. :) Depends on how often you will need to do this as to whether that's relevant.
>
> Probably the easiest:
>
> ( '192.168.231.10' -as [IpAddress] ).Address
>
> From: listsad...@lists.myitforum.com [mailto:listsad...@lists.myitforum.com] On Behalf Of Orlebeck, Geoffrey
> Sent: Tuesday, May 24, 2016 6:03 PM
> To: 'powershell@lists.myitforum.com'
> Subject: [powershell] CSV Comparisons:
>
> I have two CSV files. One is a list of IP addresses and DNS hostnames. The other has IP, DNS name (if it resolves), as well as "firstseen" and "lastseen" on the network. I am attempting to compare the list of IP addresses to the second list that contains additional fields.
>
> The dataset has the following format:
>
> [image: sample rows with IP, Name, FirstSeen, and LastSeen columns]
>
> The goal is to loop through the list of IP addresses and check for matches in the CSV file. If found, I want to note the entire row with all fields (IP, Name, FirstSeen, LastSeen). I'm not sure of the most efficient way to process this. The smaller list is ~9k entries. The larger list has over 260k entries. On a small scale I am able to see this logic will at least give me a list of IP addresses found in both files:
>
> $List = Import-Csv C:\IP_Addresses.csv
> $CSV = Import-Csv C:\DB_Export.csv
> $Matches = $List | % {$CSV.ip -like $_.ip}
>
> The efficiency of this method is where I'm concerned. I'm currently working with a limited subset of the data (about 500 entries each), and it takes a few minutes to parse. However, with the full list being 9,000 IP addresses against 260,000+ rows in $CSV, I have a feeling it will take quite some time to process.
>
> I'm wondering if anyone has any examples of such data comparisons. My end goal is to parse the 9k IP addresses, compare them against our 260k entries, and eventually filter based on the 'LastSeen' field. My gut says using Compare-Object or the above example is a bad process, I'm just not sure how else to go about processing such large files. Basically I read about the differences between Get-Content versus StreamReader here <https://foxdeploy.com/2016/03/23/coding-for-speed/> and thought there is probably a better way than the built-in cmdlets I am attempting to use.
>
> Thank you.
>
> -Geoff