# from website reader on Saturday 29 September 2012:
>I have a list of about 2 to 5 thousand items where an item is a couple
>of text words such as "Side 2050" and S is always in the starting
>column, and have to search a large file around 24 gigabytes in size
>for these items. The file is a simple text file, delineated by line
>feeds.
If the file won't fit in RAM and you're searching it from start to
finish for each item, you'll pay for the disk reads 2-5 thousand times.
Much better to read through the file once and check all of your items
as you go.
If you don't know Perl, I think this is a really good time to read the
`man perlintro` document.
You could go through the file line-by-line and check all of your items
on each line:

    while (my $line = <>) {
        foreach my $item (@items) {
            if ($line =~ m/^\Q$item\E/) {
                print $line;
                last;    # one match is enough for this line
            }
        }
    }
And that might get it done in less time than it takes to optimize it,
but I'm guessing you could easily compile all of your items into one
regexp (`man perlre`) and save a lot of time with the inner loop.
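A minimal sketch of that, assuming the items are plain literal strings
(the `@items` contents here are made up; real code would load the 2-5
thousand items from a file):

```perl
use strict;
use warnings;

# Hypothetical item list standing in for the real 2-5k entries.
my @items = ('Side 2050', 'Side 3000');

# quotemeta escapes any regexp metacharacters in the items, and the
# whole alternation is compiled once, so the inner foreach disappears.
my $pattern = join '|', map { quotemeta } @items;
my $match   = qr/^(?:$pattern)/;

# A small in-memory handle stands in for the 24 GB file.
open my $fh, '<', \"Side 2050 first widget\nno match here\n" or die;
while (my $line = <$fh>) {
    print $line if $line =~ $match;
}
```

Modern perls (5.10 and later) compile a long alternation of literals
into a trie, so this tends to hold up well even with thousands of items.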
Maybe also try reading chunks of e.g. 100 lines and applying a regexp
with appropriate anchoring + captures, which might get you more speed by
moving some of the outer loop down into the regexp engine.
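One way that chunked version might look -- a sketch, with a tiny
in-memory handle standing in for the big file, reading fixed-size
records (setting `$/` to a scalar reference makes the readline operator
return raw blocks instead of lines) and letting `/m` anchor `^` at every
line start within a chunk:

```perl
use strict;
use warnings;

# Hypothetical setup: @items would hold thousands of entries and $fh
# would be the 24 GB file.
my @items = ('Side 2050', 'Side 3000');
open my $fh, '<', \"Side 2050 a\nskip me\nSide 3000 b\n" or die;

my $pattern = join '|', map { quotemeta } @items;
# /m makes ^ match at the start of every line inside a chunk;
# the capture grabs the rest of the matching line.
my $match = qr/^((?:$pattern)[^\n]*)/m;

local $/ = \65536;    # read fixed 64 KB records instead of lines
my $tail = '';
my @hits;
while (my $chunk = <$fh>) {
    $chunk = $tail . $chunk;
    # Hold back any partial final line for the next chunk.
    my $cut = rindex $chunk, "\n";
    ($tail, $chunk) = $cut < 0
        ? ($chunk, '')
        : (substr($chunk, $cut + 1), substr($chunk, 0, $cut + 1));
    push @hits, $chunk =~ /$match/g;
}
push @hits, $tail =~ /$match/g;    # last line may lack a newline
print "$_\n" for @hits;
```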
For advanced parallelization, read as many chunks of whatever size
memory allows and farm them out to however many cores are available.
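One hypothetical way to farm the work out with plain `fork`: split the
file into one byte range per core, have each child seek to its range
and scan it, and let each child finish the line that crosses its end
boundary while the next child skips its partial first line. A tiny temp
file stands in for the 24 GB one, `$cores` is an assumed core count,
and the children append hits to a shared results file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my @items   = ('Side 2050', 'Side 3000');
my $pattern = join '|', map { quotemeta } @items;
my $match   = qr/^(?:$pattern)/;

my ($dh, $data) = tempfile(UNLINK => 1);
print $dh "Side 2050 a\nskip\nSide 3000 b\nskip again\n";
close $dh;
my ($rh, $results) = tempfile(UNLINK => 1);   # children append hits here
close $rh;

my $cores = 2;
my $span  = int((-s $data) / $cores) + 1;

for my $n (0 .. $cores - 1) {
    next if fork;                  # parent just keeps forking
    open my $fh,  '<',  $data    or die "open: $!";
    open my $out, '>>', $results or die "open: $!";
    my $start = $n * $span;
    seek $fh, $start, 0;
    # Skip the partial line at the boundary; the previous child owns
    # it, since it only stops *after* finishing the line that crosses
    # its end offset.
    <$fh> if $start;
    while (my $line = <$fh>) {
        print $out $line if $line =~ $match;
        last if tell($fh) > $start + $span;
    }
    exit 0;
}
wait for 1 .. $cores;              # reap all the children

open my $in, '<', $results or die;
print sort <$in>;
```

Appending short writes with `>>` keeps the children from clobbering
each other; for heavier use you'd want per-child output files or a pipe
back to the parent.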
--Eric
--
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug