# from website reader on Saturday 29 September 2012:
>I have a list of about 2 to 5 thousand items where an item is a couple
>of text words such as "Side 2050" and S is always in the starting
>column, and have to search a large file around 24 gigabytes in size
>for these items. The file is a simple text file, delineated by line
>feeds.
If the file won't fit in RAM and you're searching it from start to
finish for each item, you'll pay for the disk reads 2-5 thousand times.
Much better to read through the file once and check all of your items
as you go.
If you don't know Perl, I think this is a really good time to read the
`man perlintro` document.
You could go through the file line-by-line and check all of your items
on each line:

    while (my $line = <>) {
        foreach my $item (@items) {
            if ($line =~ m/^\Q$item\E/) {
                print $line;
                last;    # one match is enough for this line
            }
        }
    }
And that might get it done in less time than it takes to optimize it,
but I'm guessing you could easily compile all of your items into one
regexp (`man perlre`) and save a lot of time with the inner loop.
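A minimal sketch of that, assuming the items are plain literal strings
(the `@items` contents here are made up; real code would load the 2-5
thousand items from a file):

```perl
use strict;
use warnings;

# Hypothetical item list standing in for the real 2-5k entries.
my @items = ('Side 2050', 'Side 3000');

# quotemeta escapes any regexp metacharacters in the items, and the
# whole alternation is compiled once, so the inner foreach disappears.
my $pattern = join '|', map { quotemeta } @items;
my $match   = qr/^(?:$pattern)/;

# A small in-memory handle stands in for the 24 GB file.
open my $fh, '<', \"Side 2050 first widget\nno match here\n" or die;
while (my $line = <$fh>) {
    print $line if $line =~ $match;
}
```

Modern perls (5.10 and later) compile a long alternation of literals
into a trie, so this tends to hold up well even with thousands of items.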
Maybe also try reading chunks of e.g. 100 lines and applying a regexp
with appropriate anchoring + captures, which might get you more speed by
moving some of the outer loop down into the regexp engine.
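One way that chunked version might look -- a sketch, with a tiny
in-memory handle standing in for the big file, reading fixed-size
records (setting `$/` to a scalar reference makes the readline operator
return raw blocks instead of lines) and letting `/m` anchor `^` at every
line start within a chunk:

```perl
use strict;
use warnings;

# Hypothetical setup: @items would hold thousands of entries and $fh
# would be the 24 GB file.
my @items = ('Side 2050', 'Side 3000');
open my $fh, '<', \"Side 2050 a\nskip me\nSide 3000 b\n" or die;

my $pattern = join '|', map { quotemeta } @items;
# /m makes ^ match at the start of every line inside a chunk;
# the capture grabs the rest of the matching line.
my $match = qr/^((?:$pattern)[^\n]*)/m;

local $/ = \65536;    # read fixed 64 KB records instead of lines
my $tail = '';
my @hits;
while (my $chunk = <$fh>) {
    $chunk = $tail . $chunk;
    # Hold back any partial final line for the next chunk.
    my $cut = rindex $chunk, "\n";
    ($tail, $chunk) = $cut < 0
        ? ($chunk, '')
        : (substr($chunk, $cut + 1), substr($chunk, 0, $cut + 1));
    push @hits, $chunk =~ /$match/g;
}
push @hits, $tail =~ /$match/g;    # last line may lack a newline
print "$_\n" for @hits;
```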
For advanced parallelization, read as many chunks of whatever size
memory allows and farm them out to however many cores are available.
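One hypothetical way to farm the work out with plain `fork`: split the
file into one byte range per core, have each child seek to its range
and scan it, and let each child finish the line that crosses its end
boundary while the next child skips its partial first line. A tiny temp
file stands in for the 24 GB one, `$cores` is an assumed core count,
and the children append hits to a shared results file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my @items   = ('Side 2050', 'Side 3000');
my $pattern = join '|', map { quotemeta } @items;
my $match   = qr/^(?:$pattern)/;

my ($dh, $data) = tempfile(UNLINK => 1);
print $dh "Side 2050 a\nskip\nSide 3000 b\nskip again\n";
close $dh;
my ($rh, $results) = tempfile(UNLINK => 1);   # children append hits here
close $rh;

my $cores = 2;
my $span  = int((-s $data) / $cores) + 1;

for my $n (0 .. $cores - 1) {
    next if fork;                  # parent just keeps forking
    open my $fh,  '<',  $data    or die "open: $!";
    open my $out, '>>', $results or die "open: $!";
    my $start = $n * $span;
    seek $fh, $start, 0;
    # Skip the partial line at the boundary; the previous child owns
    # it, since it only stops *after* finishing the line that crosses
    # its end offset.
    <$fh> if $start;
    while (my $line = <$fh>) {
        print $out $line if $line =~ $match;
        last if tell($fh) > $start + $span;
    }
    exit 0;
}
wait for 1 .. $cores;              # reap all the children

open my $in, '<', $results or die;
print sort <$in>;
```

Appending short writes with `>>` keeps the children from clobbering
each other; for heavier use you'd want per-child output files or a pipe
back to the parent.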
--Eric
--
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug