On Fri, 2009-07-10 at 13:13 -0700, Jim Gibson wrote:
> On 7/10/09 Fri Jul 10, 2009 12:37 PM, "Steve"
> <steve.h...@digitalcertainty.co.uk> scribbled:
>
> > Hi list members. I'm new here and I'm interested in learning about
> > Perl. I've managed to get some programs together but have a lot to
> > learn.
> >
> > May I put a question to the experts?
> >
> > Suppose I have a text file that has a whopping number of lines
> > (20-50,000). What would be the best way to condense or index this
> > with Perl so I could query it and get a yes/no answer as fast as
> > possible with minimum overhead? I could maintain a database to do
> > it, but for one-line entries this would probably be overkill. It's
> > easy to swap out a text file to update it and manage the list
> > concerned.
> >
> > What would be the best approach?
>
> That depends upon the query that you want to do. If it is just to see
> whether a line exists in the file, then a hash would be best:
>
> my %hash;
> while (<>) {
>     chomp;
>     $hash{$_} = 1;    # each line of input becomes a hash key
> }
>
> ...
>
> if ( exists $hash{$someline} ) {
>     print "Line <$someline> exists\n";
> } else {
>     print "Line <$someline> not found\n";
> }
>
> If it is something more complex, then reading the file into an array
> might work efficiently.
>
> We need a little more information to help you further.

The list concerned is updated on a daily - and sometimes hourly - basis, and runs to about 90,000 lines as I look at it this morning.
If I used a database, it would mean dropping the table and repopulating it on every list change, rather than just swapping out the list and rebuilding an index. Each line of the file contains a single URI. The script will run frequently - many times a minute - on demand.

To complicate things a little, there may be a total of three similar lists: one for URIs, one for domains, and one for telephone numbers.

My key objectives are fast loading, fast lookups, and minimal overhead. (I also live in an ideal world!)

Here is what I've considered: load the lists into a hash on start. This is going to mean big hashes and will take time; it strikes me as 'daft' to do this. If I could index the list and load that instead, I suspect it would take less room and execute faster.

I suspect what I really need is a very simple form of indexing database. I think the overhead of MySQL or PostgreSQL would be serious overkill for single-line queries, so I'm hoping someone can suggest a much lighter alternative.

If I can give an analogy, something that works a bit like Postfix's 'postmap' command may be ideal: it turns a flat file into a smaller, faster index file. I've seen some variations on this where .idx (rather than .db) files have been created, but being new to this I don't fully understand them.

Reading that back, I don't know if I've made things any clearer or made things more muddy :-)
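A postmap-like index is achievable in pure Perl with the DB_File module, which ties a hash to a Berkeley DB file on disk (typically the same format as the .db files postmap itself produces). Here is a minimal sketch, assuming one URI per line in a flat file; the file names uris.txt and uris.db are made up for illustration, and in practice the build and query steps would live in two separate scripts:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # --- Build step (rerun whenever the flat file changes) ---
    # Like 'postmap', this turns the flat file into an on-disk hash index.
    tie my %index, 'DB_File', 'uris.db.tmp', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot create uris.db.tmp: $!";
    open my $in, '<', 'uris.txt' or die "Cannot open uris.txt: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        $index{$line} = 1;    # the key alone is enough for a yes/no test
    }
    close $in;
    untie %index;

    # Swap the new index into place atomically, so a query never sees a
    # half-built file - the same trick as swapping out the flat list.
    rename 'uris.db.tmp', 'uris.db'
        or die "Cannot swap in new index: $!";

    # --- Query step (runs many times a minute, on demand) ---
    # Opens the index read-only; the 90,000 lines are never loaded into
    # memory, so startup stays fast.
    tie my %lookup, 'DB_File', 'uris.db', O_RDONLY, 0644, $DB_HASH
        or die "Cannot open uris.db: $!";
    my $uri = shift @ARGV;
    die "Usage: $0 <uri>\n" unless defined $uri;
    print exists $lookup{$uri} ? "yes\n" : "no\n";
    untie %lookup;

The same pattern would cover the domain and telephone-number lists, with one index file each. If DB_File is not available, AnyDBM_File (in core Perl) falls back to whatever DBM flavour the system provides, and DBD::SQLite is another self-contained option that avoids running a full MySQL or PostgreSQL server.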