On Fri, 2009-07-10 at 13:13 -0700, Jim Gibson wrote:
> On 7/10/09 Fri  Jul 10, 2009  12:37 PM, "Steve"
> <steve.h...@digitalcertainty.co.uk> scribbled:
> 
> > Hi list members. I'm new here and I'm interested in learning about
> > Perl. I've managed to get some programs together but have a lot to
> > learn.
> > 
> > May I put a question to the experts?
> > 
> > Suppose I have a text file with a whopping number of lines
> > (20-50,000). What would be the best way to condense or index it with
> > Perl so I could query it and get a yes/no answer as fast as possible
> > with minimal overhead? I could maintain a database to do it, but for
> > one-line entries that would probably be overkill. It's easy to swap
> > out a text file to update and manage the list concerned.
> > 
> > What would be the best approach?
> 
> That depends on the query you want to do. If it is just to see whether
> a line exists in the file, then a hash would be best:
> 
>     my %hash;
>     while (<>) {
>         chomp;              # strip the newline so lookups match exactly
>         $hash{$_} = 1;      # key is the line; the value just marks presence
>     }
> 
>     ...
> 
>     if ( exists $hash{$someline} ) {
>         print "Line <$someline> exists\n";
>     } else {
>         print "Line <$someline> not found\n";
>     }
> 
> If it is something more complex, then reading the file into an array
> might work efficiently.
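> 
> For the more complex case, a rough sketch of the array approach might
> look like this (the file name and pattern are just placeholders):
> 
>     my $pattern = qr/example/;    # placeholder pattern
>     open my $fh, '<', 'list.txt' or die "Cannot open list.txt: $!";
>     chomp( my @lines = <$fh> );
>     close $fh;
> 
>     # grep scans the whole array, so each query costs O(n)
>     my @matches = grep { /$pattern/ } @lines;
>     print "Found ", scalar @matches, " matching line(s)\n";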
> 
> We need a little more information to help you further.
> 
> 
The list concerned is updated on a daily - and sometimes hourly - basis
and runs to about 90,000 lines as I look at it this morning.

If I used a database, it would mean dropping the table and repopulating
it on every list change - rather than just swapping out the list and
rebuilding an index.

Each line of the file contains a single URI. The script will run
frequently - many times a minute - on demand. To complicate things a
little, there may be a total of three similar lists: one for URIs, one
for DOMAINS and one for TELEPHONE NUMBERS.

My key objectives are fast loading, fast lookups and minimal overhead.
(I also live in an ideal world!)

Here is what I've considered:
Load the lists into a HASH at startup. This is going to mean big hashes
and will take time on every run. It strikes me as 'daft' to do this. If
I could index the list once and load the index instead, I suspect it
would take less room and execute faster.

I suspect what I really need is a very simple form of indexed database.
I think the overhead of MySQL or PostgreSQL would be serious overkill
for single-line queries. I'm hoping someone can suggest a much lighter
alternative.
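
Something like the following is the shape of lookup I have in mind:
Perl's DB_File module ties a hash to a Berkeley DB file on disk, so a
query touches only a few pages of the index instead of loading the
whole list. The file name uris.db is just a placeholder; a sketch of
how such an index might be built follows the next paragraph.

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDONLY);

    my $uri = shift @ARGV or die "usage: $0 <uri>\n";

    # tie the hash read-only to the prebuilt index file
    tie my %index, 'DB_File', 'uris.db', O_RDONLY, 0644, $DB_HASH
        or die "Cannot open uris.db: $!";

    print exists $index{$uri} ? "listed\n" : "not listed\n";

    untie %index;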

If I can give an analogy: something that works a bit like Postfix's
'postmap' command may be ideal. It turns a flat file into a smaller,
faster index file. I've seen some variations on this where .idx (rather
than .db) files have been created - but being new to this I don't fully
understand them.
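
To make the analogy concrete, here is a rough sketch of a postmap-style
build step using the same DB_File module: it loads the flat list into a
fresh Berkeley DB file and then renames it into place, so the index can
be swapped atomically while lookups keep running. The file names are
placeholders again.

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    my ( $flat, $db ) = ( 'uris.txt', 'uris.db' );    # placeholder names
    my $tmp = "$db.tmp.$$";

    tie my %index, 'DB_File', $tmp, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot create $tmp: $!";

    open my $fh, '<', $flat or die "Cannot read $flat: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        $index{$line} = 1;    # value is irrelevant; existence is the test
    }
    close $fh;
    untie %index;

    # atomic swap: readers never see a half-built index
    rename $tmp, $db or die "Cannot rename $tmp to $db: $!";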

Reading that back, I don't know if I've made things any clearer or made
things more muddy :-)




