Alan/Alexandr,
There will always be an overhead to using a webserver for this -
even with mod_perl.
Assumptions:
* From what you are saying, there is no actual website
involved; you want to use mod_perl to cache data for an offline process;
* One set of data is used once and once only per run?
Pros:
* Make sure you use your module in startup so that each child
uses the same memory rather than generating its own copy of the data
(see the startup.pl sketch below);
* If you use something like curl multi as the fetcher, you can
write a simple parallel fetching queue to get the data - great if you
have a multi-core box;
Cons:
* There is an overhead to using an HTTP webserver - if you
aren't going to gain much from the parallelization above, you may find
that writing a simple script which loops over all the data would be
more efficient...
* In your case we are probably looking at about 10ms (or less)
per case - the apache/http round tripping will probably take much more
time than the actual processing...
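To illustrate the startup point above - a minimal startup.pl, pulled in
with a PerlRequire line in httpd.conf, could be as small as the following
(the library path is an assumption; the Data module name is taken from
your snippet below):
# startup.pl - loaded once in the parent: PerlRequire /path/to/startup.pl
use lib '/path/to/your/lib';   # assumption: wherever Data.pm lives
# Loading Data here makes the parent parse the files once; the children
# then share those pages copy-on-write for as long as they only read them.
use Data ();
1;
Bear in mind that even read-only access can dirty some pages (Perl's
reference counting writes to the variables it reads), but the bulk of
the data should stay shared.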
On 03/02/2015 05:02, Alexandr Evstigneev wrote:
Pre-loading is good, but what you need, I believe, is the Storable module.
If your files contain parsed data (hashes), just store them
serialized. If they contain raw data that needs to be parsed, you may
pre-parse it, serialize it and store it as binary files.
Storable is written in C and works very fast.
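For example (just a sketch - the file names and the parse_text_file()
helper are placeholders for your existing parsing code):
use Storable qw(nstore retrieve);

# one-off conversion: parse the text file and freeze the resulting hash
my %big_hash = parse_text_file('file.txt');
nstore( \%big_hash, 'file.stor' );

# at startup: retrieve() is much faster than re-parsing the text file
my $big_hashref = retrieve('file.stor');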
2015-02-03 7:11 GMT+03:00 Alan Raetz <alanra...@gmail.com>:
So I have a perl application that upon startup loads about ten
perl hashes (some of them complex) from files. This takes up a few
GB of memory and about 5 minutes. It then iterates through some
cases and reads from (never writes to) these perl hashes. To process
all our cases takes about 3 hours (millions of cases). We
would like to speed up this process. I am thinking this is an
ideal application for mod_perl because it would allow multiple
processes that share memory.
The scheme would be to load the hashes on apache startup and have
a master program send requests with each case and apache children
will use the shared hashes.
I just want to verify some of the details about variable sharing.
Would the following setup work (oversimplified, but you get the
idea…):
In a file Data.pm, which I would use() in my Apache startup.pl,
I would load the perl hashes and provide hash references that can be
retrieved with class methods:
package Data;

my %big_hash;

# load the data once, when the module is first used (i.e. at server startup)
open( my $fh, '<', 'file.txt' ) or die "Cannot open file.txt: $!";
while ( my $line = <$fh> ) {
… code ….
$big_hash{ $key } = $value;
}
close $fh;

sub get_big_hashref { return \%big_hash; }

1;
<snip>
And so in the apache request handler, the code would be something
like:
use Data;
my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..
<snip>
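Fleshed out a little, I imagine the handler would be roughly like this
(the package name, passing the case id as the query string, and the
process_case() call are placeholders, not working code):
package MyHandler;

use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);
use Data;

sub handler {
    my $r = shift;
    my $hashref = Data::get_big_hashref();
    my $case_id = $r->args;    # assume the query string is the case id
    $r->content_type('text/plain');
    $r->print( process_case( $hashref, $case_id ) );   # existing case logic
    return Apache2::Const::OK;
}

1;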
The idea is the HTTP request/response will contain the relevant
input/output for each case… and the master client program will
collect these and concatenate the final output from all the requests.
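The master program I have in mind would be something along these lines
(Parallel::ForkManager and HTTP::Tiny are just the tools I'd guess at
for this, and the URL and case list are made up):
use strict;
use warnings;
use Parallel::ForkManager;
use HTTP::Tiny;

my @cases = load_case_list();              # placeholder for the real case source
my $pm    = Parallel::ForkManager->new(8); # 8 concurrent requests - tune to taste

my @output;
$pm->run_on_finish( sub {
    my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
    push @output, $$data if $data;         # collected in completion order
} );

for my $case (@cases) {
    $pm->start and next;
    my $res = HTTP::Tiny->new->get("http://localhost/process?case=$case");
    $pm->finish( 0, $res->{success} ? \$res->{content} : undef );
}
$pm->wait_all_children;

print join '', @output;
If the final output has to be in case order I would key each response by
its case id instead of just pushing it onto @output.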
So any issues/suggestions with this approach? I am facing a
non-trivial task of refactoring the existing code to work in this
framework, so just wanted to get some feedback before I invest
more time into this...
I am planning on using mod_perl 2.07 on a linux machine.
Thanks in advance, Alan