Alan/Alexandr,
There will always be an overhead to using a webserver for this -
even with mod_perl.
Assumptions:
* From what you are saying, there is no actual website
involved; you want to use mod_perl to cache data for an offline process;
* One set of data is used once and once only per run?
Pros:
* Make sure you use your module in startup so that each child
uses the same memory rather than generating its own copy of the data
(see the startup.pl sketch below);
* If you use something like curl multi as the fetcher, you can
write a simple parallel fetching queue to get the data - great if you
have a multi-core box;
Cons:
* There is an overhead to using an HTTP webserver - if you
aren't going to gain much from the parallelization above, you may find
that writing a simple script which loops over all the data would be
more efficient...
* In your case we are probably looking at about 10ms (or less)
per case - the apache/http round tripping will probably take much more
time than the actual processing...
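To illustrate the startup point above - a minimal startup.pl, pulled in
with a PerlRequire line in httpd.conf, could be as small as the following
(the library path is an assumption; the Data module name is taken from
your snippet below):
# startup.pl - loaded once in the parent: PerlRequire /path/to/startup.pl
use lib '/path/to/your/lib';   # assumption: wherever Data.pm lives
# Loading Data here makes the parent parse the files once; the children
# then share those pages copy-on-write for as long as they only read them.
use Data ();
1;
Bear in mind that even read-only access can dirty some pages (Perl's
reference counting writes to the variables it reads), but the bulk of
the data should stay shared.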
On 03/02/2015 05:02, Alexandr Evstigneev wrote:
Pre-loading is good, but what you need, I believe, is the Storable module.
If your files contain parsed data (hashes), just store them
serialized. If they contain raw data that needs to be parsed, you may
pre-parse it, serialize it and store it as binary files.
Storable is written in C and works very fast.
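For example (just a sketch - the file names and the parse_text_file()
helper are placeholders for your existing parsing code):
use Storable qw(nstore retrieve);

# one-off conversion: parse the text file and freeze the resulting hash
my %big_hash = parse_text_file('file.txt');
nstore( \%big_hash, 'file.stor' );

# at startup: retrieve() is much faster than re-parsing the text file
my $big_hashref = retrieve('file.stor');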
2015-02-03 7:11 GMT+03:00 Alan Raetz <alanra...@gmail.com>:
So I have a perl application that upon startup loads about ten
perl hashes (some of them complex) from files. This takes up a few
GB of memory and about 5 minutes. It then iterates through some
cases and reads from (never writes to) these perl hashes. To process
all our cases takes about 3 hours (millions of cases). We
would like to speed up this process. I am thinking this is an
ideal application for mod_perl because it would allow multiple
processes that share memory.
The scheme would be to load the hashes on apache startup and have
a master program send requests with each case and apache children
will use the shared hashes.
I just want to verify some of the details about variable sharing.
Would the following setup work (oversimplified, but you get the
idea…):
In a file Data.pm, which I would use() in my Apache startup.pl,
I would load the perl hashes and provide hash references that can be
retrieved with class methods:
package Data;

my %big_hash;

# load the data once, when the module is first used (i.e. at server startup)
open( my $fh, '<', 'file.txt' ) or die "Cannot open file.txt: $!";
while ( my $line = <$fh> ) {
… code ….
$big_hash{ $key } = $value;
}
close $fh;

sub get_big_hashref { return \%big_hash; }

1;
<snip>
And so in the apache request handler, the code would be something
like:
use Data;
my $hashref = Data::get_big_hashref();
…. code to access $hashref data with request parameters…..
<snip>
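Fleshed out a little, I imagine the handler would be roughly like this
(the package name, passing the case id as the query string, and the
process_case() call are placeholders, not working code):
package MyHandler;

use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);
use Data;

sub handler {
    my $r = shift;
    my $hashref = Data::get_big_hashref();
    my $case_id = $r->args;    # assume the query string is the case id
    $r->content_type('text/plain');
    $r->print( process_case( $hashref, $case_id ) );   # existing case logic
    return Apache2::Const::OK;
}

1;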
The idea is the HTTP request/response will contain the relevant
input/output for each case… and the master client program will
collect these and concatenate the final output from all the requests.
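The master program I have in mind would be something along these lines
(Parallel::ForkManager and HTTP::Tiny are just the tools I'd guess at
for this, and the URL and case list are made up):
use strict;
use warnings;
use Parallel::ForkManager;
use HTTP::Tiny;

my @cases = load_case_list();              # placeholder for the real case source
my $pm    = Parallel::ForkManager->new(8); # 8 concurrent requests - tune to taste

my @output;
$pm->run_on_finish( sub {
    my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
    push @output, $$data if $data;         # collected in completion order
} );

for my $case (@cases) {
    $pm->start and next;
    my $res = HTTP::Tiny->new->get("http://localhost/process?case=$case");
    $pm->finish( 0, $res->{success} ? \$res->{content} : undef );
}
$pm->wait_all_children;

print join '', @output;
If the final output has to be in case order I would key each response by
its case id instead of just pushing it onto @output.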
So any issues/suggestions with this approach? I am facing a
non-trivial task of refactoring the existing code to work in this
framework, so just wanted to get some feedback before I invest
more time into this...
I am planning on using mod_perl 2.07 on a linux machine.
Thanks in advance, Alan