Re: [Pdl-general] Fastest way to read large file and create PDL matrices?

Chris Marshall Mon, 06 Jul 2015 07:08:21 -0700

Ben-

You can set the value of $PDL::undefval to fill in missing column data
with a specific value. If you don't have the same number of columns in
all rows, you'll need to use explicit column numbers so that all the
data is read. Otherwise, the extra columns will be dropped.  E.g.,


pdl> #cat > in44.cols
 1 2 3 4
 1 2 3
 1 2 3 4

pdl> p rcols 'in44.cols', []
Reading data into piddles of type: [ Double ]
Read in 12 elements.

[
 [1 1 1]
 [2 2 2]
 [3 3 3]
 [4 0 4]
]

while

pdl> #cat > in34.cols
 1 2 3
 1 2 3 4
 1 2 3 4

pdl> p rcols 'in34.cols', []    # uh-oh!
Reading data into piddles of type: [ Double ]
Read in 9 elements.

[
 [1 1 1]
 [2 2 2]
 [3 3 3]
]

pdl> p rcols 'in34.cols', [0..3]  # this works again
Reading data into piddles of type: [ Double ]
Read in 12 elements.

[
 [1 1 1]
 [2 2 2]
 [3 3 3]
 [0 4 4]
]


The zero is from the default value of $PDL::undefval.
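
For instance, here is a minimal script sketch of the same ragged-file case with a different fill value. The -1 sentinel and the temporary file are assumptions for illustration only. Going by the behaviour shown above, the missing fourth column of the first row should come back as -1 instead of 0.

```perl
use strict;
use warnings;
use PDL;                       # also loads PDL::IO::Misc, which provides rcols
use File::Temp qw(tempfile);

# Recreate the ragged in34.cols example: the first row has only 3 columns.
my ($fh, $fname) = tempfile(UNLINK => 1);
print {$fh} "1 2 3\n1 2 3 4\n1 2 3 4\n";
close $fh or die "close failed: $!";

my $data;
{
    # Hypothetical sentinel for "no data"; the default undefval is 0.
    local $PDL::undefval = -1;
    # Explicit column numbers, so the 4th column is not dropped.
    $data = rcols($fname, [0..3]);
}
print $data;
```

Using local keeps the changed fill value from leaking into any other code that relies on the default.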

Hope this helps,
Chris

On 7/6/2015 09:12, Chris Marshall wrote:

Ben-

You *really* need to look at the rcols() routine: pdldoc rcols
I don't know much about your data, as it seems a bit irregular
and is too large to verify the format, but I was able to slurp it
into a single 2D PDL:

pdl> $data = rcols 'sample.s321p', [], { EXCLUDE=>'/^[#\!]/' };
Reading data into piddles of type: [ Double ]
Read in 468018 elements.

pdl> ?vars
PDL variables in package main::

Name         Type   Dimension       Flow State          Mem
----------------------------------------------------------------
$data        Double D [52002,9]            -C 0.00KB
pdl> p cat($data->statsover)->mv(-1,0)  # the cat()->mv() makes the stats line up vertically
[
 [2326.8359 375207.06 1.5208731e-06 -0.97827545 61000000 4653.4882 375203.45]
 [-0.00028558513 0.0058766133 4.8714989e-06 -0.92295157 0.021193039 0.00058756778 0.0058765568]
 [0.0023870574 0.055326859 1.7099293e-06 -0.97826891 0.99688132 0.01008911 0.055326327]
 [-0.00025093421 0.0014797211 5.2744513e-06 -0.044927633 0.07611833 0.00054094668 0.0014797069]
 [0.0023014344 0.055315241 1.7512188e-06 -0.97827491 0.98798882 0.010026523 0.055314709]
 [-0.00024539816 0.0016970776 6.2292964e-06 -0.091357195 0.075922572 0.00054341477 0.0016970613]
 [0.0022908007 0.055309355 1.3068813e-06 -0.97826886 0.98751119 0.0098599264 0.055308823]
 [-0.00024407073 0.0017278177 6.06088e-06 -0.091357195 0.07590755 0.0005446816 0.0017278011]
 [-3.2840509e-07 5.2955969e-05 0 -0.0086078672 0 6.5678493e-07 5.2955459e-05]
]


pdl> p $data->stats
258.537987240325 125070.089699174 1.190347227888e-06 -0.9782754458665 61000000 517.071919985122 125069.956082351
pdl> ?vars
PDL variables in package main::

Name         Type   Dimension       Flow State          Mem
----------------------------------------------------------------
$data        Double D [52002,9]            P 3.57MB


Which is less than 4MB of data.  You could possibly read all the
100+ array data into one 2D PDL and then use slicing operations
and reshape to extract the matrix forms.  Try things out in the
pdl2 or perldl shell to see what works.
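
A rough sketch of that slice-and-reshape idea, using made-up data in place of the real file. The 3x3 size, the one-element-per-row layout, and the variable names are all assumptions; adjust them to however the matrices are actually laid out in your file.

```perl
use strict;
use warnings;
use PDL;

my $n   = 3;                        # assumed matrix size (n x n)
my $col = sequence(2 * $n * $n);    # stand-in for one column read by rcols:
                                    # two n x n matrices stored back to back

# Pull out the rows belonging to matrix number $k (0-based) ...
my $k     = 1;
my $first = $k * $n * $n;
my $last  = $first + $n * $n - 1;
my $block = $col->slice("$first:$last");

# ... and fold them into an n x n piddle.  reshape() works in place,
# so copy the slice first to leave $col untouched.
my $matrix = $block->copy->reshape($n, $n);
print $matrix;
```

The same slice bounds can be reused for every column of interest, so the file only has to be parsed once.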

Cheers,
Chris

On 7/5/2015 22:25, Benjamin Silva wrote:
Hi Chris,
I had just posted the code snippet with the hopes someone would look at it and see something obviously wrong. The actual subroutine that parses the input file is a bit more complex since the file I am parsing isn't all just matrix data; there is a header and some other stuff that I pull out, so there are a bunch of other loops and such going on in there. That said, I've included a working copy of the script as well as a sample file that it operates on. Any idea why it's so slow? Please forgive my awful code.
link to my code:
http://pastebin.com/JS9gTs6t

link to the sample file it operates on:
http://www.5plates.com/sample.zip
Note that this example looks relatively fast (4 seconds), but this script is reading just a single matrix from a file that houses just 2 matrices. I will be using this script on a file that has 100+ instances of these matrices. That means I'll have to read each of these 100 matrices, which makes 4 seconds turn into several minutes.
Please let me know if this helps, or if you can't read the files linked.

Thanks!
-Ben
On Sun, Jul 5, 2015 at 1:11 PM, Chris Marshall wrote:
    Hi Ben-

    It helps if you include a small working code example rather than
    an out-of-context snippet---something we can run.  Some thoughts:

    - rcols and wcols are useful for reading and writing 2D piddles

    - $zmatrix_pre seems to be a perl array object with N x M piddle
    elements

    - PDL is optimized for large data operations
       - Reading in complex values one-by-one is going to be very
         slow when you have 350x350 elements.
       - You should be reading all the values in one operation and
         creating the PDL from that.

    I can't help more since your code doesn't even set $row or $col but
    maybe the above thoughts will give you an idea.  You can use wcols()
    to write out a 2D piddle and rcols() to read it back.  Something like
    this might be applicable from a pdl2 session:

    pdl> $im = sequence(5,5)/25;

    pdl> $re = random(5,5);

    pdl> use PDL::Complex

    pdl> p i
    0 +1i

    pdl> $c = $re + i * $im;

    pdl> p $c

    [
     [0.971898 +0i 0.754039+0.04i 0.50257+0.08i 0.190826+0.12i 0.613931+0.16i]
     [0.132726 +0.2i 0.327291+0.24i 0.251733+0.28i 0.184122+0.32i 0.787163+0.36i]
     [0.103273 +0.4i 0.793739+0.44i 0.286722+0.48i 0.71684+0.52i 0.939528+0.56i]
     [0.114506 +0.6i 0.750494+0.64i 0.757878+0.68i 0.761478+0.72i 0.827088+0.76i]
     [ 0.69723 +0.8i 0.438457+0.84i 0.177937+0.88i 0.321631+0.92i 0.750218+0.96i]
    ]


    I recommend trying out small cases in the pdl2 or perldl shells
    to see how things work.  Once you see the patterns it is easier
    to apply to bigger data in a program or script.
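
To sketch the "read all the values in one operation" idea: assuming the file holds one "real imaginary" pair per line (as the split in the original snippet suggests) and the matrix is n x n, something along these lines reads both columns with a single rcols call instead of building elements one at a time. The file contents and the 2x2 size are invented for illustration.

```perl
use strict;
use warnings;
use PDL;
use File::Temp qw(tempfile);

# Invented sample: 4 lines of "real imag", i.e. one 2x2 complex matrix.
my ($fh, $fname) = tempfile(UNLINK => 1);
print {$fh} "1.0 0.5\n2.0 -0.5\n3.0 0.25\n4.0 0.0\n";
close $fh or die "close failed: $!";

# One call reads column 0 and column 1 into two 1D piddles.
my ($re, $im) = rcols($fname, 0, 1);

# Fold each into matrix form (reshape works in place).
my $n = 2;
$re->reshape($n, $n);
$im->reshape($n, $n);

# With PDL::Complex loaded, the two parts can then be combined in a
# single vectorized step, as in the session above:
#   my $z = $re + i * $im;
print $re, $im;
```

The point of the design is that the per-line Perl loop disappears: parsing happens once inside rcols, and the complex piddle is built from whole arrays rather than element by element.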

    Cheers,
    Chris



    On 7/5/2015 14:57, Benjamin Silva wrote:
    Hello,

    I recently recoded some of my old scripts to use the PDL
    libraries instead of some subroutines I had written myself.
    These scripts were for doing matrix inversion and matrix
    manipulation for medium sized matrices of complex numbers
    (~350x350 max matrix size). These matrices are housed in plain
    text files, and I have a parser that goes through and builds a
    piddle out of the data in the file.  My old subroutines were
    slow to do the processing, but were extremely fast for reading
    in the file and creating the matrix.  The new subroutine, using
    PDL, is extremely fast to do the processing, but now it is crazy
    slow for reading in the file and creating the initial PDL.  My
    method for creating the PDL is shown below.  Can anyone please
    let me know if there's a faster way to do this?  Almost all of
    the time savings I've achieved by going to PDL have been
    consumed by the slower file parsing and PDL building.  I've run
    the code through a profiler, and it's definitely wasting a lot
    of cycles on the 3rd line in the while loop where I'm creating
    the cplx data structure.

    open (FILE, "$input_file") or die;
    while($inline1=<FILE>){
    chomp $inline1;
    @data = split(/\s+/, $inline1);
    $zmatrix_pre[$row][$column] = cplx($data[0]+$data[1]*i);
    }

    Thanks for any help!
    -Ben
