Re: [Perldl] PDL and large data structure per cell in a large array

Craig DeForest Mon, 30 Mar 2009 20:14:02 -0700

Hey, all,

Thanks, Derek, for the nice explanation.  Sorry to have been out for a  
couple of days.  I do have a couple of minor things to add about  
neighborhood-based calculations:


For neighborhood-based computing on moderately sized arrays (less than  
a few hundred million elements) range() is very nice -- you can use it  
to stack your base array up so that the threading engine handles the  
neighborhoods for you.  This is quite wasteful of memory, as range()  
makes an explicit copy of the initial array into the rearranged data  
structure, but for one-off or small problems that is a good trade:   
using range can speed up *coding* such calculations by a large factor.

Life in a couple of lines:

        # Advance a 2-D game-of-life field by one generation, in place.
        sub life_advance {
          my $f = $_[0]->range(ndcoords($_[0])-1,3,'t')->reorder(2,3,0,1);
          my $n = $f->clump(2)->sumover;
          $_[0] .= (($n==3) | (($n==4) & ($_[0]==1)));
        }
        
That takes a 2-D integer PDL and advances it as a game of life.  It  
works by "stacking up" all neighborhoods into a 3 x 3 x NX x NY  
hypercube (in the first line), then calculating the neighbor count in  
each neighborhood (in the second line).  The third line sets the value  
of the playing field according to the number of neighbors.  (The  
second half of the expression handles the case where the central cell  
is alive and has three alive neighbors -- making four total live cells  
in the neighborhood).

Clearly the last two lines can be replaced with any simple  
expression that acts on a single neighborhood.  The threading engine  
takes care of the rest.
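
As a quick sanity check of the routine above, here is a small sketch that
runs it on a "blinker" (three live cells in a row).  It assumes PDL is
installed and repeats the routine with $_[0] used throughout.  The 5 x 5
field and the name $field are just illustrative choices.

```perl
use strict;
use warnings;
use PDL;

# Advance a 2-D game-of-life field by one generation, in place.
sub life_advance {
  # Stack every 3x3 neighborhood into a 3 x 3 x NX x NY array.
  my $f = $_[0]->range(ndcoords($_[0])-1, 3, 't')->reorder(2,3,0,1);
  # Count live cells per neighborhood (central cell included).
  my $n = $f->clump(2)->sumover;
  # Alive next step: 3 live cells total, or 4 total with the center alive.
  $_[0] .= (($n==3) | (($n==4) & ($_[0]==1)));
}

# Hypothetical test field: a horizontal blinker in the middle of a 5x5 grid.
my $field = zeroes(long, 5, 5);
$field->slice('1:3,2') .= 1;

life_advance($field);
print $field;
```

If the routine behaves as described, the horizontal blinker should come
back as a vertical one, still with three live cells.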

The rub lies with the implementation of range(): it explicitly copies  
data into the new structured array from the old one, since the  
addresses in the index field are arbitrary.  That means the first line  
expands the footprint of the initial array by a factor of 10 (9 for  
the new array, plus the original one, which still exists).
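
To put rough numbers on that factor of 10, here is a back-of-the-envelope
sketch in plain Perl.  The helper name range_footprint is made up for this
example, and it assumes a field of 8-byte doubles at the 1500 x 1800 size
mentioned in the quoted message below.

```perl
use strict;
use warnings;

# Bytes held when every ($side x $side) neighborhood of an NX x NY
# field is copied out: the original, the stacked copy, and their sum.
sub range_footprint {
    my ($nx, $ny, $bytes_per_elem, $side) = @_;
    my $original = $nx * $ny * $bytes_per_elem;
    my $stacked  = $original * $side * $side;
    return ($original, $stacked, $original + $stacked);
}

# Assumed example: 1500 x 1800 doubles with 3x3 neighborhoods.
my ($orig, $stacked, $total) = range_footprint(1500, 1800, 8, 3);
printf "original: %.1f MB, stacked: %.1f MB, total: %.1f MB\n",
    $orig / 1e6, $stacked / 1e6, $total / 1e6;
```

For a field of that size the stacked copy is still modest, so the trade is
usually fine.  It stops being fine as the base array approaches the
hundreds of millions of elements mentioned earlier.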

PDL does work on 64-bit machines but is not well tested with single  
arrays larger than the 31-bit limit of 2 GB.  Fixing that is on many  
people's wish lists.

Best,
Craig




On Mar 30, 2009, at 6:15 PM, Derek Lamb wrote:

> P Kishor wrote:
>> First, Derek, thanks for explaining things so gently. Obviously, I am
>> super newbie with PDL. Now, onward...
>>
>> On Mon, Mar 30, 2009 at 5:00 PM, Derek Lamb  
>> <[email protected]> wrote:
>>
>>> P Kishor wrote:
>>>
>>>> Here is a large data structure --
>>>>
>>>> my $pdl = pdl (
>>>>       (
>>>>               1 .. 10,
>>>>               [ [1 .. 33], x $d ], # d arrays
>>>>               [ [1 .. 57], x $l ], # l arrays
>>>>               [ [1 .. 9 ], x $m ], # m arrays
>>>>       )
>>>> );
>>>>
>>>> $d is BETWEEN 0 and 5
>>>> $l is BETWEEN 1 and 10
>>>> $m is 7300, but could be as high as 18,000 or 20,000
>>>>
>>>> my $size = howbig($pdl->get_datatype);
>>>> print "size of pdl is: $size\n";
>>>> print $pdl->info("Type: %T Dim: %-15D State: %S"), "\n";
>>>> my $n = $pdl->nelem;
>>>> print "There are $n elements in the piddle\n";
>>>>
>>>> I get the following --
>>>>
>>>> size of pdl is: 8
>>>> Type: Double Dim: D [57,7300,13] State: P
>>>> There are 5409300 elements in the piddle
>>>>
>>>> Makes sense so far, but what does that "size of pdl is: 8" mean?
>>>> Surely, that is not the number of bytes being used by this data
>>>> structure?
>>>>
>>> Of course not.  The docs say that howbig 'Returns the size of a piddle
>>> datatype in bytes.'  You have a piddle of type double.  Doubles take 8
>>> bytes each.
>>>
>>>
>>>> By my calculations, the data structure weighs in at about
>>>> 450 KB packed as a Storable object. By the way... in the pseudo code
>>>> above, I have shown the number of elements in the arrays, not the
>>>> actual values. So, for example, in each of the 'd' arrays, there are
>>>> 33 elements, but only about 4 or 5 of them are INTEGERS, the rest
>>>> being REAL numbers. This is useful to get a sense of the size of the
>>>> data structure.
>>>>
>>>>
>>> Perhaps useful to people, but not so useful to PDL.  If you have a
>>> five-element piddle and 4 elements are integers and 1 is a double, then
>>> the whole thing is promoted to double.  The efficiency of PDL is derived
>>> mainly from knowing the byte-size of the elements of a piddle a priori.
>>> If you want to mix ints and doubles like this, you probably need to
>>> rethink your data structure.  You can use plain old Perl lists, which
>>> don't require uniform typing, but the overhead will probably kill you.
>>> Hashes or lists of piddles are also an option to consider.
>>>
>>>> Now, this data structure is the data for computation that is applied
>>>> to a large array, say, 1000 x 1000 or even 1500 x 1800, so between a
>>>> million and a couple of million or more elements, on a cell by cell
>>>> basis. Imagine applying f(d) to the array where d is the data
>>>> structure, with f(d) being applied to each cell individually.
>>>>
>>>> Curious to test the limits of my machine and PDL, I tried to create a
>>>> piddle that held 1_000_000 such structures, and I got a 'bus error'.
>>>> At an array with 100 elements, I got a segmentation fault. At an array
>>>> with 10 elements, it worked.
>>>>
>>>>
>>> And probably with the 10^6 example you got a computer brought to its
>>> knees trying to allocate about 40 TB of memory.  If I understand you
>>> correctly (let me know if I don't), you want to create a super-piddle
>>> that is (to use your examples here) 10^6 by 57 by 7300 by 13.  Simple
>>> calculation shows that the base piddle $pdl is 41 MB, so if you want a
>>> million of these you need 41 million MB of memory somewhere.  10 of
>>> those is not such a big problem, 100 might work if you have several GB
>>> of memory, but 10^6 is just crazy.  You probably need to rethink how
>>> you're doing things there.
>>>
>>
>> Yes, only now do I realize that PDL pads everything up to make for n-d
>> arrays with no holes. Yes, 100 of these piddles would be more than 4
>> GB of memory. I have 32 GB per machine, but, I believe Perl can
>> address only less than 4 GB memory per process, no? or, is it 2 GB
>> (the 32-bit programs limit).
>>
>>
> I'm not sure--I've never run up against a Perl process limit.  Remember
> that piddles are treated differently than perl SVs, so that limit may or
> may not apply.  But you could test it by saying
> perldl> $a = zeroes(3*1024*1024*1024)
> which should try to allocate a 24 GB piddle.  Fun stuff.
>
>> In any case, you understood the problem correctly. We have an area of
>> 10^6 or up to 2*10^6 cells. Each one of those cells has that 57x7300x13
>> (or, even 57x18000x13) piddle data structure (all depends on the
>> number of years of weather data... 20 years is 7300 rows, 50 years is
>> 18250 rows, and so on). For now, thankfully, each cell is independent.
>> In the future, things might become more interesting when each cell
>> might start depending on what happens in its neighboring cell, kinda
>> like the game of life (has anyone used PDL to do game of life?), but
>> that is not the case for now.
>>
> I think Craig has a version that did it in 3 or 5 lines or something.
>
>> Seems like the best thing might be to break up the area into smaller
>> chunks of n cells so that n x 57 x 7300 x 13 fits into the memory of a
>> single Perl process and then run multiple processes concurrently using
>> up the multiple cores in the computer.
>>
>> Guidance on how to achieve this would be very much appreciated. PDL is
>> making life with Perl seem even more interesting, and I am quite eager
>> to at least try out PDL in this work. If it doesn't work then it
>> doesn't work, but I do want to give it a shot.
>>
> I'm pretty sure (but could be completely wrong) that Perl does not
> support multiple cores automatically.  This functionality is not yet in
> PDL either.  But there is a Perl fork, which calls your system fork, so
> you might be able to cook something up that way.  I don't have any of my
> books with me right now, so I can't provide specifics.
>>
>>
>>
>>
>>> Derek
>>>
>>>
>>>> I am seeking some suggestions on how to work with such data using  
>>>> PDL.
>>>>
>>>> Many thanks,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
> _______________________________________________
> Perldl mailing list
> [email protected]
> http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
>


