Hi David,

Thanks for your great answer - I appreciate the time you spent looking
into this.

On Wed, Sep 30, 2009 at 10:43 PM, David Mertens <[email protected]> wrote:

> Welcome to PDL!  As you have surmised, it is a language that is dominated by
> the sciences; it was originally created by an astronomer (or
> astrophysicist?), so that's why anything complicated deals with image
> processing.

I read it is going to be included as a new Perl6 datatype so I was
hoping that its applications would not be limited to a single domain.
Perl "arrays" have always been lacking so it is a welcome addition.

> I am also something of a beginner and I would like to see the
> introductory documentation improved, but as with all things it takes time.
> I can't address all your questions, but I can point you in the direction of
> some useful basics that you seem to have missed (which given the
> organization of the documentation, is not surprising).

There seems to be a lot of documentation, but there are not many
examples of practical code. The wiki has very limited information too.

> PDL is meant to handle these sorts of number-crunching problems, but it
> really works best with vector-oriented calculations, which I discuss near
> the end of my response.  I like that you're posing a business-oriented
> question, and it's a shame that such questions are unusual on this mailing
> list.  Keep them coming!

This is why I posted this question. All my searches led to examples of
image-processing or galaxy-crunching applications. While it is great
that astronomers get proper tools to handle their data, a lot of us
would be happy with better arrays in Perl.

It seems indeed that PDL is focused on vector-oriented calculations.
What I liked about it is its ability to handle large data structures
in memory, something that Perl does quite poorly.

> I know nothing about databases, but thankfully Doug already answered this
> part of your question.

Indeed he did, and it's really simple.

>> ($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4],
>> DEFTYPE => long });
>> This creates a list of 1D piddles.
>
> Did this work for you?  It did not work for me, at least not as expected.

It did work as expected, maybe because of the perlcols => [4] that you
did not include?

>> $all = cat(rcols ("payments.csv"))
>> This created a 374540 x 4 array of type DOUBLE. Couldn't manage to get
>> it to create an array of type LONG:
>
> Are you sure about your dimensions?  I got 477 x 1 array (my file is 477
> lines long) because it's reading only the first column in for me.  I think I
> get what you want if I specify the columns:
>
>> $all = cat(rcols ("payments.csv", 0,1,2,3))
>     Reading data into piddles of type: [ Double Double Double Double ]
>     Read in  477  elements.
>
> Looks right to me.

Will check the dimensions as soon as I am back to my workstation (am
traveling at the moment).

>> $all = cat(rcols ("payments.csv", {DEFTYPE => long}, 0,1,2,3))
>     Reading data into piddles of type: [ Long Long Long Long ]
>     Read in  477  elements.
>
> Looks right to me.  However, you should probably stick with the first
> attempt anyway, since the data really are distinct and should be treated as
> such.

Right, this is an important point. My data is really 4 vectors of
different unrelated data.

> So this is how you use rasc... I never could figure it out, but as you say,
> you must know the size ahead of time.

The speed difference is huge - rasc was almost 10 times faster. The
number of lines can be easily obtained with a call to "wc -l" so it
might be worth it.

> In case you haven't come across it yet, there's an important difference
> between the '=' operator and the '.=' operator.  You'll probably appreciate
> it if you ever try to assign values to a slice, but I won't go into it now.
> It's adequately discussed in the docs.

I've noticed the discussion about it and the 'dataflow' concepts,
which might be very useful in some applications.
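
A minimal sketch of the '=' versus '.=' difference on a slice, using
made-up values rather than the payments data (the variable names are
only for illustration):

```perl
use strict;
use warnings;
use PDL;

my $data  = sequence(5);            # [0 1 2 3 4]
my $slice = $data->slice('1:2');    # a view onto elements 1 and 2 of $data

$slice .= 0;                        # '.=' writes through the slice into $data
print "$data\n";                    # [0 0 0 3 4]

$slice = pdl(9, 9);                 # '=' only rebinds the Perl variable $slice
print "$data\n";                    # $data is unchanged: [0 0 0 3 4]
```

So '.=' is the one to use when the goal is to change values inside an
existing piddle.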

>> $sub = $all->slice('2,:');         # Subscribers' column
>
> First of all, this slicing operation is only really necessary when you load
> all the data into a single array.  You don't have to do that, and probably
> shouldn't

Noted.

>> # This shouldn't be two-dimensional...
>> $sub = $sub->reshape();
>> p $sub->dims;
> 477

I like that, was looking for it!

> This is made possible by NiceSlice, a source filter.  Read the documentation
> for it.  If you're impatient, jump to section 5 and read the sections
> "Parentheses following a scalar variable" and then jump to "The argument
> list" and read until you get bored or hit section 6.  Then read about the
> default method invocation, just to be sure you're aware of it.

Ok. What a steep learning curve for someone who just wants to do
generic array operations!

> Finally, read up on the 'where' and 'which' functions, which are alluded to
> in the NiceSlice documentation, and which we will use shortly.  You can find
> a good summary by typing 'help where' and 'help which' at the PDL prompt, or
> reading the PDL::Primitive documentation (which will tell you about a whole
> bunch of other useful functions).

I did see those functions (and you can see I used 'which' in the code
I posted) and they are amazingly useful functions. I'll have to do
some performance tests to see how they compare to other methods, but
my guess would be that they're very efficient.
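
For reference, a small sketch of how the two differ, with made-up
amounts (not the payments data): 'which' returns the indices where a
condition holds, 'where' returns the matching values.

```perl
use strict;
use warnings;
use PDL;

my $amounts = pdl(5, 12, 7, 30, 1);            # hypothetical values

my $idx  = which($amounts > 6);                # indices of matches
my $vals = $amounts->where($amounts > 6);      # the matching values

print "$idx\n";                                # [1 2 3]
print "$vals\n";                               # [12 7 30]
```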

> Also consider using dims() (or getdims) and dim (or getdim), which are a bit
> more precise.

Noted.

>> %count = (); # Hash of number of purchases for each subscriber
>>
>> for ($i=0; $i<$rows; $i++) { $count{$all->at(3,$i)}++; } # populate hash
>
> TYPO? I used 'at($i,2)' to get this to work.

No typo, this is how it worked for me. at(3,$i) means (in my
understanding) "at column 3, row $i". The subscriber ID is in column 3
in my data.
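
The argument order of at() follows the order of the piddle's dims, so
which form is right depends on how $all was built. A small sketch with
made-up vectors (not the payments file) shows what cat() produces:

```perl
use strict;
use warnings;
use PDL;

my $first  = pdl(10, 11, 12);
my $second = pdl(20, 21, 22);
my $both   = cat($first, $second);      # stacks the vectors along a new dim

print join('x', $both->dims), "\n";     # 3x2: dim 0 runs along each vector
print $both->at(1, 0), "\n";            # 11: element 1 of the first vector
print $both->at(2, 1), "\n";            # 22: element 2 of the second vector
```

With a piddle built this way the element index comes first. With a
piddle whose first dim is the 4 columns, the column index comes first
instead.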

> This is definitely the Perl way to go about doing this.  I don't think there
> is a PDL way to do this, however.  We discussed a related question on the
> mailing list a couple of months ago entitled "Compute a distribution
> function from irregular data", which suggests to me a different approach:
>
>> $uniq_subs = $sub->uniq();
>> for ($i = 0; $i < $uniq_subs->dim(0); $i++) { $count{$uniq_subs($i;-)} =
>> $sub->where($sub == $uniq_subs($i))->dim(0); }
>
> but this appears to be MUCH slower than your technique.    Generally
> speaking, however, you should avoid explicit for loops along large
> dimensions unless it's absolutely necessary.  For this problem, I think it
> is actually necessary.

For many array operations that I work with (the mundane type, e.g.
data mining, log analysis, accounting, transaction logs, etc.) it is
necessary to scan the entire dimension.
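
When a full scan is unavoidable, one option is to pull the values out
once with list() and count them in a plain Perl hash, rather than
calling at() per element. A sketch with made-up subscriber IDs:

```perl
use strict;
use warnings;
use PDL;

my $sub = pdl(7, 3, 7, 7, 3, 9);    # hypothetical subscriber IDs

my %count;
$count{$_}++ for $sub->list;        # list() unpacks the piddle into a Perl list

print "$_: $count{$_}\n" for sort { $a <=> $b } keys %count;
```

This copies the data into Perl scalars, so it trades memory for
simplicity; I have not measured it against the at() loop.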

>> - Do I really have to know the number of rows to dimension my LONG array
>> before doing an rasc()?
>> - Is "$sub = $all->slice('2,:')" the proper way to get the third column of
>> my piddle? Can it be written in a nicer way?
>> - Is the "at" function the proper way to address an element of the array?
>> Really?
>
> First question - I think so.  I'm not sure, but you can dig through the
> source code to check.

Good point.

> Second and Third questions: already answered - use NiceSlice.

Ok.

>> Example 2: We want to find out how frequently subscribers re-purchase
>>
>> - This code works, but is this really how things should be done?

> No.  When working with PDL, if you ever feel tempted to write a for-loop,
> you should be on guard.  With vectorized languages, such as PDL, Matlab,
> IDL, Octave, etc, you want to express as much as you can with vector
> operations rather than element-by-element calculations.

Right. I'm thinking too much like a database programmer I guess.

> So for this problem, here's how I would do it:
>
> %avg_lapses = ();
>
> foreach (keys %count) # can't get rid of this for loop, unfortunately
> {
>        next if $count{$_} < 2;
>        ($purchase_times, $purchase_durations) = where ($tx, $time, $sub,
> $sub  == $_);
>        $purchase_count = $purchase_times->nelem();
>
>        $start_times = $purchase_times(0:-2);
>        $start_durations = $purchase_durations(0:-2);
>        $next_times = $purchase_times(1:-1);
>
>        $lapses = $next_times - $start_times + $start_durations * $daysec;
>        $avg_lapses{$_} = $lapses->avg;
> }

This is neat. I will try it asap!
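
The core of it, as I understand it, is subtracting a piddle from a
shifted copy of itself. A stripped-down sketch with made-up timestamps,
using plain slice() instead of NiceSlice and ignoring the duration
term:

```perl
use strict;
use warnings;
use PDL;

my $times = pdl(100, 250, 400, 900);       # hypothetical purchase times

my $earlier = $times->slice('0:-2');       # all but the last element
my $later   = $times->slice('1:-1');       # all but the first element
my $gaps    = $later - $earlier;           # gap between consecutive purchases

print "$gaps\n";                           # [150 150 500]
print $gaps->avg, "\n";                    # mean gap
```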

> Now %avg_lapses contains the average lapse time per subscriber.  This is
> both more concise and much faster, since the differencing operation is
> computed using C-code.  Also, though I could be wrong about this, it doesn't
> even require more memory since the $start... and $next... piddles are
> virtual piddles.  If you want the global average, you can insert an
> accumulator in the foreach loop, and divide that by the number of entries in
> the %count hash.

Very nice, and it seems a much better way to use piddles. This is
exactly the kind of comment I was looking for... thanks!

>> - Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right
>> way to access the piddle's data?
>
> Yes.  You can also use NiceSlice.  Note that if you need to change a value,
> you can use slicing together with the '.=' assignment operator, not the '='
> operator, which won't do what you want.

Ok.

> With vectorized languages, you try to construct everything so that you DON'T
> need to scan every element of a piddle.  If you must do that, yes, a for
> loop is pretty much the only way to go.  For speed, you could always write a
> routine in PDL::PP, but I don't know how to do that yet.

I had a quick look in PDL::PP and got dizzy ;)

> Like a histogram?  PDL can do that, try 'help histogram' for starters.
> There are some basic statistics routines available using $piddle->stats and
> there's been a recent contribution of a much more advanced statistics
> library called PDL::Stats.

Histograms! Yes that's what I wanted. It looks promising, I will
experiment and post back my code.
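
As a starting point, a sketch of histogram() on made-up values. The
arguments are the data, the bin width, the lower edge of the first bin,
and the number of bins; the numbers here are chosen only for
illustration.

```perl
use strict;
use warnings;
use PDL;

my $purchases = pdl(1, 1, 2, 4, 4, 4);          # hypothetical per-subscriber counts

# Bin width 1, first bin starts at 0, five bins covering 0..4.
my $hist = histogram($purchases, 1, 0, 5);

print "$hist\n";                                # [0 2 1 0 3]
```

Values outside the range end up in the first or last bin, so the range
needs to cover the data.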

> I hope that helps!  Post back with more questions.  It's got an annoying
> learning curve and it's easy to miss seemingly basic stuff, but keep at it.

It did help tremendously, once again I really appreciate your kind answer.
I will post back some questions once I test more code, in a few days.

Regards
Emmanuel
