On 5/6/2010 10:01 PM, Daniel Carrera wrote:
> Chris Marshall<[email protected]>  wrote:
>
>>> 1) Why is that O(N**3) memory ops? I'm not very versed with PDL,
>>> but it looks to me that ->dummy() and ->xchg() should not cost any
>>> memory. Is there a reason why ->inner() costs extra memory?
>>
>> The inner product takes 2 O(N) vectors and produces O(1) result.
>> The threading is over O(N**2) elements of the output matrix.
>> O(N) * O(N**2) is O(N**3)
>
> Wait... it's doing the inner product in parallel?  Why? Unless PDL was
> dividing the work among different CPUs (and my system monitor clearly
> shows that it isn't) what would be the purpose of computing N**2 dot
> products at the same time instead of one at a time?

What do you mean doing the inner product in parallel?

The above was just the operation counts.
An inner product is O(N) operations.
To generate the output matrix you have to loop over O(N**2) elements.

Unless you are using a parallel PDL, I doubt the inner products
are being done in parallel.  What makes you think that they are
done at the same time?

> Secondly, why does it need a brand-new copy of the vectors for each
> computation? If vector/column 'v' is involved in N computations, that
> doesn't mean it has to be copied N times in memory.

It doesn't---if you write the algorithm as a matrix multiply
kernel rather than a dot-product kernel.  And yes, unless the
implementation reuses the values explicitly, then they will
be reloaded.

The good news is that is what caches are for which is why things
aren't so bad for smaller matrices.  This type of optimization
is what is needed to be done to speed up PDL's matrix multiply.
Since you have O(N**3) computations and O(N**2) memory accesses,
for large matrix multiplies you can completely hide the memory
access cost---if it is implemented to do that...

> I haven't thought much about algorithms in years, so maybe my brain is
> numb, but this all seems odd to me. I would have thought that the inner
> product would be just 3 for loops and require no additional memory
> besides what's needed for the input (2 x N**2) and the output (N**2).

I'm just talking about the current implementation and not
the high performance one.  That was the whole point---it is
not implemented that way at the moment.

--Chris

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl

Reply via email to