If you could comment on the PR that would be great. In this case I know which 
code you are talking about.

1) OK, this is a good catch. I didn’t know that CheckpointedDrmSpark, or really 
all Drms, are meant to be immutable, which is actually documented in 2.4 of the 
DSL PDF. I think this is what Ted was saying too, assuming I knew it was 
supposed to be immutable. Scala puts “immutable” in the fully qualified class 
name to flag the fact. I wonder if that’s a good idea here?

2) I’m talking about the R semantics for rbind. Out of the box R is only dense, 
so the semantics are dense by definition. Putting in all-zero rows adds a bunch 
of 0.0 doubles to a matrix. I’m saying you don’t even need or want the empty 
row keys. This is certainly not what we want in a sparse vector or matrix 
unless needed. Please defer to Dmitriy, Sebastian, or Ted on this; they may 
well contradict me.

3) If I did an rbind, do you want me to overload it to take an Int and only 
touch _nrow (not even sure this is possible; I haven’t looked)? Is this really 
what you want?


> On Jul 17, 2014, at 4:58 PM, Anand Avati <[email protected]> wrote:
> 
> And I still really doubt if just fudging nrow is a "complete". For e.g, if
> after fixing up nrow (either mutating or by creating a new CheckpointedDrm
> as I described in my previous mail), if you were to do:
> 
>  drmA = ... // somehow nrow is fudged
> 
>  drmB = drmA + 1 // invoke OpAewScalar operator
> 
> I don't see how this would return the correct answer in drmB. mapBlock() on
> drmA is just not performed on those "invisible" rows for the "+ 1" to be
> applied on the cells.

Seems like a good test. It certainly can be done correctly, given my 
understanding below, but I’m not sure whether it currently is.

First, you are creating a dense matrix from a sparse one: drmB is really a 
non-sparse matrix that is distributed. This requires that all non-existent rows 
and columns be created in the new matrix. The map would be over all row IDs 
from 0 up to nrow, and each Vector element needs to have 1 added, even the 
non-existent ones, so you need to use the vector iterator that visits every 
element rather than only the stored ones. There are several cases where dense 
matrices are created from sparse ones, such as factorization. Assumptions about 
ordinality and the row or column IDs allow this to happen: new dense rows and 
elements can be created on the assumption that key ordinality and nrow 
determine which rows (or columns) are missing. The point is not to force 
anything dense unless needed, as it is in your case above.
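
To make the "+ 1" case concrete, here is a minimal sketch in plain Python 
rather than the Mahout Scala DSL. Everything in it is an assumption for 
illustration: the matrix is modeled as a dict of stored rows keyed by row ID, 
and the two function names are hypothetical, not anything in the PR. It only 
contrasts mapping over stored rows with mapping over every ID below nrow.

```python
from typing import Dict, List

# Hypothetical stand-in for a distributed row matrix: only rows that hold
# data are stored, keyed by integer row ID; nrow/ncol are tracked separately.
SparseRows = Dict[int, Dict[int, float]]


def add_scalar_stored_rows_only(
    rows: SparseRows, ncol: int, s: float
) -> Dict[int, List[float]]:
    """Map over stored rows only; rows without a key are silently skipped."""
    return {
        r: [cols.get(c, 0.0) + s for c in range(ncol)]
        for r, cols in rows.items()
    }


def add_scalar_all_rows(
    rows: SparseRows, nrow: int, ncol: int, s: float
) -> Dict[int, List[float]]:
    """Map over every row ID in [0, nrow), treating missing rows as zeros."""
    return {
        r: [rows.get(r, {}).get(c, 0.0) + s for c in range(ncol)]
        for r in range(nrow)
    }


# A 4 x 3 matrix in which only rows 0 and 2 hold any data.
drm_a: SparseRows = {0: {1: 5.0}, 2: {0: 2.0, 2: 7.0}}

partial = add_scalar_stored_rows_only(drm_a, ncol=3, s=1.0)
full = add_scalar_all_rows(drm_a, nrow=4, ncol=3, s=1.0)
```

In this model `partial` has no entries for rows 1 and 3, which is the wrong 
answer Anand describes, while `full` has all four rows because it is driven by 
nrow and the key ordering rather than by the stored keys.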

The question is good, and I admit that my knowledge of this is not the best, so 
please defer to the experts.

> 
> I think rbind() is the safest approach here. Also, I'm not sure why you
> feel rbind() is only for "dense" matrices. If the B matrix for rbind
> operator was created from drmParallelizeEmpty() (as shown in the example in
> the commit), the DrmRdd will be holding only empty RandomAccessSparseVectors
> and will be significantly less expensive than a dense operation.

I didn’t say that rbind is only for dense matrices; at least that isn’t what I 
meant. If it requires me to calculate missing row IDs for no reason, it’s 
wrong. It violates my understanding of sparse semantics: don’t touch 
non-existent data in vectors or matrices unless needed (as in your example of 
adding 1 to all elements, even non-existent ones). Also, I meant that we 
shouldn’t be inflexibly bound to R, since there are a few sparse cases that 
don’t fit, and this seems like one.
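
A rough sketch of the two options being argued over, again in plain Python as 
a model and not the actual Drm classes. `SparseRowMatrix`, `with_nrow`, and 
`rbind_empty` are hypothetical names I made up for illustration. Both methods 
return a new object and leave the original untouched, so neither needs to 
mutate anything; the difference is whether empty row keys get created.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class SparseRowMatrix:
    """Hypothetical immutable model: stored rows keyed by row ID, plus shape."""
    rows: Dict[int, Dict[int, float]]
    nrow: int
    ncol: int

    def with_nrow(self, new_nrow: int) -> "SparseRowMatrix":
        """Return a copy with a larger nrow; no row keys are created."""
        if new_nrow < self.nrow:
            raise ValueError("new_nrow must not be smaller than nrow")
        return replace(self, nrow=new_nrow)

    def rbind_empty(self, extra: int) -> "SparseRowMatrix":
        """Return a copy with `extra` explicit empty rows appended."""
        padded = dict(self.rows)
        for r in range(self.nrow, self.nrow + extra):
            padded[r] = {}  # an empty sparse row, but still a stored key
        return replace(self, rows=padded, nrow=self.nrow + extra)


a = SparseRowMatrix(rows={0: {1: 5.0}}, nrow=1, ncol=3)
b = a.with_nrow(4)    # same logical shape as c, one stored row
c = a.rbind_empty(3)  # same logical shape as b, four stored rows
```

Both `b` and `c` report nrow of 4, but `c` carries three extra keys whose only 
content is "nothing here". Those empty rows are cheap, as Anand says, yet they 
are exactly the missing row IDs I would rather not have to compute.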
