If you could comment on the PR that would be great. In this case I know where the code is you are talking about.
1) OK, this is a good catch. I didn’t know CheckpointedDrmSpark or really all Drms are to be immutable, which is actually documented in 2.4 of the DSL PDF. I think this is what Ted was saying too, assuming I knew it was supposed to be immutable. Scala puts “immutable’ in the fully qualified class name to flag the fact. Wonder if that’s a good idea here? 2) I’m talking about the R semantics for rbind. Out of the box R is only dense so the semantics by definition are dense. Putting in all zero rows adds a bunch of 0.0 doubles to a matrix. I’m saying you don’t even need or want the empty row keys. This is certainly not what we want in a sparse vector or matrix unless needed. Please rely on Dmitriy, Sebastian, or Ted about this and maybe they can contradict me. 3) If I did an rbind do you want me to overload it to take an Int and only touch _nrow (not even sure this is possible—haven’t looked)? Is this really what you want? > On Jul 17, 2014, at 4:58 PM, Anand Avati <[email protected]> wrote: > > And I still really doubt if just fudging nrow is a "complete". For e.g, if > after fixing up nrow (either mutating or by creating a new CheckpointedDrm > as I described in my previous mail), if you were to do: > > drmA = ... // somehow nrow is fudged > > drmB = drmA + 1 // invoke OpAewScalar operator > > I don't see how this would return the correct answer in drmB. mapBlock() on > drmA is just not performed on those "invisible" rows for the "+ 1" to be > applied on the cells. Seems like a good test. I certainly can be done correctly given my understanding below—not sure if it is. First you are creating a dense matrix from a sparse one—drmB is really a non-sparse matrix that is distributed. This requires that all non-existent rows and columns be created in the new matrix. The map would be over over all IDs from 0 to nrow, also each Vector elements needs to have 1 added, even the non-existent ones so you need to use the right vector iterator. There are several cases where dense matrices are created from sparse ones like factorization. Assumptions about ordinality and the row or columns IDs allow this to happen. So new dense rows and elements may be created assuming that key ordinality and nrow can be used to determine missing rows (or columns). The point would be not to force a dense anything unless needed, as in your case above. The question is good and i admit that my knowledge of this is not the best so please refer to the experts. > > I think rbind() is the safest approach here. Also, I'm not sure why you > feel rbind() is only for "dense" matrices. If the B matrix for rbind > operator was created from drmParallelizeEmpty() (as shown in the example in > the commit), the DrmRdd will be holding only empty RandomAccessSparseVectors > and will be significantly less expensive than a dense operation. I didn’t say that rbind is only for dense matrices, at least that isn’t what I meant. If it requires me to calculate missing rows IDs for no reason, it’s wrong. It violates my understanding of sparse semantics—don’t mess with non-exsistant data in vectors of matrices unless needed (as in your example of adding 1 to all elements, even non-existant ones). Also I meant that we shouldn’t be inflexibly bound to R since there are a few sparse cases that don’t fit and this seems like one. >
