Replies inline.
On Fri, Jul 18, 2014 at 12:42 PM, Pat Ferrel <[email protected]> wrote:

> If you could comment on the PR that would be great. In this case I know
> where the code is you are talking about.
>
> 1) OK, this is a good catch. I didn't know CheckpointedDrmSpark, or really
> all DRMs, are supposed to be immutable, which is actually documented in
> section 2.4 of the DSL PDF. I think this is what Ted was saying too,
> assuming I knew it was supposed to be immutable. Scala puts "immutable" in
> the fully qualified class name to flag the fact. Wonder if that's a good
> idea here?
>
> 2) I'm talking about the R semantics for rbind. Out of the box R is only
> dense, so the semantics by definition are dense. Putting in all-zero rows
> adds a bunch of 0.0 doubles to a matrix. I'm saying you don't even need or
> want the empty row keys. This is certainly not what we want in a sparse
> vector or matrix unless needed. Please rely on Dmitriy, Sebastian, or Ted
> about this, and maybe they can contradict me.
>
> 3) If I did an rbind, do you want me to overload it to take an Int and
> only touch _nrow (not even sure this is possible -- haven't looked)? Is
> this really what you want?
>
> On Jul 17, 2014, at 4:58 PM, Anand Avati <[email protected]> wrote:
>
> > And I still really doubt if just fudging nrow is "complete". For
> > example, if after fixing up nrow (either by mutating or by creating a
> > new CheckpointedDrm as I described in my previous mail), you were to do:
> >
> >   drmA = ... // somehow nrow is fudged
> >
> >   drmB = drmA + 1 // invoke OpAewScalar operator
> >
> > I don't see how this would return the correct answer in drmB.
> > mapBlock() on drmA is just not performed on those "invisible" rows, so
> > the "+ 1" is never applied to their cells.
>
> Seems like a good test. It certainly can be done correctly given my
> understanding below -- not sure if it is.
>
> First you are creating a dense matrix from a sparse one -- drmB is really
> a non-sparse matrix that is distributed. This requires that all
> non-existent rows and columns be created in the new matrix. The map would
> be over all IDs from 0 to nrow, and 1 needs to be added to each element of
> each Vector, even the non-existent ones, so you need to use the right
> vector iterator.

But this is not how things would happen internally. mapBlock() does not
even refer to nrow (at least in the Spark implementation). It just marches
on and executes the operator on the DrmRdd, which is unaware of the nrow
fudge at CheckpointedDrmSpark. What this would result in is a matrix drmB
where the first orig_nrow rows have all elements incremented by +1, and
the remaining rows (orig_nrow .. nrow) just continue to be "sparse" (0s)
-- clearly wrong.
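To make that failure mode concrete, here is a minimal sketch. It is
illustrative only: the imports are from memory (adjust to wherever the DSL
lives in your tree), it assumes an implicit DistributedContext is in scope,
and the fudge itself is hand-waved since there is no supported way to do it:

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.drm._
  import RLikeDrmOps._

  // A 3 x 2 DRM. Suppose _nrow on the CheckpointedDrmSpark is then
  // "fudged" to 5 without adding any rows to the backing RDD.
  val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)))
  // ... _nrow bumped from 3 to 5 somehow ...

  val drmB = drmA + 1 // OpAewScalar, executed as a mapBlock over the RDD

  // The physical plan only ever visits the 3 rows actually present in the
  // RDD: rows 0..2 come back incremented, while rows 3..4 are never
  // touched and remain implicit 0s instead of becoming 1s -- the wrong
  // answer.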
> There are several cases where dense matrices are created from sparse
> ones, like factorization. Assumptions about ordinality and the row or
> column IDs allow this to happen. So new dense rows and elements may be
> created assuming that key ordinality and nrow can be used to determine
> missing rows (or columns). The point would be not to force a dense
> anything unless needed, as in your case above.
>
> The question is good and I admit that my knowledge of this is not the
> best, so please refer to the experts.
>
> > I think rbind() is the safest approach here. Also, I'm not sure why you
> > feel rbind() is only for "dense" matrices. If the B matrix for the
> > rbind operator was created from drmParallelizeEmpty() (as shown in the
> > example in the commit), the DrmRdd will be holding only empty
> > RandomAccessSparseVectors and will be significantly less expensive than
> > a dense operation.
>
> I didn't say that rbind is only for dense matrices, at least that isn't
> what I meant. If it requires me to calculate missing row IDs for no
> reason, it's wrong. Can you please explain?

All you need to do is literally replace

  selfSimilarityDataset.addToRowCardinality(newRowIDsFound)

with

  selfSimilarityDataset = selfSimilarityDataset rbind drmParallelizeEmpty(newRowIDsFound)

(I would actually look at reorganizing the whole code to avoid such C-like

  if (condition) { statement; }

constructs and instead make it more functional in style.)

> It violates my understanding of sparse semantics -- don't mess with
> non-existent data in vectors or matrices unless needed (as in your
> example of adding 1 to all elements, even non-existent ones).

Sparse vs. dense are just implementation styles for efficiency (of space
and time). Logical operators must always yield the right answer whether
the operands are sparse or dense. Non-existence is only in the internal
representation, not in the logical matrix. Sparseness typically is for
optimizing away lots of 0 values. You are probably referring to "missing
elements"?

> Also I meant that we shouldn't be inflexibly bound to R, since there are
> a few sparse cases that don't fit, and this seems like one.

rbind is an R "like" operation, just like all the other R "like" operations
in the DSL. It is (I think) similar enough to indicate that we are splicing
two compatible matrices one on top of the other. I don't think it deserves
a sparse vs. dense argument -- just like cbind.

Thanks
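PS: in context, I picture the splice looking roughly like the following.
This is only a sketch: I am guessing at the surrounding types in your PR,
and if memory serves drmParallelizeEmpty also wants the column count, so I
pass ncol explicitly here:

  // Instead of mutating the checkpointed DRM's row cardinality in place,
  //   selfSimilarityDataset.addToRowCardinality(newRowIDsFound)
  // splice an all-empty sparse block underneath. The result is a new
  // (immutable) DRM with the desired nrow, whose extra rows are backed by
  // empty RandomAccessSparseVectors in the DrmRdd.
  val padded =
    if (newRowIDsFound > 0)
      selfSimilarityDataset rbind drmParallelizeEmpty(newRowIDsFound, selfSimilarityDataset.ncol)
    else
      selfSimilarityDataset

Note this is an if *expression* yielding a value, not a C-style statement
with a side effect, which is the functional flavor I was suggesting above.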
