> On Jul 16, 2014, at 8:34 AM, Anand Avati <[email protected]> wrote:
>
> On Wed, Jul 16, 2014 at 7:53 AM, Pat Ferrel <[email protected]> wrote:
>> There IS no issue with nrow being a lazy val. I never touch it; read below.
>
> The value itself may not be immutable. But it sounds like the same matrix
> would return different values for nrow() depending on when you called it.
> That sounds very much like a problem if the same matrix is part of two
> separate Spark graph hierarchies, where each is making a different assumption
> about its cardinality.
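The lazy-val concern quoted above can be illustrated with a minimal sketch. This is plain Scala, not Mahout code, and `ToyDrm` is a hypothetical stand-in: a lazily computed row count is frozen at first access, so two consumers reading it at different times can see different "cardinalities" for what is logically the same matrix.

```scala
// Sketch (hypothetical ToyDrm class, not the real CheckpointedDrmSpark):
// a lazy val is computed once, on first access, then cached forever.
class ToyDrm(var rows: Vector[Vector[Double]]) {
  lazy val nrow: Int = rows.length
}

object LazyValDemo extends App {
  val m = new ToyDrm(Vector(Vector(1.0), Vector(2.0)))

  // Mutate the backing data BEFORE nrow is ever read: nrow sees the new size.
  m.rows = m.rows :+ Vector(3.0)
  assert(m.nrow == 3)

  // Mutate AFTER nrow has been read: the cached value is now stale.
  m.rows = m.rows :+ Vector(4.0)
  assert(m.nrow == 3)        // still the cached 3, not 4
  assert(m.rows.length == 4) // the real row count has moved on
}
```

This is exactly the hazard Avati describes: whichever consumer touches nrow first fixes its value for everyone.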
My understanding of a CheckpointedDrmSpark is to assure that the rdd is checkpointed, so there is no optimizer graph at that point. Yes, it would produce different results, but so would the matrix pre- and post-rbind. There are several matrix reads that create one dictionary for all rows in all matrices. The length of this dictionary is the row cardinality for all matrices. Each CheckpointedDrmSpark has the rows with Int IDs/keys for the data it knows about, and so on. The cardinality is changed before any operation is performed.

> creating a new matrix val is fine if it doesn’t cause a new rdd to be created

I’ll look into that.

>> rbind as I read it requires me to construct the rows to be added. I don’t
>> know what their keys are and don’t want to calculate them. If I’m right about
>> how the math works the actual rows are not needed.
>
> See the example code to add empty rows. Numerical keys are auto computed
> based on the sizes.

I’m pretty sure that’s impossible for this case. The only way to calculate the true cardinality is to read in all data for all matrices. The only way to know which IDs are _missing_ from a single DRM is to read them all in. The internal Int IDs are different for each matrix, and there is no telling which is in which. I’d have to keep the unified row dictionary and some collection of keys for each matrix, then subtract the smaller from the unified to get the missing keys. Only then could I create empty rows with rbind using the missing keys.

The users you see for a specific interaction recorded in the several matrices may bear no relationship to any of the others; in fact, that’s what cooccurrence may tell you. Any operation performed on the drms before these hypothetical rbinds are performed will be wonky anyway, so I miss the point about that. If my solution works (forcing cardinality without actually adding empty rows) it is far, far simpler than keeping these extra collections and calculating the IDs for rbind. If I’m wrong, what’s described above should work.
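The bookkeeping described above can be sketched in a few lines. This is plain Scala collections, not the actual Mahout/Spark API, and all names (`unified`, `drmKeys`) are hypothetical: a unified row dictionary maps external user IDs to Int keys, each matrix knows only the keys it has rows for, and the missing keys for any one matrix are a set difference against the unified dictionary.

```scala
// Sketch of the two alternatives discussed (all names hypothetical).
object MissingKeysSketch extends App {
  // Unified dictionary built by reading all matrices: external ID -> Int key.
  val unified: Map[String, Int] = Map("u1" -> 0, "u2" -> 1, "u3" -> 2, "u4" -> 3)

  // Int keys actually present in one DRM (it saw only u1 and u3).
  val drmKeys: Set[Int] = Set(0, 2)

  // The rbind route: these keys would each need an empty row appended.
  val missing: Set[Int] = unified.values.toSet -- drmKeys
  assert(missing == Set(1, 3))

  // The shortcut route: the forced cardinality is just the unified
  // dictionary's size; no empty rows are ever materialized.
  val nrow: Int = unified.size
  assert(nrow == 4)
}
```

The cost difference is clear from the sketch: the rbind route requires keeping `drmKeys` per matrix and computing the difference, while the shortcut needs only `unified.size`.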
This is for one rather unique pipeline, so my shortcut may not be safe for any other use; that is my worry. We’ll find out.
