I appreciate your help, Anand. You may know the area better than most, but my question is about _should_, not _does_. It is one of intent: what is our covenant with the gods (to: line)?
On another thread I’ll send you code that shows A + 1 works with blank rows in A. BTW, this implies rbind will not solve the problem; it is firmly in data prep. But until I know the rules I won’t know how to do the right thing.

On Jul 19, 2014, at 5:25 PM, Anand Avati <[email protected]> wrote:

On Sat, Jul 19, 2014 at 5:04 PM, Pat Ferrel <[email protected]> wrote:

We need to back up a bit here. This involves two questions, one for core math and one for data prep:

1) The math question: does a CheckpointedDrm need to have a row for every sequential row key from 0 to nrow? Can there be missing row keys in the sequence and still get correct results for B %*% C, where C and/or B have rows that have no representation in the underlying rdd, not even n => {}, but have the same _nrow passed in during creation?

2) The data prep issue depends on the answer to #1: potentially there are matrices A, B, C, … All come from data whose rows are IDed by external User IDs. The total of these IDs defines a row cardinality for all matrices. The total number of Mahout row keys will come from the collected number of unique User IDs.

If the answer to #1 is “yes, you must have at least n => {} for every sequential row key 0 through nrow”, then A, B, C, and so on will need to have the Int row keys inserted at all points in the matrices where no data for the external ID was seen. This implies reading them in as a unit. Rbind cannot do this after each matrix has been read in, since the row key gaps may not all be at the end of a matrix.

If the answer to #1 is that a nonexistent row key (a gap in the sequence) is exactly the same as having n => {} in the rdd, then changing only the row cardinality of all matrices to match the total number of IDs seen will create the correct result. If rbind with drmParallelizeEmpty can be used to only change the cardinality, then it may work.

I’ll keep poking at #1 but would love a definitive answer.

The answer _has_ to be "yes".
There cannot be missing row keys for Int-keyed DRMs. The proof for my claim is in my previous mail: val drmB = drmA + 1 will give an incorrect result (at least on the Spark backend) if there are such missing rows. Thanks
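
To make the argument about drmA + 1 concrete, here is a toy sketch in plain Scala collections. It does not use Mahout or Spark, and it is not a claim about how either backend is implemented. The names MissingRowsSketch, Rows, plusOne, fillGaps and toDense are all hypothetical. It only assumes that an elementwise operator is applied to the rows that are actually stored, so a row key with no entry is never visited, and that a reader treats an absent row as zeros.

```scala
object MissingRowsSketch {
  // Toy stand-in for an Int-keyed row matrix: only stored rows appear in the map.
  // (Hypothetical model, not Mahout's CheckpointedDrm.)
  type Rows = Map[Int, Vector[Double]]

  // Elementwise "+ 1" applied row by row; keys absent from the map are never visited.
  def plusOne(rows: Rows): Rows =
    rows.map { case (key, row) => key -> row.map(_ + 1.0) }

  // Data-prep step: add an all-zero row for every key in 0 until nrow that has no entry.
  def fillGaps(rows: Rows, nrow: Int, ncol: Int): Rows =
    (0 until nrow).map(k => k -> rows.getOrElse(k, Vector.fill(ncol)(0.0))).toMap

  // Materialize as a dense nrow x ncol matrix, reading absent rows as zeros.
  def toDense(rows: Rows, nrow: Int, ncol: Int): Vector[Vector[Double]] =
    (0 until nrow).map(k => rows.getOrElse(k, Vector.fill(ncol)(0.0))).toVector

  def main(args: Array[String]): Unit = {
    val nrow = 3
    val ncol = 2
    // Row key 1 has no entry at all, not even an empty row.
    val a: Rows = Map(0 -> Vector(1.0, 2.0), 2 -> Vector(5.0, 6.0))

    val withGap = toDense(plusOne(a), nrow, ncol)
    val filled = toDense(plusOne(fillGaps(a, nrow, ncol)), nrow, ncol)

    println(withGap(1)) // Vector(0.0, 0.0): the missing row never received the + 1
    println(filled(1))  // Vector(1.0, 1.0): the expected result once the gap is filled
  }
}
```

In this model the gap has to be filled during data prep, before any elementwise operation runs. Changing only the declared row count would not help, because the missing key still has no row to operate on.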
