We need to back up a bit here. This involves two questions, one for core math 
and one for data prep:

1) The math question: does a CheckpointedDrm need to have a row for every 
sequential row key from 0 to nrow? Can row keys be missing from the sequence 
and B %*% C still give correct results, where C and/or B have rows with no 
representation in the underlying rdd (not even n => {}) but were created with 
the same _nrow?

2) The data prep issue depends on the answer to #1: potentially there are 
matrices A, B, C, … All come from data whose rows are IDed by external User 
IDs. The full set of these IDs defines a single row cardinality for all the 
matrices, so the total number of Mahout row keys will be the number of unique 
User IDs collected across all of them.
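
To make the data prep side concrete, here is a minimal sketch in plain Python 
(no Mahout or Spark involved). The names build_row_index and to_keyed_rows and 
the sample data are hypothetical, made up for illustration. It assigns one Int 
row key per unique external ID across all inputs and shows that the resulting 
gaps are not necessarily at the end of a matrix.

```python
def build_row_index(*datasets):
    """Map every external user ID seen in any dataset to one Int row key."""
    index = {}
    for rows in datasets:
        for user_id, _ in rows:
            if user_id not in index:
                index[user_id] = len(index)  # next unused key
    return index


def to_keyed_rows(rows, index):
    """Re-key (user_id, vector) pairs by the shared Int row key.

    IDs that a dataset never saw simply have no entry, i.e. a gap.
    """
    return {index[user_id]: vec for user_id, vec in rows}


# Hypothetical inputs: sparse vectors as {column: value} dicts.
a_raw = [("u1", {0: 1.0}), ("u3", {1: 2.0})]
b_raw = [("u2", {0: 3.0}), ("u3", {2: 4.0})]

row_index = build_row_index(a_raw, b_raw)
nrow = len(row_index)            # shared row cardinality for A and B

a = to_keyed_rows(a_raw, row_index)
b = to_keyed_rows(b_raw, row_index)

print(nrow, sorted(a), sorted(b))
```

Here A ends up with no row for key 2 and B with no row for key 0, so B's gap 
is at the start of the key sequence, not the end.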

If the answer to #1 is “yes, you must have at least n => {} for every 
sequential row key 0 through nrow”, then A, B, C, and so on will need to have 
the Int row keys inserted at all points in the matrices where no data for the 
external ID was seen. This implies reading them in as a unit. Rbind cannot do 
this after each matrix has been read in, since the row key gaps may not all be 
at the end of a matrix.

If the answer to #1 is that a nonexistent row key (a gap in the sequence) is 
exactly the same as having n => {} in the rdd, then changing only the row 
cardinality of all matrices to match the total number of IDs seen will create 
the correct result. If rbind with drmParallelizeEmpty can be used to change 
only the cardinality, then it may work.
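
For what it's worth, the math itself does not care about the difference. The 
sketch below uses plain Python dicts as row-keyed sparse matrices (the helpers 
multiply, with_empty_rows and densify are hypothetical, not Mahout API) and 
checks that a missing row key gives the same product as an explicit n => {}. 
It says nothing about whether the Mahout operators tolerate gaps in the rdd, 
which is still question #1.

```python
def multiply(b, c):
    """Row-keyed sparse product of b and c; rows are {column: value} dicts."""
    out = {}
    for i, b_row in b.items():
        acc = {}
        for k, b_ik in b_row.items():
            # A row key missing from c contributes nothing, like a zero row.
            for j, c_kj in c.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + b_ik * c_kj
        out[i] = acc
    return out


def with_empty_rows(m, nrow):
    """Insert an explicit empty row (n => {}) for every missing key."""
    return {i: m.get(i, {}) for i in range(nrow)}


def densify(m, nrow, ncol):
    """Dense nrow x ncol view; missing rows and cells read as 0.0."""
    return [[m.get(i, {}).get(j, 0.0) for j in range(ncol)]
            for i in range(nrow)]


# B is 3 x 3 with no row for key 1; C is 3 x 2 with no row for key 0.
b_gaps = {0: {0: 1.0, 2: 2.0}, 2: {1: 3.0}}
c_gaps = {1: {0: 4.0}, 2: {1: 5.0}}

with_gaps = densify(multiply(b_gaps, c_gaps), 3, 2)
filled = densify(
    multiply(with_empty_rows(b_gaps, 3), with_empty_rows(c_gaps, 3)), 3, 2)

print(with_gaps == filled)
```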

I’ll keep poking at #1 but would love a definitive answer.
