We need to back up a bit here. This involves two questions, one for core math
and one for data prep:
1) The math question: does a CheckpointedDrm need to have a row for every
sequential row key from 0 to nrow? Can there be missing row keys in the
sequence and still get correct results for B %*% C, where C and/or B have rows
with no representation in the underlying rdd (not even n => {}) but were
given the same _nrow at creation?
2) The data prep issue depends on the answer to #1: potentially there are
matrices A, B, C, … All come from data whose rows are identified by external
User IDs. The full set of these IDs defines one row cardinality for all
matrices. The total number of Mahout row keys will be the number of unique
User IDs collected.
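To make the data prep side concrete, here is a toy model in plain Python. It is
not Mahout or Spark code; the user IDs, values, and helper names
(build_id_dictionary, to_sparse_rows) are made up for illustration. It only
shows how a shared ID dictionary leaves each matrix with row key gaps that are
not necessarily at the end.

```python
# Toy model only: plain Python dicts stand in for row-keyed sparse matrices.
# All IDs, values, and function names here are hypothetical.

def build_id_dictionary(*datasets):
    """Assign a sequential Int row key to every unique external user ID."""
    dictionary = {}
    for dataset in datasets:
        for user_id in dataset:
            if user_id not in dictionary:
                dictionary[user_id] = len(dictionary)
    return dictionary

def to_sparse_rows(dataset, dictionary):
    """Re-key one dataset by Int row key; users with no data get no entry."""
    return {dictionary[user_id]: row for user_id, row in dataset.items()}

# Two inputs keyed by external user ID; each row maps column index -> value.
data_a = {"u1": {0: 1.0}, "u2": {1: 1.0}, "u4": {0: 1.0}}
data_b = {"u2": {0: 1.0}, "u3": {2: 1.0}}

ids = build_id_dictionary(data_a, data_b)
nrow = len(ids)  # shared row cardinality for both matrices

rows_a = to_sparse_rows(data_a, ids)
rows_b = to_sparse_rows(data_b, ids)

# Row keys in 0 until nrow that have no representation at all.
missing_in_a = sorted(set(range(nrow)) - set(rows_a))
missing_in_b = sorted(set(range(nrow)) - set(rows_b))

print(nrow, missing_in_a, missing_in_b)
```

In this sketch the second matrix is missing row keys in the middle of the
sequence, which is the case that appending empty rows at the end cannot fix.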
If the answer to #1 is “yes, you must have at least n => {} for every
sequential row key 0 through nrow”, then A, B, C, and so on will need to have
Int row keys inserted at every point in the matrices where no data for the
external ID was seen. This implies reading them in as a unit. Rbind cannot do
this after each matrix has been read in, since the row key gaps may not all be
at the end of a matrix.
If the answer to #1 is that a nonexistent row key (a gap in the sequence) is
exactly the same as having n => {} in the rdd, then changing only the row
cardinality of all matrices to match the total number of IDs seen will create
the correct result. If rbind with drmParallelizeEmpty can be used to change
only the cardinality, then it may work.
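For reference, this is what the "gap equals empty row" reading means
mathematically, again as a plain Python toy and not Mahout code. The multiply
function below simply assumes that an absent row key is an all-zero row, so it
says nothing about what the real operators do; the names multiply and
fill_gaps and all values are hypothetical.

```python
# Toy model only: rows are {row_key: {col_index: value}} dicts.
# ASSUMPTION baked in: an absent row key is treated as an all-zero row.

def multiply(b_rows, c_rows, nrow, ncol):
    """Dense nrow x ncol result of B %*% C from row-keyed sparse inputs."""
    result = [[0.0] * ncol for _ in range(nrow)]
    for i, b_row in b_rows.items():
        for k, b_val in b_row.items():
            # A missing row k in C contributes nothing to the sum.
            for j, c_val in c_rows.get(k, {}).items():
                result[i][j] += b_val * c_val
    return result

def fill_gaps(rows, nrow):
    """Insert an explicit empty row (n => {}) for every missing row key."""
    return {i: rows.get(i, {}) for i in range(nrow)}

# B is 3 x 3 with row 1 missing; C is 3 x 2 with row 2 missing.
b = {0: {0: 1.0, 2: 2.0}, 2: {1: 3.0}}
c = {0: {0: 1.0}, 1: {1: 4.0}}

with_gaps = multiply(b, c, nrow=3, ncol=2)
with_empty_rows = multiply(fill_gaps(b, 3), fill_gaps(c, 3), nrow=3, ncol=2)

print(with_gaps == with_empty_rows)
```

Under that assumption the two products are identical, so only the declared
cardinality matters. Whether the actual implementation behaves this way is
exactly the open question in #1.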
I’ll keep poking at #1 but would love a definitive answer.