I appreciate your help, Anand. You may know the area better than most, but my 
question is about _should_, not _does_. It is one of intent: what is our 
covenant with the gods (to: line)?

On another thread I’ll send you code that shows A + 1 works with blank rows in 
A. BTW, this implies rbind will not solve the problem; it is firmly in data 
prep. But until I know the rules I won’t know how to do the right thing.


 
On Jul 19, 2014, at 5:25 PM, Anand Avati <[email protected]> wrote:

On Sat, Jul 19, 2014 at 5:04 PM, Pat Ferrel <[email protected]> wrote:
We need to back up a bit here. This involves two questions, one for core math 
and one for data prep:

1) The math question: does a CheckpointedDrm need to have a row for every 
sequential row key from 0 to nrow? Can there be missing row keys in the 
sequence and still get correct results for B %*% C, where C and/or B have rows 
that have no representation in the underlying rdd, not even n => {}, but have 
the same _nrow passed in during creation?

2) The data prep issue depends on the answer to #1: potentially there are 
matrices A, B, C, … All come from data whose rows are IDed by external User 
IDs. The total of these IDs defines a row cardinality for all matrices. The 
total number of Mahout row keys will come from the collected number of unique 
User IDs.

If the answer to #1 is “yes, you must have at least n => {} for every sequential 
row key 0 through nrow”, then A, B, C, and so on will need to have the Int row 
keys inserted at all points in the matrices where no data for the external ID 
was seen. This implies reading them in as a unit. Rbind cannot do this after 
each matrix has been read in, since the row key gaps may not all be at the end 
of a matrix.
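
If explicit empty rows do turn out to be required, the gap-filling step might 
look roughly like the sketch below. It is plain Scala and does not use Mahout: 
`fillGaps` and the `Row` alias are hypothetical names, and rows are modeled as 
a Map from Int row key to a sparse column -> value map. It only illustrates 
that gaps can fall anywhere in 0 until nrow, not just at the end.

```scala
object FillGapsSketch {
  // A sparse row: column index -> value. An empty map models n => {}.
  type Row = Map[Int, Double]

  // Hypothetical helper: make sure every key in 0 until nrow has a row,
  // inserting an empty row wherever the input has a gap.
  def fillGaps(rows: Map[Int, Row], nrow: Int): Map[Int, Row] = {
    require(rows.keys.forall(k => k >= 0 && k < nrow), "row key out of range")
    (0 until nrow).map(k => k -> rows.getOrElse(k, Map.empty[Int, Double])).toMap
  }

  def main(args: Array[String]): Unit = {
    // Keys 1 and 3 were never seen for this matrix, though nrow is 5.
    val a: Map[Int, Row] =
      Map(0 -> Map(2 -> 1.0), 2 -> Map(0 -> 1.0), 4 -> Map(1 -> 1.0))
    val filled = fillGaps(a, 5)
    println(filled.keys.toSeq.sorted)
  }
}
```

Because the helper needs the full nrow up front, every matrix sharing the ID 
space has to be prepared against the same total count of external IDs.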

If the answer to #1 is that a non-existent row key (a gap in the sequence) is 
exactly the same as having n => {} in the rdd, then changing only the row 
cardinality of all matrices to match the total number of IDs seen will create 
the correct result. If rbind with drmParallelizeEmpty can be used to change 
only the cardinality, then it may work.

I’ll keep poking at #1 but would love a definitive answer.

The answer _has_ to be "yes". There cannot be missing row keys for Int-keyed 
DRMs. The proof for my claim is in my previous mail, that

val drmB = drmA + 1

will give an incorrect result (at least on the Spark backend) if there are such 
missing rows.
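
The concern can be illustrated without Mahout or Spark. The sketch below is a 
plain-Scala model, not the Mahout API: `SparseRows`, `plusScalar`, and 
`toDense` are hypothetical names, and it assumes an elementwise operation is 
applied only to rows that physically exist in the backing collection. Whether 
a given backend actually behaves this way is exactly the open question here.

```scala
object MissingRowsSketch {
  // Hypothetical stand-in for an Int-keyed distributed matrix: only the rows
  // present in `rows` are physically stored; nrow and ncol are declared apart.
  final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Vector[Double]]) {
    // Assumption: the operation maps over stored rows only, the way a
    // per-row transformation over an RDD would.
    def plusScalar(s: Double): SparseRows =
      copy(rows = rows.map { case (k, v) => k -> v.map(_ + s) })

    // Materialize every key in 0 until nrow; keys with no stored row read as zeros.
    def toDense: Vector[Vector[Double]] =
      Vector.tabulate(nrow)(k => rows.getOrElse(k, Vector.fill(ncol)(0.0)))
  }

  def main(args: Array[String]): Unit = {
    // Row key 1 is missing entirely in withGap, stored as all zeros in withEmptyRow.
    val withGap = SparseRows(3, 2, Map(0 -> Vector(1.0, 2.0), 2 -> Vector(3.0, 4.0)))
    val withEmptyRow = withGap.copy(rows = withGap.rows + (1 -> Vector(0.0, 0.0)))
    println(withGap.plusScalar(1.0).toDense(1))
    println(withEmptyRow.plusScalar(1.0).toDense(1))
  }
}
```

Under that assumption, row 1 stays all zeros when its key is missing but 
becomes all ones when it is stored as an empty row, so the two representations 
are not interchangeable for `+ 1`.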

Thanks

