In cooccurrence, for the B'A case, the real-world dimensionality of the
matrices can be compatible even though the data read in from tuples leaves
some rows or columns blank (no non-zero elements). At least this is what I
suspect is happening when I run cooccurrence on the epinions data
(ratings_data.txt, trust_data.txt).
I get:
Exception in thread "main" java.lang.AssertionError: assertion failed:
Incompatible operand geometry
at scala.Predef$.assert(Predef.scala:179)
at org.apache.mahout.math.drm.logical.OpAB.<init>(OpAB.scala:29)
...
from
val drmBtA = drmB.t %*% drmA
Regardless of what is causing this problem there _will_ be cases where the
auto-calculated dimensions of A and B (calculated from the non-blank rows when
the DRM is read in from a text file) are not compatible but the data actually
is. This happens whenever the union of all user IDs is larger than the set of
user IDs present in one or both of the DRMs.
To do this correctly for all cases, the row IDs for all unique row keys would
have to be created across all DRMs used in cooccurrence. This implies using a
single Map for the row space of all DRMs read in, with a single incrementing
integer as the DRM row key. The size of this Map would be the row dimension
for all DRMs. After the row dimension is calculated the Map could be thrown
away, since only columns (input items) need to have application-specific IDs
applied at output.
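
A minimal sketch of the shared row Map idea, in plain Scala with no Mahout calls. The object name, the helper names, and the sample tuples are all made up for illustration; this is not how the reader is currently implemented.

```scala
// Hypothetical illustration only; none of these names exist in Mahout.
object SharedRowIndexExample {
  // (userID, itemID) tuples as they might be read from a text file.
  type Tuples = Seq[(String, String)]

  // Assign one incrementing Int row key per distinct user ID seen in ANY input.
  def buildRowIndex(inputs: Seq[Tuples]): Map[String, Int] =
    inputs.flatten.map(_._1).distinct.zipWithIndex.toMap

  // Row count a reader would infer from one input on its own.
  def distinctUsers(t: Tuples): Int = t.map(_._1).distinct.size

  // Made-up sample data: u1 and u3 appear only in A, u4 only in B.
  val tuplesA: Tuples = Seq(("u1", "i1"), ("u2", "i2"), ("u3", "i1"))
  val tuplesB: Tuples = Seq(("u2", "j1"), ("u4", "j2"))

  val rowIndex: Map[String, Int] = buildRowIndex(Seq(tuplesA, tuplesB))

  // Common row dimension to use for both A and B.
  val nrow: Int = rowIndex.size
}
```

With this sample data the per-file row counts differ (3 for A, 2 for B), while the shared Map yields a single row dimension of 4 that both DRMs could be given.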
Does this sound like the right way to handle this case? Will drmB.t %*% drmA
do the right thing for non-existent rows/columns, which I think means treating
a non-existent vector as if it were all 0s? I believe this worked in the
Hadoop version.