Got it, that is what I hoped was happening. I agree the prep logic is the only place where dimensionality can be known. So I think I know how to handle this case unless Mr. Schelter has further thoughts.

On Jul 8, 2014, at 10:49 AM, Dmitriy Lyubimov <[email protected]> wrote:

An empty row still counts as one row. _All_ rows (or, depending on the current backing orientation, columns) are expected to be of A.ncol (A.nrow) size. Algebraic algorithms don't additionally validate that; if it is not true, they may fail.

These are algebraic guarantees imposed on matrices. The optimizer validates these guarantees, in a kind of slack-y way, which is what you encounter here. It may even miss a violation, but that doesn't mean the guarantees are not necessary. It only means that some algorithms may fail with more wacky, meaningless messages such as OOB.

It is, generally, the responsibility of feature prep logic to ensure algebraic guarantees. The optimizer can catch something that can be checked inexpensively, or it can even miss it, but it can never repair it.

For further details you need to bother Mr. Schelter.

-d

On Tue, Jul 8, 2014 at 10:29 AM, Pat Ferrel <[email protected]> wrote:

In cooccurrence for the case of B'A, the real-world dimensionality of the matrices can be compatible even though the data read in from tuples would leave some rows or columns blank (no non-zero elements). At least this is what I suspect. Trying to run cooccurrence on the epinions data (ratings_data.txt, trust_data.txt) I get:

    Exception in thread "main" java.lang.AssertionError: assertion failed: Incompatible operand geometry
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.mahout.math.drm.logical.OpAB.<init>(OpAB.scala:29)
        ...

from

    val drmBtA = drmB.t %*% drmA

Regardless of what is causing this problem, there _will_ be cases where the auto-calculated dimensions of A and B (calculated from the non-blank rows when the DRM is read in from a text file) are not compatible but the data actually is. This is the case where the union of all user IDs is greater than the number of user IDs in one or both of the DRMs.

To do this correctly for all cases, the row IDs for all unique row keys would have to be created across all DRMs for cooccurrence. This implies using a single Map for the row space of all DRMs read in, with a single incrementing integer for the DRM row key. The length of this Map would be the row dimension for all DRMs. After the row dimension is calculated, the Map could be thrown away, since only columns (input items) need to have application-specific IDs applied at output.

Does this sound like the right way to handle this case? Will drmB.t %*% drmA do the right thing for non-existent rows/columns, which I think means treating a non-existent vector as if it were all 0s? I believe this worked in the Hadoop version.
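
A minimal sketch of the shared row dictionary idea, in plain Scala with no Mahout calls. `SharedRowSpace`, `buildRowIndex`, `localRowCount`, and the (rowKey, columnKey) tuple layout are hypothetical names for illustration, not Mahout API. It only shows that a row dimension taken from one Map built over all inputs covers the union of user IDs, while a dimension computed from each input alone can disagree between A and B.

```scala
import scala.collection.mutable

// Hypothetical helper for illustration only; nothing here is Mahout API.
object SharedRowSpace {

  // Assign every distinct row key (e.g. user ID) seen in ANY input a dense
  // Int index, in first-seen order. The size of the result is the row
  // dimension that all DRMs built from these inputs would share.
  def buildRowIndex(inputs: Seq[Seq[(String, String)]]): Map[String, Int] = {
    val index = mutable.LinkedHashMap.empty[String, Int]
    for (tuples <- inputs; (rowKey, _) <- tuples)
      index.getOrElseUpdate(rowKey, index.size)
    index.toMap
  }

  // Count the distinct row keys of a single input, i.e. the row dimension
  // one would get by looking at that input alone.
  def localRowCount(tuples: Seq[(String, String)]): Int =
    tuples.map(_._1).distinct.size

  def main(args: Array[String]): Unit = {
    // Toy (user, item) tuples: u4 never appears in A; u1 and u3 never appear in B.
    val a = Seq(("u1", "i1"), ("u2", "i1"), ("u3", "i2"))
    val b = Seq(("u2", "j1"), ("u4", "j2"))

    val rowIndex = buildRowIndex(Seq(a, b))
    println(s"rows in A alone: ${localRowCount(a)}, in B alone: ${localRowCount(b)}")
    println(s"shared row dimension: ${rowIndex.size}")
  }
}
```

How the shared dimension is then handed to each DRM depends on the reader code and is left open here. Once both DRMs are created, the Map can be discarded as described above.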
