Re: Problem of dimensions

Pat Ferrel Tue, 08 Jul 2014 11:00:29 -0700

Got it, that is what I hoped was happening. I agree that the prep logic is the 
only place where dimensionality can be known. So I think I know how to 
handle this case unless Mr. Schelter has further thoughts.


On Jul 8, 2014, at 10:49 AM, Dmitriy Lyubimov <[email protected]> wrote:

An empty row still counts as one row. 

_All_ rows (or, depending on the current backing orientation, columns) are 
expected to be of size A.ncol (A.nrow). Algebraic algorithms don't additionally 
validate that; if it is not true, they may fail. These are algebraic guarantees 
imposed on matrices.

The optimizer validates these guarantees, in a kind of slack-y way, which is 
what you encounter here. It may even miss violations, but that doesn't mean the 
guarantees are not necessary; it only means that some algorithms may fail with 
wackier, meaningless messages such as OOB (out of bounds). It is generally the 
responsibility of the feature prep logic to ensure the algebraic guarantees. 
The optimizer can catch whatever can be checked inexpensively, or it can even 
miss it, but it can never repair it.
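
As a rough illustration of the kind of check the prep logic could run itself 
before building B'A, here is a minimal sketch in plain Scala. `Geometry` and 
`checkBtA` are hypothetical names made up for this example; they are not Mahout 
types or functions.

```scala
// Hypothetical stand-in for a matrix's declared shape; not a Mahout type.
final case class Geometry(nrow: Long, ncol: Int)

// B'A is defined only when B and A declare the same number of rows.
// Returns the geometry of the product, or a readable error message.
def checkBtA(b: Geometry, a: Geometry): Either[String, Geometry] =
  if (b.nrow == a.nrow) Right(Geometry(b.ncol.toLong, a.ncol))
  else Left(s"B has ${b.nrow} rows but A has ${a.nrow}; B'A is undefined")
```

Failing here, in prep, gives a message that names the mismatch instead of a 
late failure deep inside an algorithm.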

For further details you need to bother Mr. Schelter.

-d


On Tue, Jul 8, 2014 at 10:29 AM, Pat Ferrel <[email protected]> wrote:
In cooccurrence, for the case of B'A, the real-world dimensionality of the 
matrices can be compatible even though the data read in from tuples would leave 
some rows or columns blank (no non-zero elements). At least this is what I 
suspect after trying to run cooccurrence on the epinions data 
(ratings_data.txt, trust_data.txt).

I get:

    Exception in thread "main" java.lang.AssertionError: assertion failed: Incompatible operand geometry
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.mahout.math.drm.logical.OpAB.<init>(OpAB.scala:29)
        ...

from
    val drmBtA = drmB.t %*% drmA

Regardless of what is causing this problem there _will_ be cases where the 
auto-calculated dimensions of A and B (calculated from the non-blank rows when 
the DRM is read in from a text file) are not compatible but the data actually 
is. This is the case where the union of all userIDs is greater than the number 
of user IDs in one or both of the DRMs.
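
A toy example of how this mismatch can arise, in plain Scala. The data below is 
invented, and `inferredRows` is a hypothetical helper that assumes the row count 
is taken as the highest row index seen plus one; the actual reader may infer it 
differently.

```scala
// (userIndex, itemIndex) pairs; invented toy data, not from the epinions files.
val ratings = Seq((0, 0), (1, 2), (4, 1)) // users 0, 1 and 4 appear
val trust   = Seq((0, 1), (2, 0))         // users 0 and 2 appear

// Assumed inference rule: highest row index seen, plus one.
def inferredRows(tuples: Seq[(Int, Int)]): Int =
  if (tuples.isEmpty) 0 else tuples.map(_._1).max + 1

val nrowA      = inferredRows(ratings)          // inferred from ratings alone
val nrowB      = inferredRows(trust)            // inferred from trust alone
val nrowShared = inferredRows(ratings ++ trust) // inferred over all users
```

Here `nrowA` is 5 and `nrowB` is 3, so B'A looks incompatible even though both 
datasets describe the same population of 5 users.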

To do this correctly for all cases, the row IDs for all unique row keys would 
have to be created across all DRMs for cooccurrence. This implies using a 
single Map for the row space of all DRMs read in, with a single incrementing 
integer for the DRM row key. The length of this Map would be the row dimension 
for all DRMs. After the row dimension is calculated the Map could be thrown 
away, since only columns (input items) need to have application-specific IDs 
applied at output.
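
A minimal sketch of the single shared Map, in plain Scala with no Mahout 
dependency. The input pairs and the names `rawA`, `rawB`, `rowIds` and 
`rowIndex` are hypothetical, chosen only for illustration.

```scala
import scala.collection.mutable

// Hypothetical raw input: (userId, itemId) string pairs for two datasets.
val rawA = Seq(("u1", "i1"), ("u2", "i2"), ("u5", "i1"))
val rawB = Seq(("u1", "j1"), ("u3", "j2"))

// One dictionary shared by every dataset: external user ID -> row index.
val rowIds = mutable.LinkedHashMap.empty[String, Int]

// Assign the next free integer the first time a user ID is seen.
def rowIndex(userId: String): Int =
  rowIds.getOrElseUpdate(userId, rowIds.size)

// Translate both datasets through the same dictionary.
val rowsA = rawA.map { case (u, i) => (rowIndex(u), i) }
val rowsB = rawB.map { case (u, j) => (rowIndex(u), j) }

// Row dimension to declare for every matrix built from these tuples.
val nrow = rowIds.size
```

With this input the dictionary ends up with 4 entries, so both matrices would 
be declared with 4 rows even though each dataset mentions fewer users.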

Does this sound like the right way to handle this case? Will drmB.t %*% drmA 
do the right thing for non-existent rows/columns, which I think means treating 
a non-existent vector as if it were all 0s? I believe this worked in the 
Hadoop version.
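
To make "treat a non-existent vector as all 0s" concrete, here is a sketch of 
the arithmetic only, in plain Scala. It says nothing about what the `%*%` 
operator actually does; `Sparse` and `bta` are hypothetical names.

```scala
// Sparse matrix as rowIndex -> (columnIndex -> value); absent rows are all zeros.
type Sparse = Map[Int, Map[Int, Double]]

// B'A is the sum over rows r of outer(B(r), A(r)).
// A row missing from either side contributes nothing to the sum.
def bta(b: Sparse, a: Sparse): Map[(Int, Int), Double] = {
  val terms = for {
    (r, bRow) <- b.toSeq
    aRow      <- a.get(r).toSeq // skip rows that A does not have
    (i, bv)   <- bRow.toSeq
    (j, av)   <- aRow.toSeq
  } yield ((i, j), bv * av)
  // Add up the contributions that land in the same output cell.
  terms.groupBy(_._1).map { case (cell, vs) => cell -> vs.map(_._2).sum }
}
```

Only rows present in both operands produce output cells, which is the same 
result as padding the missing rows with zeros.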
