Re: Problem of dimensions

Pat Ferrel Mon, 14 Jul 2014 15:03:07 -0700

> 
> On Jul 14, 2014, at 12:10 PM, Anand Avati <[email protected]> wrote:
> 
> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <[email protected]> wrote:
> In the application, the number of rows will always be increased, adding blank 
> rows. I don’t think a shuffle is necessary in this case because there is no 
> actual row, no data in the DRM; it’s just needed to make the cardinality 
> match, and the IDs will take care of data matching. Maybe calling it something 
> else is a good idea to emphasize the special case for its use. I went over 
> this with Dmitriy and, though I haven’t checked actual values on large 
> datasets, it works. 
> 
> 
> Does that mean the cardinality is faked at the logical layer with no changes 
> at the engine level? Does that mean the physical operators need to be 
> prepared to handle non-matching matrix multiplication by assuming the missing 
> rows or columns are 0's? Does that really work with no changes?


Yes, Dmitriy recently confirmed this. But it is not faked: in some cases the 
cardinality simply cannot be calculated from the data, since “does not exist” 
may mean “= 0”.

I’m no R expert, but base R seems to assume a 0 value actually exists in the 
matrix, encoded in as much space as its type dictates, like dense things in 
Mahout. I think there are R packages that add support for sparse things (slam?), 
so I assume this is one place where some rethinking is required: 
http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/

Sparse linear algebra with the added complication of foreign IDs makes for some 
odd cases. The number of external/foreign IDs for rows and columns defines the 
true cardinality, even though in a sparse matrix an empty row or column is just 
absent. In Mahout the IDs are row and column numbers, so there are cases where 
the real-world cardinality does not match the number of Mahout IDs or the DRM 
cardinality. The calculations should be fine with that as long as the 
real-world dimensions are supplied for cardinality checking and for the various 
values calculated from cardinality.
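
A minimal sketch of the point above, in plain Scala rather than the Mahout DSL. 
The names `SparseRows` and `atb` are hypothetical and only illustrate the idea: 
the row cardinality is declared by the caller, absent rows act as zeros, and 
the declared value is what gets checked before computing A’B.

```scala
// Hypothetical sketch, not Mahout code: a row-sparse matrix whose row
// cardinality is declared rather than derived from the rows that are stored.
object SparseCardinalitySketch {

  // rows: row index -> (column index -> value). An absent row means all zeros.
  final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]]) {
    require(rows.keys.forall(r => r >= 0 && r < nrow), "row index out of range")
    require(rows.values.forall(_.keys.forall(c => c >= 0 && c < ncol)),
      "column index out of range")
  }

  // Computes A'B. Only rows stored in both A and B contribute; a row missing
  // from either side behaves exactly like a row of zeros.
  def atb(a: SparseRows, b: SparseRows): SparseRows = {
    // The check uses the declared cardinality, not the number of stored rows.
    require(a.nrow == b.nrow, s"row cardinality mismatch: ${a.nrow} vs ${b.nrow}")
    val acc = scala.collection.mutable.Map.empty[Int, Map[Int, Double]]
    for {
      (r, aRow) <- a.rows
      bRow <- b.rows.get(r)
      (i, av) <- aRow
    } {
      // Accumulate av * bv into result row i for every stored bv in bRow.
      val updated = bRow.foldLeft(acc.getOrElse(i, Map.empty[Int, Double])) {
        case (row, (j, bv)) => row.updated(j, row.getOrElse(j, 0.0) + av * bv)
      }
      acc(i) = updated
    }
    SparseRows(a.ncol, b.ncol, acc.toMap)
  }
}
```

Here A can store data for only some of the users B knows about. As long as 
both are declared with the same `nrow`, the product goes through and the 
missing rows contribute nothing.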

I’m maintaining a mapping of external ID to/from Mahout ID. For instance, it is 
used in item similarity, where the rows have a key = user ID. For a single 
application the row space is defined by all user IDs. In the cross similarity 
A’B, reading in A may not ID every user, so if B finds more or different ones 
then the union of the two is our best guess at the total. In fact more matrices 
could be added, and the user ID space is all users seen in all of the data. If 
we knew how many users are defined in the application we could also use that, 
but it’s not needed if there is no data at all for some users.
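
A rough sketch of such a mapping in plain Scala. `IdDictionary` is a 
hypothetical name, not the actual Mahout class. It assigns consecutive integer 
keys in first-seen order, so feeding it the user IDs found in A and then in B 
yields the union, and its size is the row cardinality to declare for both.

```scala
// Hypothetical sketch, not Mahout code: a two-way dictionary between external
// string IDs and the integer row keys used inside the matrices.
object IdDictionarySketch {

  final case class IdDictionary(toIndex: Map[String, Int] = Map.empty) {

    // Number of distinct external IDs seen so far; doubles as the cardinality.
    def size: Int = toIndex.size

    // Adds any IDs not seen before, giving each the next free integer key.
    def add(ids: Iterable[String]): IdDictionary =
      ids.foldLeft(this) { (d, id) =>
        if (d.toIndex.contains(id)) d
        else IdDictionary(d.toIndex + (id -> d.size))
      }

    // Reverse lookup, for writing results back out with external IDs.
    def toExternal: Map[Int, String] = toIndex.map(_.swap)
  }
}
```

For example, adding the users seen in A (`u1`, `u2`) and then those seen in B 
(`u2`, `u3`, `u4`) gives a dictionary of four users, so both matrices would be 
declared with four rows even though neither holds data for all of them.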

BTW this mapping seems to be one of the biggest generators of questions on the 
lists. The above issue is one that would likely further trip up users 
generating their own ID mapping, which is why we are finally doing it for them.

> 
> This sounds like a need to introduce a new R-like rbind() operator. This way 
> you could fix up row cardinality like:
> 
>  drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
> 

true, add an empty slice.

> You could already do this, though twisted:
> 
>  drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>  
> 

yes and I can dig deeper and do another drmWrap constructing a new larger 
matrix. 

Still, changing the number of rows on sparse matrices is so much simpler, but I 
think “drm.nrow =” may hide the special nature of what we are doing.
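
To illustrate the naming concern, here is a small sketch in plain Scala with 
hypothetical names (`SparseRows`, `rbind`, `appendEmptyRows`), not the Mahout 
API. In a row-sparse representation, stacking an empty block under a matrix 
and simply declaring a larger row count give the same result, and an 
explicitly named operation makes the special case visible at the call site.

```scala
// Hypothetical sketch, not Mahout code: two ways to grow the row cardinality
// of a row-sparse matrix without touching any stored data.
object RowCardinalitySketch {

  // rows: row index -> (column index -> value). An absent row means all zeros.
  final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]])

  // An all-zero matrix stores no rows at all.
  def empty(nrow: Int, ncol: Int): SparseRows = SparseRows(nrow, ncol, Map.empty)

  // rbind: stack b under a, shifting b's row indices by a.nrow.
  def rbind(a: SparseRows, b: SparseRows): SparseRows = {
    require(a.ncol == b.ncol, s"column cardinality mismatch: ${a.ncol} vs ${b.ncol}")
    val shifted = b.rows.map { case (r, row) => (r + a.nrow, row) }
    SparseRows(a.nrow + b.nrow, a.ncol, a.rows ++ shifted)
  }

  // Explicitly named alternative to a bare `nrow =` assignment.
  def appendEmptyRows(a: SparseRows, extraRows: Int): SparseRows = {
    require(extraRows >= 0, "extraRows must be non-negative")
    a.copy(nrow = a.nrow + extraRows)
  }
}
```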
