> On Jul 14, 2014, at 12:10 PM, Anand Avati <[email protected]> wrote:
>
> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <[email protected]> wrote:
> In the application, the number of rows will always be increased, adding blank
> rows. I don’t think shuffle is necessary in this case because there is no
> actual row, no data in the drm; it’s just needed to make the cardinality
> match, and the IDs will take care of data matching. Maybe calling it something
> else is a good idea to emphasize the special case for its use. I went over
> this with Dmitriy and, though I haven’t checked actual values on large
> datasets, it works.
>
> Does that mean the cardinality is faked at the logical layer with no changes
> at the engine level? Does that mean the physical operators need to be
> prepared to handle non-matching matrix multiplication by assuming the missing
> rows or columns are 0's? Does that really work with no changes?

Yes, Dmitriy recently confirmed this. But it is not faked; in some cases the cardinality simply cannot be calculated from the data, since “does not exist” may mean “= 0”. I’m no R expert, but base R seems to assume a 0 value actually exists in the matrix, encoded in as much space as its type dictates, like dense things in Mahout. I think there are R packages that add support for sparse things (slam?), so I assume this is one place where some rethinking is required: http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/

Sparse linear algebra with the added complication of foreign IDs makes for some odd cases. The number of external/foreign IDs for rows and columns defines the true cardinality, even though in a sparse matrix an empty row or column is just absent. In Mahout the IDs are row and column numbers, so there are cases where the real-world cardinality does not match the number of Mahout IDs or the DRM cardinality. The calculations should be fine with that as long as the real-world dimensions are supplied for cardinality checking and for the various values calculated from cardinality.

I’m maintaining a mapping of external ID to/from Mahout ID. For instance, in item similarity the rows have key = user ID. For a single application the row space is defined by all user IDs. In the cross-similarity A’B, reading in A may not ID every user, so if B finds more or different ones then the union of the two is our best guess at the total. In fact more matrices could be added, and the user ID space is then all users seen in all of the data. If we knew how many users are defined in the application we could also use that, but it’s not needed if there is no data at all for some users.

BTW this mapping seems to be one of the biggest generators of questions on the lists. The above issue is one that would likely further trip up users generating their own ID mapping, which is why we are finally doing it for them.
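
To make the union-of-IDs point concrete, here is a rough sketch in plain Scala with no Mahout dependency. IdDictionary, the sample user and item names, and everything else in it are hypothetical and only illustrate the idea; this is not the actual mapping code. One dictionary is shared by both inputs, so the row cardinality for A and B is the number of distinct users seen in either.

```scala
import scala.collection.mutable

object IdMappingSketch {
  // Hypothetical helper, not a Mahout class: hands out dense row indices
  // (0, 1, 2, ...) to external IDs in the order they are first seen.
  final class IdDictionary {
    private val toIndex = mutable.HashMap.empty[String, Int]
    private val toId = mutable.ArrayBuffer.empty[String]

    def indexOf(externalId: String): Int =
      toIndex.getOrElseUpdate(externalId, {
        toId += externalId
        toId.size - 1
      })

    def externalId(index: Int): String = toId(index)
    def size: Int = toId.size
  }

  // Made-up (user, item) interactions for the two inputs.
  val a = Seq("u1" -> "iphone", "u2" -> "ipad")
  val b = Seq("u2" -> "nexus", "u3" -> "galaxy")

  // One dictionary shared by A and B, so the same user gets the same row.
  val users = new IdDictionary
  val rowsA = a.map { case (user, _) => users.indexOf(user) }.distinct
  val rowsB = b.map { case (user, _) => users.indexOf(user) }.distinct

  // Best guess at the true row cardinality: every user seen in any input.
  val rowCardinality = users.size

  def main(args: Array[String]): Unit =
    println(s"A rows: $rowsA, B rows: $rowsB, cardinality: $rowCardinality")
}
```

Here A on its own only knows about two users and B about two, but both matrices should be treated as having three rows.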

> This sounds like a need to introduce a new R-like rbind() operator. This way
> you could fix up row cardinality like:
>
> drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
>

True, add an empty slice.

> You could already do this, though twisted:
>
> drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>

Yes, and I can dig deeper and do another drmWrap, constructing a new larger matrix. Still, changing the number of rows on sparse matrices is so much simpler, but I think “drm.nrow =” may hide the special nature of what we are doing.
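
A toy illustration of why only the declared row count has to change, again in plain Scala and not using the Mahout API. SparseRows, withNrow and transposeTimes are hypothetical names made up for this sketch. Only non-empty rows are stored, so declaring more rows leaves the data untouched, and A’B only ever sums over rows present on both sides; absent rows behave as zeros.

```scala
object PaddingSketch {
  // Hypothetical stand-in for a sparse row-keyed matrix: only non-empty
  // rows are stored, nrow/ncol are declared separately from the data.
  final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]]) {
    require(rows.keys.forall(r => r >= 0 && r < nrow), "row key outside declared nrow")

    // Declare a different row count without touching the stored rows.
    def withNrow(n: Int): SparseRows = copy(nrow = n)
  }

  // A'B as a sparse map of (column of A, column of B) -> value.
  // Rows missing from either side contribute nothing, i.e. they act as zeros.
  def transposeTimes(a: SparseRows, b: SparseRows): Map[(Int, Int), Double] = {
    require(a.nrow == b.nrow, s"row cardinality mismatch: ${a.nrow} vs ${b.nrow}")
    val products = for {
      (r, aRow) <- a.rows.toSeq
      bRow <- b.rows.get(r).toSeq
      (i, x) <- aRow.toSeq
      (j, y) <- bRow.toSeq
    } yield ((i, j), x * y)
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }

  // A saw two users, B saw three; row keys are the shared user indices.
  val a = SparseRows(2, 2, Map(0 -> Map(0 -> 1.0), 1 -> Map(1 -> 1.0)))
  val b = SparseRows(3, 2, Map(0 -> Map(0 -> 1.0), 1 -> Map(0 -> 1.0), 2 -> Map(1 -> 1.0)))

  // The cardinality check fails for (a, b) as read, and passes once A
  // declares the same number of rows as B. No data is moved or added.
  val padded = a.withNrow(b.nrow)
  val atb = transposeTimes(padded, b)

  def main(args: Array[String]): Unit = println(atb)
}
```

The rbind and cbind variants quoted above would arrive at the same declared shape by appending an empty block; the sketch only shows that the stored rows and the product are the same either way in this toy model.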
