On Mon, Jul 14, 2014 at 3:02 PM, Pat Ferrel <[email protected]> wrote:
>
> On Jul 14, 2014, at 12:10 PM, Anand Avati <[email protected]> wrote:
>
> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <[email protected]>
> wrote:
>
>> In the application, the number of rows will always be increased, adding
>> blank rows. I don’t think shuffle is necessary in this case because there
>> is no actual row, no data in the drm; it’s just needed to make the
>> cardinality match, and the IDs will take care of data matching. Maybe
>> calling it something else is a good idea to emphasize the special case
>> for its use. I went over this with Dmitriy and, though I haven’t checked
>> actual values on large datasets, it works.
>>
>
> Does that mean the cardinality is faked at the logical layer with no
> changes at the engine level? Does that mean the physical operators need to
> be prepared to handle non-matching matrix multiplication by assuming the
> missing rows or columns are 0's? Does that really work with no changes?
>
> yes, Dmitriy recently confirmed this. But not faked, it is just not
> possible to calculate it from data in some cases since “does not exist” may
> mean “= 0”.
>
> I’m no R expert but base R seems to assume a 0 value actually exists in
> the Matrix, encoded in as much space as its type dictates, like Dense
> things in Mahout. I think there are R packages that add support for sparse
> things (slam?) and so assume this is one place where some rethinking is
> required:
> http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/
>
> Sparse linear algebra with the added complication of foreign IDs makes for
> some odd cases. The number of external/foreign IDs for rows and columns
> defines the true cardinality even though in a sparse matrix the empty row
> or column is just absent.
> In Mahout the IDs are row and column numbers so
> there are cases where the real-world cardinality does not match the number
> of Mahout IDs or DRM cardinality and the calculations should be fine with
> that as long as the real-world dimensions are supplied for cardinality
> checking and various values calculated from cardinality.
>
> I’m maintaining a mapping of external ID to/from Mahout ID. For instance,
> in item similarity its use is where the rows have a key = user ID. For a
> single application the row space is defined by all user IDs. In the cross
> similarity A’B, reading in A may not ID every user, so if B finds more or
> different ones then the union of the two is our best guess at the total.
> And in fact more matrices could be added and the user ID space is all users
> seen in all of the data. If we knew how many users are defined in the
> application we could also use that but it’s not needed if there is no data
> at all for some users.
>
> BTW this mapping seems to be one of the biggest generators of questions on
> the lists. The above issue is one that would likely further trip up users
> generating their own ID mapping, which is why we are finally doing it for
> them.
>
> This sounds like a need to introduce a new R-like rbind() operator. This
> way you could fix up row cardinality like:
>
> drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
>
> true, add an empty slice.
>
> You could already do this, though twisted:
>
> drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>
> yes and I can dig deeper and do another drmWrap constructing a new larger
> matrix.
>
> Still, changing the number of rows on Sparse matrices is so much simpler
> but I think "drm.nrow = " may hide the special nature of what we are doing.
>

So finally how are you changing the cardinality? I just want to make the h2o engine "work" with that technique.
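
To make the question concrete, the idea discussed above can be sketched in plain Scala with no Mahout dependency. This is only an illustration under assumptions, not Mahout's API and not how any engine implements it: SparseRows, withNrow, tTimes and buildIndex are hypothetical names invented for the sketch. It shows a row-sparse matrix whose declared nrow is carried separately from the stored rows, a row index built from the union of the user IDs seen in A and B, and an A'B in which rows absent from either operand simply count as zeros.

```scala
// Hypothetical stand-in for a DRM: only non-empty rows are stored, while
// nrow/ncol are carried separately as the declared cardinality.
final case class SparseRows(rows: Map[Int, Map[Int, Double]], nrow: Int, ncol: Int) {
  require(rows.keys.forall(r => r >= 0 && r < nrow), "row key outside declared nrow")

  // Declare more rows without touching any data; the added rows are
  // implicitly all zero, so nothing has to be shuffled or materialized.
  def withNrow(n: Int): SparseRows = {
    require(n >= nrow, "cardinality can only grow")
    copy(nrow = n)
  }

  // A'B as a map from (column of A, column of B) to value. A row missing
  // from either operand contributes nothing, i.e. it is treated as zeros.
  def tTimes(b: SparseRows): Map[(Int, Int), Double] = {
    require(nrow == b.nrow, s"row cardinality mismatch: $nrow vs ${b.nrow}")
    val products = for {
      (r, aRow) <- rows.toSeq
      bRow      <- b.rows.get(r).toSeq
      (i, x)    <- aRow.toSeq
      (j, y)    <- bRow.toSeq
    } yield ((i, j), x * y)
    products.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}

object CardinalitySketch {
  // Assign dense row indices to external user IDs in order of first appearance.
  def buildIndex(ids: Iterable[String]): Map[String, Int] =
    ids.toSeq.distinct.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    // A only saw u1 and u2; B also saw u3 and u4.
    val index = buildIndex(Seq("u1", "u2") ++ Seq("u2", "u3", "u4"))

    val a = SparseRows(Map(0 -> Map(0 -> 1.0), 1 -> Map(1 -> 1.0)), nrow = 2, ncol = 2)
    val b = SparseRows(Map(1 -> Map(0 -> 1.0), 3 -> Map(1 -> 1.0)), nrow = 4, ncol = 2)

    // Without this step the cardinality check in tTimes fails (2 vs 4).
    val aFixed = a.withNrow(index.size)
    println(aFixed.tTimes(b))
  }
}
```

In this toy data only user u2 (row 1) appears in both matrices, so the single nonzero entry of A'B is at (1, 0) with value 1.0. If physical operators behave like tTimes here, the engine only needs the declared nrow for the cardinality check; if they instead iterate over 0 until nrow and expect every row to exist, the empty rows would have to be materialized, which is what the rbind variant does.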
