On Mon, Jul 14, 2014 at 3:02 PM, Pat Ferrel <[email protected]> wrote:

>
> On Jul 14, 2014, at 12:10 PM, Anand Avati <[email protected]> wrote:
>
> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <[email protected]>
> wrote:
>
>> In the application, the number of rows will always be increased by adding
>> blank rows. I don’t think a shuffle is necessary in this case because there
>> is no actual row and no data in the DRM; it’s just needed to make the
>> cardinality match, and the IDs will take care of data matching. Maybe calling
>> it something else is a good idea to emphasize the special case for its
>> use. I went over this with Dmitriy and, though I haven’t checked actual
>> values on large datasets, it works.
>>
>
>
> Does that mean the cardinality is faked at the logical layer with no
> changes at the engine level? Does that mean the physical operators need to
> be prepared to handle non-matching matrix multiplication by assuming the
> missing rows or columns are 0's? Does that really work with no changes?
>
>
> Yes, Dmitriy recently confirmed this. But it is not faked; it is just not
> possible to calculate it from data in some cases, since “does not exist” may
> mean “= 0”.
>
> I’m no R expert, but base R seems to assume a 0 value actually exists in
> the matrix, encoded in as much space as its type dictates, like dense
> things in Mahout. I think there are R packages that add support for sparse
> things (slam?), so I assume this is one place where some rethinking is
> required:
> http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/
>
> Sparse linear algebra with the added complication of foreign IDs makes for
> some odd cases. The number of external/foreign IDs for rows and columns
> defines the true cardinality, even though in a sparse matrix the empty row
> or column is just absent. In Mahout the IDs are row and column numbers, so
> there are cases where the real-world cardinality does not match the number
> of Mahout IDs or the DRM cardinality. The calculations should be fine with
> that as long as the real-world dimensions are supplied for cardinality
> checking and for the various values calculated from cardinality.
>
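A minimal sketch of the point above, using only the Scala standard library. SparseRows, withNrow and atb are hypothetical names for illustration and are not Mahout's DRM API. It assumes a row-keyed sparse matrix in which an absent row means "all zeros", so the declared row cardinality can grow without moving any data, and A'B only needs the declared cardinalities to agree.

```scala
// Hypothetical stand-in for a row-keyed sparse matrix; NOT Mahout's DRM.
// rows maps a row index to that row's non-zero (column -> value) entries.
final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]]) {
  require(rows.keys.forall(r => r >= 0 && r < nrow), "row key outside declared cardinality")

  // Declare a larger row cardinality; the stored rows are left untouched.
  def withNrow(n: Int): SparseRows = {
    require(n >= nrow, "cardinality may only grow")
    copy(nrow = n)
  }
}

object SparseRows {
  // Computes A' * B as a map from (column of A, column of B) to value.
  // Rows absent from either side contribute nothing, i.e. they act as zeros.
  def atb(a: SparseRows, b: SparseRows): Map[(Int, Int), Double] = {
    require(a.nrow == b.nrow, s"row cardinality mismatch: ${a.nrow} vs ${b.nrow}")
    val products = for {
      (r, aRow) <- a.rows.toSeq
      bRow      <- b.rows.get(r).toSeq
      (i, x)    <- aRow.toSeq
      (j, y)    <- bRow.toSeq
    } yield ((i, j), x * y)
    products.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}

object CardinalityDemo {
  // A has data for rows 0 and 1 only; B has data for rows 0 and 2.
  val a = SparseRows(2, 2, Map(0 -> Map(0 -> 1.0), 1 -> Map(1 -> 2.0)))
  val b = SparseRows(3, 2, Map(0 -> Map(0 -> 3.0), 2 -> Map(1 -> 5.0)))
  // Widen A to B's row cardinality, then multiply.
  val c = SparseRows.atb(a.withNrow(3), b)
}
```

Only row 0 has data on both sides, so the product holds a single non-zero entry. Without the withNrow call the cardinality check rejects the multiplication, which is the mismatch the blank rows are meant to avoid.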
> I’m maintaining a mapping of external ID to/from Mahout ID. For instance,
> in item similarity its use is where the rows have a key = user ID. For a
> single application the row space is defined by all user IDs. In the cross
> similarity A’B, reading in A may not identify every user, so if B finds more
> or different ones then the union of the two is our best guess at the total.
> In fact more matrices could be added, and the user ID space is all users
> seen in all of the data. If we knew how many users are defined in the
> application we could also use that, but it’s not needed if there is no data
> at all for some users.
>
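The mapping described above could be sketched roughly as follows. IdDictionary is a hypothetical helper, not the actual Mahout code; it assumes external IDs are strings and that Mahout IDs are dense Ints assigned in first-seen order. Sharing one dictionary across A and B makes the row cardinality the size of the union of users seen.

```scala
import scala.collection.mutable

// Hypothetical helper, NOT part of Mahout: maps external IDs to dense Int IDs
// in first-seen order and translates back again.
final class IdDictionary {
  private val toInternal = mutable.HashMap.empty[String, Int]
  private val toExternal = mutable.ArrayBuffer.empty[String]

  // Returns the existing Int ID, or assigns the next free one.
  def idFor(external: String): Int =
    toInternal.getOrElseUpdate(external, {
      toExternal += external
      toExternal.size - 1
    })

  def externalFor(id: Int): String = toExternal(id)

  // Number of distinct external IDs seen so far.
  def size: Int = toExternal.size
}

object IdDictionaryDemo {
  val users = new IdDictionary
  // Users seen while reading A.
  val usersInA = Seq("u1", "u2").map(users.idFor)
  val rowsSeenInA = users.size
  // B repeats one user and finds two more.
  val usersInB = Seq("u2", "u3", "u4").map(users.idFor)
  // Best guess at the shared row cardinality: the union of both inputs.
  val rowCardinality = users.size
}
```

Here reading A alone suggests 2 rows, while the union over A and B gives 4, so A would need its row cardinality raised to 4 before computing A’B.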
> BTW this mapping seems to be one of the biggest generators of questions on
> the lists. The above issue is one that would likely further trip up users
> generating their own ID mapping, which is why we are finally doing it for
> them.
>
>
> This sounds like a need to introduce a new R-like rbind() operator. This
> way you could fix up row cardinality like:
>
>  drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
>
>
> true, add an empty slice.
>
> You could already do this, though in a twisted way:
>
>  drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>
>
>
> Yes, and I can dig deeper and do another drmWrap, constructing a new larger
> matrix.
>
> Still, changing the number of rows on sparse matrices is so much simpler,
> but I think "drm.nrow = " may hide the special nature of what we are doing.
>
>
So finally, how are you changing the cardinality? I just want to make sure the
h2o engine "works" with that technique.
