On Mon, Jul 21, 2014 at 3:46 PM, Pat Ferrel <[email protected]> wrote:
> And the conversion to Matrix instantiates the new rows, so why not the
> conversion to Dense?

I addressed this point too in my previous emails. The problem with plus(n)
is not about sparse vs dense. It is in-core Matrix vs DRM.

Thanks

> On Jul 21, 2014, at 3:41 PM, Anand Avati <[email protected]> wrote:
>
> On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <[email protected]> wrote:
>
> > If you do drm.plus(1), this converts to a dense matrix, which is what the
> > result must be anyway, and does add the scalar to all rows, even missing
> > ones.
>
> Pat, I mentioned this in my previous email already. drm.plus(1) completely
> misses the point. It converts the DRM into an in-core matrix and applies the
> plus() method on Matrix. The result is a Matrix, not a DRM.
>
> drm.plus(1) is EXACTLY the same as:
>
> Matrix m = drm.collect()
> m.plus(1)
>
> The implicit def drm2InCore() syntactic sugar is probably turning out to be
> dangerous in this case, in terms of hinting the wrong meaning.
>
> Thanks
>
> > On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > perhaps just compare row count with max(key)? that's exactly what lazy
> > nrow() currently does in this case.
> >
> > On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> ok. so it should be easy to fix at least everything but elementwise
> >> scalar, i guess.
> >>
> >> Since the notion of "missing rows" is only defined for int-keyed datasets,
> >> ew scalar technically should work for non-int-keyed datasets already.
> >>
> >> as for int-keyed datasets, i am not sure what the best strategy is.
> >> Obviously, one can define a sort of normalization/validation routine for
> >> int-keyed datasets, but it would be fairly expensive to run "just because".
> >> Perhaps there's a cheap test (as cheap as a row count job) to run for
> >> int-key consistency when a matrix is first created.
> >>
> >> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
> >>
> >>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]>
> >>> wrote:
> >>>
> >>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
> >>>>
> >>>>> Dmitriy, comments inline -
> >>>>>
> >>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>>
> >>>>>> And no, i suppose it is ok to have "missing" rows even in the case of
> >>>>>> int-keyed matrices.
> >>>>>>
> >>>>>> there's one thing that you probably should be aware of in this context
> >>>>>> though: many algorithms don't survive empty (row-less) partitions, in
> >>>>>> whatever way they may come to be. Other than that, I don't feel every
> >>>>>> row must be present -- even if there's an implied order of the rows.
> >>>>>
> >>>>> I'm not sure if that is necessarily true. There are three operators
> >>>>> which break pretty badly with missing rows.
> >>>>>
> >>>>> AewScalar - an operation like A + 1 is just not applied on the missing
> >>>>> rows, so the final matrix will have 0's in place of 1's.
> >>>>
> >>>> Indeed. i have no recourse at this point.
> >>>>
> >>>>> AewB, CbindAB - the function after cogroup() throws an exception if a
> >>>>> row was present in only one matrix. So I guess it is OK to have missing
> >>>>> rows as long as both A and B have the exact same missing row set.
> >>>>> Somewhat quirky/nuanced requirement.
> >>>>
> >>>> Agree. i actually was not aware that's the cogroup() semantics in Spark.
> >>>> I thought it would have outer-join semantics (as in Pig, i believe).
> >>>> Alas, no recourse at this point either.
> >>>
> >>> The exception is actually during reduceLeft after cogroup(). Cogroup()
> >>> itself is probably an outer join.
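
To make the in-core vs DRM distinction above concrete, here is a minimal
sketch in the Scala DSL (assuming the usual math-scala and drm imports; the
variable names are illustrative, not from any existing code):

    // drm.plus(1) goes through the implicit drm2InCore conversion, so it is
    // equivalent to collecting the whole distributed matrix to the driver:
    val m: Matrix = drm.collect()   // in-core copy of the DRM on the driver
    val r: Matrix = m.plus(1.0)     // plain in-core op; result is a Matrix, not a DRM

    // the distributed elementwise-scalar op keeps the work on the cluster:
    val drmB = drm + 1.0            // AewScalar, applied per existing row -- which is
                                    // exactly why "missing" rows never see the +1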
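
As for the cheap int-key consistency test Dmitriy mentions, here is a rough
sketch of the row-count-vs-max(key) comparison, assuming the int-keyed DRM is
backed by a Spark RDD[(Int, Vector)] named rows (the name and the direct RDD
access are illustrative assumptions):

    // one pass over the data: count the rows and find the largest key
    val (cnt, maxKey) = rows
      .map { case (k, _) => (1L, k) }
      .reduce { case ((c1, k1), (c2, k2)) => (c1 + c2, math.max(k1, k2)) }

    // with 0-based int keys, a gap-free matrix has cnt == maxKey + 1;
    // anything else means some row ids are "missing"
    val hasMissingRows = cnt != maxKey + 1L

This is about as expensive as a row count job, so it could plausibly run once
when an int-keyed matrix is first created rather than before every operation.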
