You are saying the multiplication operators are affected by this too?
On Mon, Jul 21, 2014 at 4:02 PM, Anand Avati <[email protected]> wrote:

> On Mon, Jul 21, 2014 at 3:54 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > ok so it looks like ew scalar needs to do the all-rows test in case of
> > DrmLike[Int], i.e. lazy props nrow == count (and nrow in this case gives
> > max(key)+1). if the test doesn't hold, then it needs to pre-process by
> > cogrouping with parallelizing 0 until nrow and insert stuff where it is
> > missing. Some care needs to be taken with the co-grouping shuffle, as
> > usual, based on the expected number of added rows.
> >
> > ok this is probably an easy fix.
>
> Well, we could add this code (though I still don't see a practical use for
> it). But I don't see how this will help Pat's cause. His problem is fixing
> up cardinality for matrix multiplication. Should we add such a cardinality
> fixup for all operators? If we have to keep fixing up cardinality for
> every operator which has the dimension check (like matrix multiplication),
> why even "support" missing rows to begin with? It only seems to be adding
> extra ifs and elses. Instead, specify the requirement to have full rows
> for Int-keyed DRMs and avoid the code cruft to fix things up.
>
> > On Mon, Jul 21, 2014 at 3:50 PM, Anand Avati <[email protected]> wrote:
> >
> > > On Mon, Jul 21, 2014 at 3:46 PM, Pat Ferrel <[email protected]> wrote:
> > >
> > > > And the conversion to Matrix instantiates the new rows, why not the
> > > > conversion to Dense?
> > >
> > > I addressed this point too in my previous emails. The problem with
> > > plus(n) is not about sparse vs dense. It is in-core Matrix vs DRM.
> > >
> > > Thanks
> > >
> > > > On Jul 21, 2014, at 3:41 PM, Anand Avati <[email protected]> wrote:
> > > >
> > > > On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <[email protected]> wrote:
> > > >
> > > > > If you do drm.plus(1) this converts to a dense matrix, which is
> > > > > what the result must be anyway, and does add the scalar to all
> > > > > rows, even missing ones.
> > > >
> > > > Pat, I mentioned this in my previous email already. drm.plus(1)
> > > > completely misses the point. It converts the DRM into an in-core
> > > > matrix and applies the plus() method on Matrix. The result is a
> > > > Matrix, not a DRM.
> > > >
> > > > drm.plus(1) is EXACTLY the same as:
> > > >
> > > > Matrix m = drm.collect()
> > > > m.plus(1)
> > > >
> > > > The implicit def drm2InCore() syntactic sugar is probably turning
> > > > out to be dangerous in this case, in terms of hinting the wrong
> > > > meaning.
> > > >
> > > > Thanks
> > > >
> > > > > On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > perhaps just compare row count with max(key)? that's exactly what
> > > > > lazy nrow() currently does in this case.
> > > > >
> > > > > On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > ok. so it should be easy to fix at least everything but
> > > > > > elementwise scalar i guess.
> > > > > >
> > > > > > Since the notion of "missing rows" is only defined for int-keyed
> > > > > > datasets, ew scalar technically should work for non-int-keyed
> > > > > > datasets already.
> > > > > >
> > > > > > as for int-keyed datasets, i am not sure what the best strategy
> > > > > > is. Obviously, one can define some sort of normalization/
> > > > > > validation routine for int-keyed datasets, but it would be
> > > > > > fairly expensive to run "just because". Perhaps there's a cheap
> > > > > > test (as cheap as a row count job) to run to test for int key
> > > > > > consistency when a matrix is first created.
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
> > > > > >
> > > > > > > On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Dmitriy, comments inline -
> > > > > > > > >
> > > > > > > > > On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > And no, i suppose it is ok to have "missing" rows even
> > > > > > > > > > in the case of int-keyed matrices.
> > > > > > > > > >
> > > > > > > > > > there's one thing that you probably should be aware of
> > > > > > > > > > in this context though: many algorithms don't survive
> > > > > > > > > > empty (row-less) partitions, in whatever way they may
> > > > > > > > > > come to be. Other than that, I don't feel every row must
> > > > > > > > > > be present -- even if there's an implied order of the
> > > > > > > > > > rows.
> > > > > > > > >
> > > > > > > > > I'm not sure that is necessarily true. There are three
> > > > > > > > > operators which break pretty badly with missing rows.
> > > > > > > > >
> > > > > > > > > AewScalar - an operation like A + 1 is just not applied on
> > > > > > > > > the missing row, so the final matrix will have 0s in place
> > > > > > > > > of 1s.
> > > > > > > >
> > > > > > > > Indeed. i have no recourse at this point.
> > > > > > > >
> > > > > > > > > AewB, CbindAB - the function after cogroup() throws an
> > > > > > > > > exception if a row was present in only one matrix. So I
> > > > > > > > > guess it is OK to have missing rows as long as both A and
> > > > > > > > > B have the exact same missing row set. Somewhat quirky/
> > > > > > > > > nuanced requirement.
> > > > > > > >
> > > > > > > > Agree. i actually was not aware that's the cogroup()
> > > > > > > > semantics in spark. I thought it would have outer join
> > > > > > > > semantics (as in Pig, i believe). Alas, no recourse at this
> > > > > > > > point either.
> > > > > > >
> > > > > > > The exception is actually during reduceLeft after cogroup().
> > > > > > > Cogroup() itself is probably an outer-join.
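For concreteness, here is a rough sketch of the fixup Dmitriy describes in his 3:54 PM message, written against plain Spark RDDs rather than the actual DRM operator plumbing. fixMissingRows, drmRdd and ncol are made-up names for illustration only, and the snippet assumes Spark's pair-RDD implicits are in scope; a real implementation would live behind the logical operators.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (keys, cogroup)
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{SequentialAccessSparseVector, Vector}

// Sketch: ensure an int-keyed row matrix has a row for every key in
// 0 until nrow, where nrow = max(key) + 1, as lazy nrow() computes it.
def fixMissingRows(drmRdd: RDD[(Int, Vector)], ncol: Int)
                  (implicit sc: SparkContext): RDD[(Int, Vector)] = {
  val nrow = drmRdd.keys.max() + 1

  // Cheap all-rows test: the physical row count equals max(key) + 1
  // exactly when no int keys are missing.
  if (drmRdd.count() == nrow) {
    drmRdd
  } else {
    // Cogroup (outer-join semantics) with the complete key range and fill
    // the gaps with empty sparse rows. As noted in the thread, the shuffle
    // cost here deserves care when only a handful of rows are missing.
    val allKeys = sc.parallelize(0 until nrow).map(i => (i, ()))
    allKeys.cogroup(drmRdd).map { case (key, (_, vecs)) =>
      key -> vecs.headOption.getOrElse(new SequentialAccessSparseVector(ncol))
    }
  }
}
```

Dmitriy's cheaper alternative (compare the row count with max(key), which lazy nrow() already computes) is exactly the `count == nrow` branch; the cogroup shuffle is only paid when that test fails.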

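And to make the drm.plus(1) point concrete, a small sketch in the Mahout Scala DSL. It assumes an implicit DistributedContext is in scope; drmA is a made-up example matrix, and the comments restate what the thread says rather than tested behavior.

```scala
import org.apache.mahout.math.Matrix
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val drmA: DrmLike[Int] = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)

// Distributed elementwise scalar (the AewScalar operator): the result is
// still a DrmLike[Int], but the op runs only over rows that are physically
// present, so a "missing" row silently keeps its implicit zeros.
val drmB = drmA + 1.0

// drm.plus(1), by contrast, only compiles because the implicit
// drm2InCore() conversion collects the DRM first. It is exactly:
val m: Matrix = drmA.collect
val mPlus: Matrix = m.plus(1.0)
// The in-core Matrix materializes every row up to nrow, so plus() touches
// the "missing" rows too -- and the result is a Matrix, not a DRM.
```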