No, I don't think multiplication is affected by missing rows (though I'm
not sure). I'm saying prevention is better than cure. Pat's original
problem was that A and B had incompatible dimensions, because missing rows
had resulted in a decremented nrow value. So, by solving an AewScalar
problem with missing rows, we would be adding new code for a problem which
nobody has yet encountered in the real world, while the original problem
stays "unfixed". Instead it might be safer to say missing rows are not
supported, and avoid new code which nobody will use in the end.

Int-keyed DRM validations are best done (optionally) at the output of
mapBlock() (compare the old int key array with the new one), and at drmHDFS
if the on-disk drm had missing rows to begin with. Otherwise there is no
chance for rows to go missing (drmParallelize() guarantees this; the other
operators do too).
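
For example, an optional check at the mapBlock() output could look roughly
like this (a sketch only -- the block op here is a stand-in, and the opt-in
plumbing for the check is left out):

  // run the block function, then (optionally) verify the int keys survived
  val B = A.mapBlock(ncol = A.ncol) { case (keys, block) =>
    block += 1.0                      // stand-in for the real per-block work
    val newKeys = keys                // a real op might compute new keys here
    require(newKeys.sameElements(keys), "mapBlock() changed the int row keys")
    (newKeys, block)
  }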

On Mon, Jul 21, 2014 at 4:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

> you are saying the multiplication code is affected by this too?
>
> On Mon, Jul 21, 2014 at 4:02 PM, Anand Avati <[email protected]> wrote:
>
> > On Mon, Jul 21, 2014 at 3:54 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > ok so it looks like ew scalar needs to do the all-rows test in the case
> > > of DrmLike[Int], i.e. the lazy prop nrow == count (and nrow in this case
> > > gives max(key) + 1). If the test doesn't hold, then it needs to
> > > pre-process by cogrouping with a parallelized 0 until nrow and insert
> > > rows where they are missing. Some care needs to be taken with the
> > > co-grouping shuffle, as usual, based on the expected number of added
> > > rows.
> > >
> > > ok this is probably an easy fix.
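> > >
> > > roughly, something like this (untested sketch -- the A.rdd / drmWrap
> > > plumbing here is illustrative):
> > >
> > >   val n = A.nrow.toInt
> > >   val full = sc.parallelize(0 until n).map(_ -> ())
> > >   val fixed = drmWrap(
> > >     A.rdd.cogroup(full).map { case (key, (rows, _)) =>
> > >       // keep the existing row; synthesize an empty sparse one if absent
> > >       key -> rows.headOption.getOrElse(
> > >         new SequentialAccessSparseVector(A.ncol))
> > >     },
> > >     ncol = A.ncol)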
> >
> >
> > Well, we could add this code (though I still don't see a practical use for
> > it). But I don't see how this will help Pat's cause. His problem is fixing
> > up cardinality for matrix multiplication. Should we add such cardinality
> > fixup for all operators? If we have to keep fixing up cardinality for every
> > operator which has the dimension check (like matrix multiplication), why
> > even "support" missing rows to begin with? It only seems to be adding extra
> > ifs and elses. Instead, specify the requirement to have full rows for
> > Int-keyed DRMs and avoid the code cruft to fix up things.
> >
> >
> > >
> > > On Mon, Jul 21, 2014 at 3:50 PM, Anand Avati <[email protected]> wrote:
> > >
> > > > On Mon, Jul 21, 2014 at 3:46 PM, Pat Ferrel <[email protected]> wrote:
> > > >
> > > > > And the conversion to Matrix instantiates the new rows; why not the
> > > > > conversion to Dense?
> > > >
> > > >
> > > > I addressed this point too in my previous emails. The problem with
> > > > plus(n) is not about sparse vs dense. It is in-core Matrix vs DRM.
> > > >
> > > > Thanks
> > > >
> > > > > On Jul 21, 2014, at 3:41 PM, Anand Avati <[email protected]> wrote:
> > > > >
> > > > > On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <[email protected]> wrote:
> > > > >
> > > > > > If you do drm.plus(1) this converts to a dense matrix, which is
> > > > > > what the result must be anyway, and does add the scalar to all
> > > > > > rows, even missing ones.
> > > > > >
> > > > > >
> > > > > Pat, I mentioned this in my previous email already. drm.plus(1)
> > > > > completely misses the point. It converts the DRM into an in-core
> > > > > matrix and applies the plus() method on Matrix. The result is a
> > > > > Matrix, not a DRM.
> > > > >
> > > > > drm.plus(1) is EXACTLY the same as:
> > > > >
> > > > >   val m: Matrix = drm.collect  // what the implicit conversion does
> > > > >   m.plus(1)
> > > > >
> > > > > The implicit def drm2InCore() syntactic sugar is probably turning out
> > > > > to be dangerous in this case, in terms of hinting the wrong meaning.
> > > > >
> > > > > Thanks
> > > > >
> > > > > > On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > > perhaps just compare row count with max(key)? that's exactly what
> > > > > > lazy nrow() currently does in this case.
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > >>
> > > > > >> ok. so it should be easy to fix at least everything but
> > > > > >> elementwise scalar, i guess.
> > > > > >>
> > > > > >> Since the notion of "missing rows" is only defined for int-keyed
> > > > > >> datasets, ew scalar technically should work for non-int-keyed
> > > > > >> datasets already.
> > > > > >>
> > > > > >> as for int-keyed datasets, i am not sure what the best strategy
> > > > > >> is. Obviously, one can define a sort of normalization/validation
> > > > > >> routine for int-keyed datasets, but it would be fairly expensive
> > > > > >> to run "just because". Perhaps there's a cheap test (as cheap as a
> > > > > >> row count job) to run when a matrix is first created to check int
> > > > > >> key consistency.
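> > > > > >>
> > > > > >> something like this, perhaps (untested sketch; assumes the keys
> > > > > >> are distinct and non-negative, and that A.rdd exposes the backing
> > > > > >> (key, vector) pairs):
> > > > > >>
> > > > > >>   // with distinct non-negative int keys, no rows are missing
> > > > > >>   // iff row count == max(key) + 1; two cheap jobs over the keys
> > > > > >>   val keys = A.rdd.map(_._1)
> > > > > >>   val consistent = keys.count() == keys.max() + 1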
> > > > > >>
> > > > > >> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
> > > > > >>
> > > > > >>>
> > > > > >>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >>>
> > > > > >>>>
> > > > > >>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
> > > > > >>>>
> > > > > >>>>> Dmitriy, comments inline -
> > > > > >>>>>
> > > > > >>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >>>>>
> > > > > >>>>>> And no, i suppose it is ok to have "missing" rows even in the
> > > > > >>>>>> case of int-keyed matrices.
> > > > > >>>>>>
> > > > > >>>>>> there's one thing that you probably should be aware of in this
> > > > > >>>>>> context though: many algorithms don't survive empty (row-less)
> > > > > >>>>>> partitions, in whatever way they may come to be. Other than
> > > > > >>>>>> that, I don't feel every row must be present -- even if
> > > > > >>>>>> there's an implied order of the rows.
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> I'm not sure if that is necessarily true. There are three
> > > > > >>>>> operators which break pretty badly with missing rows.
> > > > > >>>>>
> > > > > >>>>> AewScalar - an operation like A + 1 is just not applied on the
> > > > > >>>>> missing row, so the final matrix will have 0's in place of 1's.
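> > > > > >>>>>
> > > > > >>>>> e.g., hypothetically (the drmWrap usage and key set here are
> > > > > >>>>> made up for illustration):
> > > > > >>>>>
> > > > > >>>>>   // int-keyed DRM over 3 rows, but key 1 is absent in the RDD
> > > > > >>>>>   val A = drmWrap(sc.parallelize(
> > > > > >>>>>     Seq[(Int, Vector)](0 -> dvec(0, 0), 2 -> dvec(0, 0))), ncol = 2)
> > > > > >>>>>   val B = (A + 1.0).collect
> > > > > >>>>>   // B(1, ::) is still (0, 0), not (1, 1) -- the + 1 closure
> > > > > >>>>>   // never ran for the missing row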
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> Indeed. i have no recourse at this point.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> AewB, CbindAB - the function after cogroup() throws an
> > > > > >>>>> exception if a row is present in only one matrix. So I guess it
> > > > > >>>>> is OK to have missing rows as long as both A and B have the
> > > > > >>>>> exact same missing row set. Somewhat quirky/nuanced
> > > > > >>>>> requirement.
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> Agree. i actually was not aware that's the cogroup() semantics
> > > > > >>>> in spark. I thought it would have outer join semantics (as in
> > > > > >>>> Pig, i believe). Alas, no recourse at this point either.
> > > > > >>>>
> > > > > >>>
> > > > > >>> The exception is actually during reduceLeft after cogroup().
> > > > > >>> Cogroup() itself is probably an outer-join.
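> > > > > >>>
> > > > > >>> which is easy to check in plain spark (illustrative snippet):
> > > > > >>>
> > > > > >>>   val a = sc.parallelize(Seq(0 -> "a0", 2 -> "a2"))
> > > > > >>>   val b = sc.parallelize(Seq(0 -> "b0", 1 -> "b1", 2 -> "b2"))
> > > > > >>>   a.cogroup(b).collect()
> > > > > >>>   // key 1 comes back with an empty left Iterable; calling
> > > > > >>>   // .head or reduceLeft on that empty side is what throws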
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
