On Mon, Jul 21, 2014 at 3:54 PM, Dmitriy Lyubimov <[email protected]> wrote:
> ok so it looks like ew scalar needs to do the all-rows test in the case of
> DrmLike[Int], i.e. lazy props nrow == count (and nrow in this case gives
> max(key)+1). if the test doesn't hold, then it needs to pre-process by
> cogrouping with parallelizing 0 until nrow and inserting stuff where it is
> missing. Some care needs to be taken with the co-grouping shuffle, as
> usual, based on the expected number of added rows.
>
> ok this is probably an easy fix.

Well, we could add this code (though I still don't see a practical use for
it). But I don't see how this will help Pat's cause. His problem is fixing
up cardinality for matrix multiplication. Should we add such a cardinality
fixup for all operators? If we have to keep fixing up cardinality for every
operator which has a dimension check (like matrix multiplication), why even
"support" missing rows to begin with? It only seems to be adding extra ifs
and elses. Instead, specify the requirement to have full rows for Int-keyed
DRMs and avoid the code cruft to fix things up.
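For the record, a rough sketch of the fixup being discussed -- hypothetical
helper code, not anything that exists in Mahout -- written against the DRM's
underlying Spark row RDD rather than the operator code itself. It assumes the
rows are an RDD[(Int, Vector)] and that nrow = max(key) + 1, which is what
lazy nrow computes for Int-keyed DRMs:

    import org.apache.mahout.math.{SequentialAccessSparseVector, Vector}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: complete an Int-keyed row RDD by inserting an
    // empty sparse row for every key in [0, nrow) that is missing.
    def fixupMissingRows(rows: RDD[(Int, Vector)], nrow: Int, ncol: Int)
        : RDD[(Int, Vector)] = {
      // The cheap all-rows test: a complete matrix has exactly nrow rows.
      if (rows.count() == nrow) {
        rows // nothing missing -- skip the shuffle entirely
      } else {
        // Cogroup with every key that should exist. cogroup() has
        // outer-join semantics, so a missing row simply shows up as an
        // empty Iterable on the right-hand side.
        val allKeys = rows.sparkContext.parallelize(0 until nrow).map(k => (k, ()))
        allKeys.cogroup(rows).map { case (k, (_, vs)) =>
          k -> vs.headOption.getOrElse(new SequentialAccessSparseVector(ncol))
        }
      }
    }

The count() pre-test keeps the common complete-matrix case down to a single
cheap job, which is the "as cheap as a row count job" check suggested
further down the thread.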
> On Mon, Jul 21, 2014 at 3:50 PM, Anand Avati <[email protected]> wrote:
>
> > On Mon, Jul 21, 2014 at 3:46 PM, Pat Ferrel <[email protected]> wrote:
> >
> > > And the conversion to Matrix instantiates the new rows, why not the
> > > conversion to Dense?
> >
> > I addressed this point too in my previous emails. The problem with
> > plus(n) is not about sparse vs dense. It is in-core Matrix vs DRM.
> >
> > Thanks
> >
> > > On Jul 21, 2014, at 3:41 PM, Anand Avati <[email protected]> wrote:
> > >
> > > > On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <[email protected]> wrote:
> > > >
> > > > > If you do drm.plus(1) this converts to a dense matrix, which is
> > > > > what the result must be anyway, and does add the scalar to all
> > > > > rows, even missing ones.
> > > >
> > > > Pat, I mentioned this in my previous email already. drm.plus(1)
> > > > completely misses the point. It converts the DRM into an in-core
> > > > matrix and applies the plus() method on Matrix. The result is a
> > > > Matrix, not a DRM.
> > > >
> > > > drm.plus(1) is EXACTLY the same as:
> > > >
> > > >     Matrix m = drm.collect()
> > > >     m.plus(1)
> > > >
> > > > The implicit def drm2InCore() syntactic sugar is probably turning
> > > > out to be dangerous in this case, in terms of hinting the wrong
> > > > meaning.
> > > >
> > > > Thanks
> > > >
> > > > > On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > perhaps just compare row count with max(key)? that's exactly
> > > > > > what lazy nrow() currently does in this case.
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > > > ok. so it should be easy to fix at least everything but
> > > > > > > elementwise scalar, I guess.
> > > > > > >
> > > > > > > Since the notion of "missing rows" is only defined for
> > > > > > > int-keyed datasets, ew scalar technically should work for
> > > > > > > non-int-keyed datasets already.
> > > > > > >
> > > > > > > As for int-keyed datasets, I am not sure what the best
> > > > > > > strategy is. Obviously, one can define a sort of
> > > > > > > normalization/validation routine for an int-keyed dataset,
> > > > > > > but it would be fairly expensive to run "just because".
> > > > > > > Perhaps there's a cheap test (as cheap as a row count job)
> > > > > > > to run to check int key consistency when a matrix is first
> > > > > > > created.
> > > > > > >
> > > > > > > On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Dmitriy, comments inline -
> > > > > > > > > >
> > > > > > > > > > On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > And no, I suppose it is ok to have "missing" rows
> > > > > > > > > > > even in the case of int-keyed matrices.
> > > > > > > > > > >
> > > > > > > > > > > There's one thing that you probably should be aware
> > > > > > > > > > > of in this context though: many algorithms don't
> > > > > > > > > > > survive empty (row-less) partitions, in whatever way
> > > > > > > > > > > they may come to be. Other than that, I don't feel
> > > > > > > > > > > every row must be present -- even if there's an
> > > > > > > > > > > implied order of the rows.
> > > > > > > > > >
> > > > > > > > > > I'm not sure if that is necessarily true. There are
> > > > > > > > > > three operators which break pretty badly with missing
> > > > > > > > > > rows.
> > > > > > > > > >
> > > > > > > > > > AewScalar - an operation like A + 1 is just not applied
> > > > > > > > > > on the missing row, so the final matrix will have 0's
> > > > > > > > > > in place of 1's.
> > > > > > > > >
> > > > > > > > > Indeed. I have no recourse at this point.
> > > > > > > > >
> > > > > > > > > > AewB, CbindAB - the function after cogroup() throws an
> > > > > > > > > > exception if a row was present in only one matrix. So
> > > > > > > > > > I guess it is OK to have missing rows as long as both
> > > > > > > > > > A and B have the exact same missing row set. Somewhat
> > > > > > > > > > quirky/nuanced requirement.
> > > > > > > > >
> > > > > > > > > Agree. I actually was not aware that's the cogroup()
> > > > > > > > > semantics in Spark. I thought it would have outer-join
> > > > > > > > > semantics (as in Pig, I believe). Alas, no recourse at
> > > > > > > > > this point either.
> > > > > > > >
> > > > > > > > The exception is actually during reduceLeft after
> > > > > > > > cogroup(). Cogroup() itself is probably an outer-join.
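To make the failure mode in that last exchange concrete, here is a toy model
of an elementwise A + B over the row RDDs -- illustration only, not the
actual Mahout AewB operator. cogroup() itself is indeed an outer join, so a
key present in only one operand just comes back with an empty Iterable on
the other side; it is the per-key merge afterwards that throws, in the same
spirit as the reduceLeft mentioned above:

    import org.apache.mahout.math.Vector
    import org.apache.spark.rdd.RDD

    // Toy elementwise A + B over Int-keyed row RDDs.
    def ewPlus(a: RDD[(Int, Vector)], b: RDD[(Int, Vector)]): RDD[(Int, Vector)] =
      a.cogroup(b).map { case (key, (as, bs)) =>
        // If `key` is missing from one matrix, the corresponding Iterable
        // is empty and .head throws NoSuchElementException -- hence the
        // "same missing row set" requirement for A and B.
        key -> (as.head plus bs.head)
      }

With the fixup sketched earlier applied to both operands (or with a
full-rows requirement enforced when the matrix is created), both Iterables
are guaranteed non-empty and the merge is safe.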

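For completeness on the drm.plus(1) point argued mid-thread, the equivalence
spelled out as a snippet. This is schematic and version-dependent: dense(),
drmParallelize() and the implicit drm2InCore conversion are the standard
Samsara entry points, and mahoutSparkContext() supplies the implicit
distributed context they need.

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.sparkbindings._

    implicit val ctx: DistributedContext =
      mahoutSparkContext(masterUrl = "local[2]", appName = "plus-demo")

    val drmA = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)

    // This compiles only because the implicit drm2InCore conversion kicks
    // in: the whole DRM is collect()ed to the driver and plus() runs on
    // the in-core Matrix. The result is a Matrix, not a DRM.
    val m1: Matrix = drmA.plus(1)

    // ...which is exactly the same as writing it out explicitly:
    val m2: Matrix = drmA.collect.plus(1)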