On Mon, Jul 21, 2014 at 3:46 PM, Pat Ferrel <[email protected]> wrote:
> And the conversion to Matrix instantiates the new rows, so why not the
> conversion to Dense?

I addressed this point too in my previous emails. The problem with plus(n)
is not about sparse vs dense. It is in-core Matrix vs DRM.

Thanks

> On Jul 21, 2014, at 3:41 PM, Anand Avati <[email protected]> wrote:
>
> On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <[email protected]> wrote:
>
> > If you do drm.plus(1), this converts to a dense matrix, which is what the
> > result must be anyway, and does add the scalar to all rows, even missing
> > ones.
>
> Pat, I mentioned this in my previous email already. drm.plus(1) completely
> misses the point. It converts the DRM into an in-core matrix and applies the
> plus() method on Matrix. The result is a Matrix, not a DRM.
>
> drm.plus(1) is EXACTLY the same as:
>
> Matrix m = drm.collect()
> m.plus(1)
>
> The implicit def drm2InCore() syntactic sugar is probably turning out to be
> dangerous in this case, in terms of hinting the wrong meaning.
>
> Thanks
>
> > On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > perhaps just compare row count with max(key)? that's exactly what lazy
> > nrow() currently does in this case.
> >
> > On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> ok. so it should be easy to fix at least everything but elementwise
> >> scalar, i guess.
> >>
> >> Since the notion of "missing rows" is only defined for int-keyed datasets,
> >> ew scalar technically should work for non-int-keyed datasets already.
> >>
> >> as for int-keyed datasets, i am not sure what the best strategy is.
> >> Obviously, one can define a sort of normalization/validation routine for
> >> int-keyed datasets, but it would be fairly expensive to run "just because".
> >> Perhaps there's a cheap test (as cheap as a row count job) to run for
> >> int-key consistency when a matrix is first created.
> >>
> >> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
> >>
> >>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]>
> >>> wrote:
> >>>
> >>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
> >>>>
> >>>>> Dmitriy, comments inline -
> >>>>>
> >>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>>
> >>>>>> And no, i suppose it is ok to have "missing" rows even in the case of
> >>>>>> int-keyed matrices.
> >>>>>>
> >>>>>> there's one thing that you probably should be aware of in this context
> >>>>>> though: many algorithms don't survive empty (row-less) partitions, in
> >>>>>> whatever way they may come to be. Other than that, I don't feel every
> >>>>>> row must be present -- even if there's an implied order of the rows.
> >>>>>
> >>>>> I'm not sure if that is necessarily true. There are three operators
> >>>>> which break pretty badly with missing rows.
> >>>>>
> >>>>> AewScalar - an operation like A + 1 is just not applied on the missing
> >>>>> rows, so the final matrix will have 0's in place of 1's.
> >>>>
> >>>> Indeed. i have no recourse at this point.
> >>>>
> >>>>> AewB, CbindAB - the function after cogroup() throws an exception if a
> >>>>> row was present in only one matrix. So I guess it is OK to have missing
> >>>>> rows as long as both A and B have the exact same missing row set.
> >>>>> Somewhat quirky/nuanced requirement.
> >>>>
> >>>> Agree. i actually was not aware that's the cogroup() semantics in Spark.
> >>>> I thought it would have outer-join semantics (as in Pig, i believe).
> >>>> Alas, no recourse at this point either.
> >>>
> >>> The exception is actually during reduceLeft after cogroup(). Cogroup()
> >>> itself is probably an outer join.
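
To make the in-core vs DRM distinction above concrete, here is a minimal
sketch in the Scala DSL (assuming the usual math-scala and drm imports; the
variable names are illustrative, not from any existing code):

    // drm.plus(1) goes through the implicit drm2InCore conversion, so it is
    // equivalent to collecting the whole distributed matrix to the driver:
    val m: Matrix = drm.collect()   // in-core copy of the DRM on the driver
    val r: Matrix = m.plus(1.0)     // plain in-core op; result is a Matrix, not a DRM

    // the distributed elementwise-scalar op keeps the work on the cluster:
    val drmB = drm + 1.0            // AewScalar, applied per existing row -- which is
                                    // exactly why "missing" rows never see the +1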
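
As for the cheap int-key consistency test Dmitriy mentions, here is a rough
sketch of the row-count-vs-max(key) comparison, assuming the int-keyed DRM is
backed by a Spark RDD[(Int, Vector)] named rows (the name and the direct RDD
access are illustrative assumptions):

    // one pass over the data: count the rows and find the largest key
    val (cnt, maxKey) = rows
      .map { case (k, _) => (1L, k) }
      .reduce { case ((c1, k1), (c2, k2)) => (c1 + c2, math.max(k1, k2)) }

    // with 0-based int keys, a gap-free matrix has cnt == maxKey + 1;
    // anything else means some row ids are "missing"
    val hasMissingRows = cnt != maxKey + 1L

This is about as expensive as a row count job, so it could plausibly run once
when an int-keyed matrix is first created rather than before every operation.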
