On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
>
>> Dmitriy, comments inline -
>>
>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> And no, i suppose it is ok to have "missing" rows even in case of
>>> int-keyed matrices.
>>>
>>> there's one thing that you probably should be aware of in this context
>>> though: many algorithms don't survive empty (row-less) partitions, in
>>> whatever way they may come to be. Other than that, I don't feel every row
>>> must be present -- even if there's implied order of the rows.
>>>
>>
>> I'm not sure if that is necessarily true. There are three operators which
>> break pretty badly with missing rows.
>>
>> AewScalar - operation like A + 1 is just not applied on the missing row,
>> so the final matrix will have 0's in place of 1s.
>>
>
> Indeed. i have no recourse at this point.
>
>>
>> AewB, CbindAB - function after cogroup() throws exception if a row was
>> present on only one matrix. So I guess it is OK to have missing rows as
>> long as both A and B have the exact same missing row set. Somewhat
>> quirky/nuanced requirement.
>>
>
> Agree. i actually was not aware that's the cogroup() semantics in spark. I
> thought it would have outer join semantics (as in Pig, i believe). Alas,
> no recourse at this point either.
>

The exception is actually thrown during reduceLeft after cogroup(). cogroup()
itself probably does have outer-join semantics.
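
To make the two failure modes concrete, here is a minimal sketch in plain Python. It is only a model under my own assumptions, not the Spark or Mahout code: matrices are dicts keyed by row index, `cogroup`, `add_scalar` and `elementwise_add` are hypothetical helpers written for this illustration, and `functools.reduce` without an initial value stands in for reduceLeft, since both fail on an empty sequence.

```python
from functools import reduce

# Model matrices as {row_key: row_vector}. Row 1 is "missing" from A.
A = {0: [1.0, 2.0], 2: [5.0, 6.0]}
B = {0: [10.0, 20.0], 1: [30.0, 40.0], 2: [50.0, 60.0]}


def add_scalar(m, s):
    # Only rows that are physically present get visited, so a missing
    # row is never touched and stays implicitly zero.
    return {k: [x + s for x in row] for k, row in m.items()}


def cogroup(a, b):
    # Outer-join style grouping: every key from either side appears,
    # paired with a possibly empty list of rows from each side.
    keys = sorted(a.keys() | b.keys())
    return {k: ([a[k]] if k in a else [], [b[k]] if k in b else [])
            for k in keys}


def add_rows(x, y):
    return [p + q for p, q in zip(x, y)]


def elementwise_add(a, b):
    out = {}
    for k, (rows_a, rows_b) in cogroup(a, b).items():
        # reduce() with no initial value raises TypeError on an empty
        # list; this is the stand-in for reduceLeft failing when a row
        # exists on only one side.
        out[k] = add_rows(reduce(add_rows, rows_a), reduce(add_rows, rows_b))
    return out
```

In this model `add_scalar(A, 1)` has no entry for row 1 at all, `cogroup(A, B)` happily returns an empty group for A at row 1, and `elementwise_add(A, B)` only blows up in the reduce step. `elementwise_add(B, B)` works because both sides have the same row set.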
