Since the RDD is not being changed, why are there missing partitions?

On Jul 21, 2014, at 3:37 PM, Anand Avati <[email protected]> wrote:
easy to fix except elementwise scalar and the problems associated with
missing partitions (if the missing rows amount to that extent).

As far as a consistency check goes, maybe a combination of max, count,
and sum computed in a single reduce sweep would be reasonably
sufficient?

    max == nrow - 1
    count == nrow
    sum == nrow * (nrow - 1) / 2   /* since keys are 0-based */

It might be expensive to run after every int-DRM creation, though.

On Mon, Jul 21, 2014 at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:

> perhaps just compare row count with max(key)? that's exactly what lazy
> nrow() currently does in this case.
>
> On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> OK, so it should be easy to fix at least everything but elementwise
>> scalar, I guess.
>>
>> Since the notion of "missing rows" is only defined for int-keyed
>> datasets, ew scalar should technically work for non-int-keyed
>> datasets already.
>>
>> As for int-keyed datasets, I am not sure what the best strategy is.
>> Obviously, one could define some sort of normalization/validation
>> routine for int-keyed datasets, but it would be fairly expensive to
>> run "just because". Perhaps there's a cheap test (as cheap as a
>> row-count job) to run for int-key consistency when a matrix is first
>> created.
>>
>> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
>>
>>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]>
>>>> wrote:
>>>>
>>>>> Dmitriy, comments inline -
>>>>>
>>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> And no, I suppose it is OK to have "missing" rows even in the
>>>>>> case of int-keyed matrices.
>>>>>>
>>>>>> There's one thing you should probably be aware of in this
>>>>>> context, though: many algorithms don't survive empty (row-less)
>>>>>> partitions, in whatever way they may come to be. Other than
>>>>>> that, I don't feel every row must be present -- even if there's
>>>>>> an implied order to the rows.
>>>>>
>>>>> I'm not sure that is necessarily true. There are three operators
>>>>> which break pretty badly with missing rows.
>>>>>
>>>>> AewScalar - an operation like A + 1 is simply not applied to the
>>>>> missing row, so the final matrix will have 0s in place of 1s.
>>>>
>>>> Indeed. I have no recourse at this point.
>>>>
>>>>> AewB, CbindAB - the function after cogroup() throws an exception
>>>>> if a row was present in only one matrix. So I guess it is OK to
>>>>> have missing rows as long as both A and B have the exact same
>>>>> missing-row set. A somewhat quirky/nuanced requirement.
>>>>
>>>> Agreed. I actually was not aware that was the cogroup() semantics
>>>> in Spark. I thought it would have outer-join semantics (as in Pig,
>>>> I believe). Alas, no recourse at this point either.
>>>
>>> The exception is actually during reduceLeft after cogroup();
>>> cogroup() itself is probably an outer join.
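For concreteness, here is a minimal sketch of the single-sweep check
Anand describes, written against a plain Spark RDD[(Int, _)] rather
than Mahout's actual DRM API (the function name and signature are
hypothetical):

    import org.apache.spark.rdd.RDD

    // Check the three invariants for a dense, 0-based int key space
    // in a single reduce sweep:
    //   max == nrow - 1
    //   count == nrow
    //   sum == nrow * (nrow - 1) / 2
    def intKeysLookConsistent[V](drm: RDD[(Int, V)]): Boolean = {
      // Fold every key into a (max, count, sum) triple simultaneously,
      // so the whole check costs one pass over the data.
      // (Note: reduce throws on an empty RDD.)
      val (maxKey, count, sum) = drm.keys
        .map(k => (k.toLong, 1L, k.toLong))
        .reduce { case ((m1, c1, s1), (m2, c2, s2)) =>
          (math.max(m1, m2), c1 + c2, s1 + s2)
        }
      // Take count as the authoritative nrow and test the other two
      // invariants against it.
      val nrow = count
      maxKey == nrow - 1 && sum == nrow * (nrow - 1) / 2
    }

As noted in the thread, this is a heuristic rather than a proof of key
consistency, but it is no more expensive than the row-count job that
lazy nrow() already runs.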

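Likewise, a tiny standalone reproduction of the AewB/CbindAB failure
mode discussed above (toy code, not the actual operator
implementation): cogroup() itself outer-joins the two RDDs, and the
per-key reduceLeft then fails on whichever side has an empty value
sequence.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("missing-row-demo"))

    val a = sc.parallelize(Seq(0 -> 1.0, 1 -> 2.0, 2 -> 3.0))
    val b = sc.parallelize(Seq(0 -> 10.0, 2 -> 30.0)) // row 1 is missing

    // cogroup() yields (key, (valuesFromA, valuesFromB)) for the union
    // of keys, with an empty collection on the side that lacks the row.
    val sums = a.cogroup(b).mapValues { case (as, bs) =>
      // For key 1, bs is empty here, so bs.reduceLeft throws
      // ("empty.reduceLeft") -- the exception described in the thread.
      as.reduceLeft(_ + _) + bs.reduceLeft(_ + _)
    }

    sums.collect() // fails unless A and B share the same missing-row set

This is why the thread concludes that missing rows are tolerable for
these operators only when both operands miss exactly the same keys.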