Lazy nrow actually computes max + 1 in the case of DrmLike[Int], and count in every other case. So it would seem to me it should compute both max and count, and assert max == count - 1 (iff count > 0) whenever an int-keyed matrix comes into being (which is drmWrap, the loaders, and perhaps some corner cases in something like a big Gram matrix computation). How/if we want to repair this, i am not sure. If a fix is really needed, we should be able to parallelize a 0 until nrow rdd and cogroup the rows with it, i guess.
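
Something like this is roughly what i have in mind for that normalization (just a sketch, not tested; it assumes the matrix is available as an RDD[(Int, Vector)] and that padding missing keys with empty rows is acceptable -- fillMissingRows and the parameter names are made up):

    import org.apache.mahout.math.{DenseVector, Vector}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch: complete the 0 until nrow key range of an int-keyed drm by
    // cogrouping the actual rows with a parallelized key range and padding
    // the keys that come back empty with zero rows.
    def fillMissingRows(rows: RDD[(Int, Vector)], nrow: Int, ncol: Int)
                       (implicit sc: SparkContext): RDD[(Int, Vector)] = {

      // every key a "dense" int-keyed matrix is supposed to have
      val allKeys = sc.parallelize(0 until nrow).map(_ -> ())

      // keep the existing vector where present, otherwise emit an all-zero row
      allKeys.cogroup(rows).map { case (key, (_, vecs)) =>
        key -> vecs.headOption.getOrElse(new DenseVector(ncol): Vector)
      }
    }

That would also kill the empty-partition problem only if followed by a repartition, so it is not free.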
On Mon, Jul 21, 2014 at 3:37 PM, Anand Avati <[email protected]> wrote:

> easy to fix except elementwise scalar and the problems associated with
> missing partitions (if the missing rows amount to that extent).
>
> As far as a consistency check, maybe a combination of max, count and sum
> computed in a single reduce sweep should be reasonably sufficient?
>
> max == nrow - 1
> count == nrow
> sum == nrow * (nrow - 1) / 2  /* since keys are 0-based */
>
> It might be expensive to run after all int-DRM creation.
>
> On Mon, Jul 21, 2014 at 3:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> perhaps just compare row count with max(key)? that's exactly what lazy
>> nrow() currently does in this case.
>>
>> On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> ok. so it should be easy to fix at least everything but elementwise
>>> scalar, i guess.
>>>
>>> Since the notion of "missing rows" is only defined for int-keyed
>>> datasets, ew scalar technically should work for non-int-keyed datasets
>>> already.
>>>
>>> as for int-keyed datasets, i am not sure what the best strategy is.
>>> Obviously, one can define a sort of normalization/validation routine for
>>> int-keyed datasets, but it would be fairly expensive to run "just
>>> because". Perhaps there's a cheap test (as cheap as a row count job) to
>>> run for int key consistency when a matrix is first created.
>>>
>>> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
>>>
>>>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
>>>>>
>>>>>> Dmitriy, comments inline -
>>>>>>
>>>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> And no, i suppose it is ok to have "missing" rows even in the case
>>>>>>> of int-keyed matrices.
>>>>>>>
>>>>>>> there's one thing that you probably should be aware of in this
>>>>>>> context though: many algorithms don't survive empty (row-less)
>>>>>>> partitions, in whatever way they may come to be. Other than that, I
>>>>>>> don't feel every row must be present -- even if there's an implied
>>>>>>> order of the rows.
>>>>>>
>>>>>> I'm not sure if that is necessarily true. There are three operators
>>>>>> which break pretty badly with missing rows.
>>>>>>
>>>>>> AewScalar - an operation like A + 1 is just not applied on the
>>>>>> missing row, so the final matrix will have 0s in place of 1s.
>>>>>
>>>>> Indeed. i have no recourse at this point.
>>>>>
>>>>>> AewB, CbindAB - the function after cogroup() throws an exception if a
>>>>>> row was present in only one matrix. So I guess it is OK to have
>>>>>> missing rows as long as both A and B have the exact same missing row
>>>>>> set. Somewhat quirky/nuanced requirement.
>>>>>
>>>>> Agree. i actually was not aware that's the cogroup() semantics in
>>>>> spark. I thought it would have outer-join semantics (as in Pig, i
>>>>> believe). Alas, no recourse at this point either.
>>>>
>>>> The exception is actually during reduceLeft after cogroup(). Cogroup()
>>>> itself is probably an outer-join.
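
P.S. for concreteness, the single-sweep max/count/sum check Anand suggests above could be one reduce over the keys, roughly along these lines (a sketch only; the RDD[(Int, Vector)] handle and the helper name are made up, and it assumes a non-empty matrix since reduce on an empty RDD throws):

    import org.apache.mahout.math.Vector
    import org.apache.spark.rdd.RDD

    // Sketch: one pass over the keys computing (max, count, sum) and comparing
    // against what a dense 0-based keying of nrow rows would yield.
    def intKeysConsistent(rows: RDD[(Int, Vector)], nrow: Long): Boolean = {
      val (maxKey, count, sum) = rows
        .map { case (k, _) => (k.toLong, 1L, k.toLong) }
        .reduce { case ((m1, c1, s1), (m2, c2, s2)) =>
          (math.max(m1, m2), c1 + c2, s1 + s2)
        }

      maxKey == nrow - 1 &&
        count == nrow &&
        sum == nrow * (nrow - 1) / 2  // keys are 0-based
    }

Still amounts to a full pass over the data after every int-keyed matrix creation, so the "expensive to run just because" concern stands.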
