Lazy nrow actually computes max + 1 in the case of DrmLike[Int] and count in
every other case. So it would seem to me it should compute both max and
count and assert max == count - 1 (iff count > 0) whenever an int-keyed
matrix comes into being (which means drmWrap, the loaders, and perhaps some
corner cases in something like the big gram matrix computation). How, or
whether, we want to repair this, I am not sure. We should be able to
parallelize a 0 until nrow RDD and cogroup the rows with it, I guess, if a
fix is really needed.
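
Something along these lines, maybe -- a rough sketch only, against plain
Spark, where fillMissingRows, drmRdd, nrow and ncol are just placeholders
for whatever ends up backing the int-keyed matrix:

import org.apache.mahout.math.{DenseVector, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical repair: cogroup the data with the full 0 until nrow key
// range and fill any key that has no row with an all-zero row.
def fillMissingRows(sc: SparkContext, drmRdd: RDD[(Int, Vector)],
                    nrow: Int, ncol: Int): RDD[(Int, Vector)] = {
  val allKeys: RDD[(Int, Unit)] = sc.parallelize(0 until nrow).map(k => (k, ()))
  allKeys.cogroup(drmRdd).map { case (key, (_, rows)) =>
    // keys with no row on the data side come back with an empty Iterable
    key -> rows.headOption.getOrElse(new DenseVector(ncol))
  }
}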


On Mon, Jul 21, 2014 at 3:37 PM, Anand Avati <[email protected]> wrote:

> Easy to fix, except for elementwise scalar and the problems associated with
> missing partitions (if the missing rows amount to that extent).
>
> As far as the consistency check goes, maybe a combination of max, count and
> sum computed in a single reduce sweep would be reasonably sufficient?
>
> max == nrow - 1
> count == nrow
> sum == nrow * (nrow - 1) / 2 /* since keys are 0-based */
>
> It might be expensive to run after every int-DRM creation.
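
A rough sketch of that single sweep (assuming drmRdd is the (Int key ->
row vector) RDD backing the matrix, nrow its declared row count, and the
usual org.apache.spark.SparkContext._ pair-RDD implicits are in scope):

// one pass over the keys: (max key, key count, key sum);
// reduce assumes a non-empty matrix
val (maxKey, count, sum) = drmRdd.keys
  .map(k => (k.toLong, 1L, k.toLong))
  .reduce((a, b) => (math.max(a._1, b._1), a._2 + b._2, a._3 + b._3))

val consistent =
  maxKey == nrow - 1 &&
  count == nrow &&
  sum == nrow.toLong * (nrow - 1) / 2 // keys are 0-based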
>
> On Mon, Jul 21, 2014 at 3:23 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> Perhaps just compare the row count with max(key)? That's exactly what
>> lazy nrow() currently does in this case.
>>
>> On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>
>>>
>>> OK, so it should be easy to fix at least everything but elementwise
>>> scalar, I guess.
>>>
>>> Since the notion of "missing rows" is only defined for int-keyed
>>> datasets, elementwise scalar technically should work for non-int-keyed
>>> datasets already.
>>>
>>> As for int-keyed datasets, I am not sure what the best strategy is.
>>> Obviously, one can define some sort of normalization/validation routine
>>> for int-keyed datasets, but it would be fairly expensive to run "just
>>> because". Perhaps there's a cheap test (as cheap as a row count job) to
>>> run for int key consistency when the matrix is first created.
>>>
>>>
>>>
>>> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Dmitriy, comments inline -
>>>>>>
>>>>>>  On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> And no, I suppose it is OK to have "missing" rows even in the case
>>>>>>> of int-keyed matrices.
>>>>>>>
>>>>>>> There's one thing that you probably should be aware of in this
>>>>>>> context, though: many algorithms don't survive empty (row-less)
>>>>>>> partitions, in whatever way they may come to be. Other than that, I
>>>>>>> don't feel every row must be present -- even if there's an implied
>>>>>>> order of the rows.
>>>>>>>
>>>>>>
>>>>>> I'm not sure that is necessarily true. There are three operators
>>>>>> that break pretty badly with missing rows.
>>>>>>
>>>>>> AewScalar - an operation like A + 1 is just not applied to the
>>>>>> missing rows, so the final matrix will have 0s in place of 1s.
>>>>>>
>>>>>
>>>>> Indeed. I have no recourse at this point.
>>>>>
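
A toy illustration of that, with made-up data (plain Spark + Mahout math,
sc and the pair-RDD implicits assumed in scope): the "+ 1" closure only
maps over rows that physically exist, so a logically present but physically
missing row never sees it and stays an implicit zero row in the result.

import org.apache.mahout.math.DenseVector

val rows = sc.parallelize(Seq(
  0 -> new DenseVector(Array(1.0, 2.0)),
  2 -> new DenseVector(Array(3.0, 4.0))))    // row 1 is missing
val plusOne = rows.mapValues(_.plus(1.0))    // row 1 is simply never touched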
>>>>>
>>>>>>
>>>>>> AewB, CbindAB - the function after cogroup() throws an exception if a
>>>>>> row was present in only one matrix. So I guess it is OK to have
>>>>>> missing rows as long as both A and B have exactly the same missing
>>>>>> row set. A somewhat quirky/nuanced requirement.
>>>>>>
>>>>>
>>>>> Agreed. I actually was not aware that those are the cogroup()
>>>>> semantics in Spark; I thought it would have outer-join semantics (as
>>>>> in Pig, I believe). Alas, no recourse at this point either.
>>>>>
>>>>
>>>> The exception is actually during the reduceLeft after cogroup();
>>>> cogroup() itself is probably an outer join.
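
A toy illustration of that failure mode, with made-up data (plain Spark,
sc and the pair-RDD implicits assumed in scope): cogroup() keeps key 1 even
though it exists on only one side, but the per-key Iterable for the other
side is empty, so reducing it throws.

val a = sc.parallelize(Seq(0 -> 1.0, 1 -> 2.0))
val b = sc.parallelize(Seq(0 -> 10.0))             // key 1 has no row in b
val sums = a.cogroup(b).mapValues { case (as, bs) =>
  // for key 1, bs is empty and this throws
  // UnsupportedOperationException: empty.reduceLeft
  as.reduceLeft(_ + _) + bs.reduceLeft(_ + _)
}
sums.collect()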
>>>>
>>>>
>>>>
>>>
>>
>
