OK, so it should be easy to fix at least everything but elementwise scalar,
I guess.

Since the notion of "missing rows" is only defined for int-keyed datasets,
elementwise scalar technically should already work for non-int-keyed datasets.

As for int-keyed datasets, I am not sure what the best strategy is.
Obviously, one could define some sort of normalization/validation routine for
int-keyed datasets, but it would be fairly expensive to run "just because".
Perhaps there is a cheap test (as cheap as a row count job) that could check
int key consistency when the matrix is first created; a rough sketch of what
that might look like is below.
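A minimal sketch of such a check, assuming the DRM is backed by a Spark RDD of
(Int, row) pairs; the intKeysConsistent helper name is hypothetical, not an
existing Mahout API:

import org.apache.spark.rdd.RDD

// Hypothetical helper (not an existing Mahout API): checks in a single pass
// whether the int keys of a DRM-backing RDD form a gapless 0..n-1 range.
// Cost is roughly that of a count job. Assumes a non-empty RDD with distinct
// keys, which is what the DRM row-key contract implies.
def intKeysConsistent[V](rows: RDD[(Int, V)]): Boolean = {
  val (n, minKey, maxKey) = rows
    .map { case (k, _) => (1L, k, k) }
    .reduce { case ((c1, lo1, hi1), (c2, lo2, hi2)) =>
      (c1 + c2, math.min(lo1, lo2), math.max(hi1, hi2))
    }
  // with distinct keys, min == 0 and max == n - 1 imply no missing rows
  minKey == 0 && maxKey == n - 1
}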



On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <[email protected]> wrote:

>
>
>
> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>>
>>
>>
>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <[email protected]> wrote:
>>
>>> Dmitriy, comments inline -
>>>
>>>  On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> And no, I suppose it is OK to have "missing" rows even in the case of
>>>> int-keyed matrices.
>>>>
>>>> There's one thing you should probably be aware of in this context,
>>>> though: many algorithms don't survive empty (row-less) partitions, in
>>>> whatever way they may come to be. Other than that, I don't feel every row
>>>> must be present -- even if there's an implied order of the rows.
>>>>
>>>
>>> I'm not sure that is necessarily true. There are three operators
>>> which break pretty badly with missing rows.
>>>
>>> AewScalar - an operation like A + 1 is simply not applied to the missing
>>> row, so the final matrix will have 0s in place of 1s.
>>>
>>
>> Indeed. I have no recourse at this point.
>>
>>
>>>
>>> AewB, CbindAB - the function after cogroup() throws an exception if a row
>>> is present in only one matrix. So I guess it is OK to have missing rows
>>> as long as both A and B have exactly the same missing row set. Somewhat
>>> quirky/nuanced requirement.
>>>
>>
>> Agree. I actually was not aware that that is the cogroup() semantics in
>> Spark. I thought it would have outer join semantics (as in Pig, I believe).
>> Alas, no recourse at this point either.
>>
>
> The exception is actually during reduceLeft after cogroup(). Cogroup()
> itself is probably an outer-join.
>
>
>
