Hi,
While testing a use case where the query had an outer join, and the joining key of the left outer table held either a valid value or a random value (salting to avoid skew), the query was reported to produce incorrect results when a node failure triggered a retry.
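
For reference, a minimal sketch of the kind of query involved (the session setup, tables and column names are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq((Some(1), "a"), (Some(2), "b"), (None, "c")).toDF("key", "lval")
val right = Seq((1, "x"), (2, "y")).toDF("key", "rval")

// Salt the join key: keep it where valid, otherwise substitute a random
// value so skewed/null keys spread across partitions. rand() is
// non-deterministic, so a retried stage can produce different salt values.
val salted = left.withColumn("joinKey",
  when($"key".isNotNull, $"key").otherwise((rand() * 1000).cast("int")))

salted.join(right, salted("joinKey") === right("key"), "left_outer").show()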
On debugging the code, I have found the following, which has left me confused as to what Spark's strategy for non-deterministic fields actually is. The serious issues are:
1) All the leaf expressions, like AttributeReference, are always considered deterministic. This means that if an attribute points to an Alias which is itself non-deterministic, the attribute will still be considered deterministic (see the first sketch after this list).
2) In CheckAnalysis there is code which checks whether each operator supports non-deterministic expressions or not. Join is not included in the list of supported operators, yet analysis passes even when the joining key points to a non-deterministic alias (second sketch below). When I tried fixing it, I found a plethora of operators failing, like DeserializeToObject, LocalRelation etc., which are not supposed to contain non-deterministic attributes (because they are not in the list of supporting operators).
3) The ShuffleDependency does not check the deterministic nature of the partitioner (I fixed it locally, and then realized that bug #1 needs to be fixed too; the third sketch below shows the kind of partitioner I mean).
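
To make point 1 concrete, here is a small probe against the Catalyst internals (internal API, so the exact constructors are version-dependent; tried in a REPL):

import org.apache.spark.sql.catalyst.expressions.{Alias, Literal, Rand}

val salt = Alias(Rand(Literal(42L)), "salt")()  // Alias over a non-deterministic child
val ref  = salt.toAttribute                     // the AttributeReference downstream operators see

salt.deterministic  // false, as expected
ref.deterministic   // true -- a leaf expression reports deterministic unconditionally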
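
For point 2, the same trick at the DataFrame level (toy tables, reusing the spark session and imports from the first sketch):

import org.apache.spark.sql.functions._
import spark.implicits._

val l = Seq(1, 2, 3).toDF("lkey")
val r = Seq(1, 2).toDF("rkey")

// rand() directly in the join condition is rejected by CheckAnalysis:
//   l.join(r, (rand() * 10).cast("int") === $"rkey")   // AnalysisException

// but routed through an alias it slips past the same rule, because the Join
// only sees an AttributeReference that claims to be deterministic:
val saltedL = l.withColumn("salt", (rand() * 10).cast("int"))
saltedL.join(r, $"salt" === $"rkey").show()             // analyzes and runs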
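
And for point 3, the kind of partitioner I mean: one whose getPartition is itself non-deterministic, so re-running a single map task re-buckets its records (illustrative only; the class name is mine):

import org.apache.spark.Partitioner
import scala.util.Random

class RandomSaltPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case null => Random.nextInt(parts)                  // salt null keys randomly
    case k    => ((k.hashCode % parts) + parts) % parts // stable for real keys
  }
}

// pairRdd.partitionBy(new RandomSaltPartitioner(8)) -- nothing in
// ShuffleDependency flags this as unsafe to retry partially.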

The code in DAGScheduler / TaskSet / TaskScheduler etc. seems to have been written keeping in mind the non-deterministic nature of the previous and current stages, so as to re-execute previous stages as a whole instead of just the missing tasks, but given the above 3 points the expression layer never actually feeds that information to the DAGScheduler / TaskScheduler.
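
For what it's worth, the RDD layer does expose this notion: an RDD can override getOutputDeterministicLevel (a protected, experimental API) to report INDETERMINATE, and the scheduler then rolls back and re-runs the whole stage on a fetch failure. A sketch of a pass-through wrapper that forces that level (the class name is mine):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Pass-through RDD that declares its output INDETERMINATE, steering the
// DAGScheduler toward the whole-stage-retry path described above.
class IndeterminateRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {
  override def getPartitions: Array[Partition] = firstParent[T].partitions
  override def compute(split: Partition, ctx: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, ctx)
  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}

If the three points above hold, plans with non-deterministic join keys presumably never surface this level, which would explain the incorrect results on retry.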

Regards
Asif
