Hi,
On further thought, I agree that leaf expressions like AttributeReference can
always be considered deterministic: like a Java variable, the value a leaf
holds is invariant within an iteration (unless it is changed by some
deterministic logic). So in that sense, what I described as an issue in my
earlier mail is incorrect.
But I think that AttributeReference should have a boolean method which tells
whether the value it represents comes from an indeterminate source or not.
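
To make the suggestion concrete, here is a minimal sketch in plain Java
(16+, for records). It is a toy model and not Spark's Catalyst code: Expr,
Rand, Alias, AttrRef, refTo and fromIndeterminateSource are all hypothetical
names assumed purely for illustration. It only shows how a leaf can stay
"deterministic" while still reporting that its value traces back to an
indeterministic alias.

```java
import java.util.List;

// Toy model only: these are NOT Spark's Catalyst classes.
class DeterminismSketch {

    interface Expr {
        List<Expr> children();

        // Rule discussed in this thread: an expression is deterministic
        // if all of its children are. A leaf has no children, so it is
        // always reported as deterministic.
        default boolean deterministic() {
            return children().stream().allMatch(Expr::deterministic);
        }

        // Hypothetical extra method: true if the value is produced by, or
        // derived from, an indeterministic expression.
        default boolean fromIndeterminateSource() {
            return !deterministic()
                || children().stream().anyMatch(Expr::fromIndeterminateSource);
        }
    }

    // Stand-in for a random-valued expression such as a salt.
    record Rand() implements Expr {
        public List<Expr> children() { return List.of(); }
        public boolean deterministic() { return false; }
    }

    record Alias(Expr child, String name) implements Expr {
        public List<Expr> children() { return List.of(child); }
    }

    // Leaf that refers to a named output of a child operator. The record
    // accessor fromIndeterminateSource() overrides the interface default.
    record AttrRef(String name, boolean fromIndeterminateSource) implements Expr {
        public List<Expr> children() { return List.of(); }
    }

    // Assumed helper: build a reference that carries the alias's flag.
    static AttrRef refTo(Alias alias) {
        return new AttrRef(alias.name(), alias.fromIndeterminateSource());
    }

    public static void main(String[] args) {
        Alias salted = new Alias(new Rand(), "salted_key");
        AttrRef ref = refTo(salted);
        System.out.println(salted.deterministic());        // false
        System.out.println(ref.deterministic());           // true
        System.out.println(ref.fromIndeterminateSource()); // true
    }
}
```

With something like this, a check on a join key or a partitioner could ask
the reference itself whether its value is from an indeterminate source,
without changing the meaning of "deterministic" for leaf expressions.
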
Regards
Asif



On Fri, Jan 24, 2025 at 5:18 PM Asif Shahid <asif.sha...@gmail.com> wrote:

> Hi,
> I was testing a use case where the query had an outer join in which the
> joining key of the left outer table had either a valid value or a random
> value (salting to avoid skew).
> The case was reported to produce incorrect results in case of node failure
> with retry.
> On debugging the code, I found the following, which has left me confused as
> to what Spark's strategy for indeterministic fields is.
> Some serious issues are:
> 1) All leaf expressions, like AttributeReference, are always
> considered deterministic. This means that if an attribute points to an
> Alias which is itself indeterministic, the attribute will still be
> considered deterministic.
> 2) In CheckAnalysis there is code which checks whether each operator
> supports indeterministic values or not. Join is not included in the
> list of supported operators, but it passes even if the joining key points
> to an indeterministic alias. (When I tried fixing it, I found a plethora of
> operators failing, like DeserializedObject, LocalRelation etc., which are
> not supposed to contain indeterministic attributes because they are not in
> the list of supporting operators.)
> 3) ShuffleDependency does not check for the indeterministic nature of the
> partitioner (I fixed it locally and then realized that bug #1 above
> needs to be fixed too).
>
> The code in DagScheduler / TaskSet, TaskScheduler etc. seems to have been
> written keeping in mind the indeterministic nature of the previous and
> current stages, so as to re-execute previous stages as a whole instead of
> just the missing tasks, but the above 3 points do not seem to support the
> code of DagScheduler / TaskScheduler.
>
> Regards
> Asif
>
>
