Hi Asif,

Could you provide an example (code + dataset) to analyze this? It looks
interesting ...


Regards,
Ángel

On Sun, Jan 26, 2025 at 20:58, Asif Shahid (<asif.sha...@gmail.com>)
wrote:

> Hi,
> On further thought, I concur that leaf expressions like AttributeRefs can
> always be considered deterministic: as a Java variable, the value
> contained in it is invariant per iteration (except when changed by some
> deterministic logic). So, in that sense, what I described as an issue in
> my earlier mail is incorrect.
> But I think that AttributeRef should have a boolean method which tells
> whether the value it represents comes from an indeterminate source or not.
> Regards
> Asif
>
>
>
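A minimal sketch of the proposal above, written as a toy expression tree in plain Python rather than against Spark's Catalyst classes. The class names only mirror the ones mentioned in the thread; the `source` back-pointer and the `from_indeterminate_source` property are hypothetical and are not part of Spark's API.

```python
from dataclasses import dataclass
from typing import Optional


class Expr:
    """Toy expression node (an illustration, not Spark's Expression)."""

    @property
    def deterministic(self) -> bool:
        raise NotImplementedError


@dataclass(frozen=True)
class Literal(Expr):
    value: object

    @property
    def deterministic(self) -> bool:
        return True


@dataclass(frozen=True)
class Rand(Expr):
    @property
    def deterministic(self) -> bool:
        # Produces a different value on every evaluation.
        return False


@dataclass(frozen=True)
class Alias(Expr):
    child: Expr
    name: str

    @property
    def deterministic(self) -> bool:
        # An alias is only as deterministic as the expression it names.
        return self.child.deterministic


@dataclass(frozen=True)
class AttributeRef(Expr):
    name: str
    source: Optional[Expr] = None  # hypothetical link to the producing expression

    @property
    def deterministic(self) -> bool:
        # Reading an already-computed slot is repeatable, so the leaf
        # itself stays deterministic.
        return True

    @property
    def from_indeterminate_source(self) -> bool:
        # Hypothetical extra flag: was the value produced by something
        # that may differ between task attempts?
        return self.source is not None and not self.source.deterministic


salted = AttributeRef("salted_key", source=Alias(Rand(), "salted_key"))
plain = AttributeRef("id", source=Alias(Literal(1), "id"))
```

In this model both references report `deterministic` as true, and only `salted` reports `from_indeterminate_source` as true, which is the distinction the follow-up mail asks for.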
> On Fri, Jan 24, 2025 at 5:18 PM Asif Shahid <asif.sha...@gmail.com> wrote:
>
>> Hi,
>> While testing a use case where the query had an outer join in which the
>> joining key of the left outer table had either a valid value or a random
>> value (salting to avoid skew), the query was reported to produce
>> incorrect results in case of node failure with retry.
>> On debugging the code, I found the following, which has left me confused
>> as to what Spark's strategy is for indeterministic fields.
>> Some serious issues are:
>> 1) All leaf expressions, like AttributeReference, are always considered
>> deterministic. This means that if an attribute points to an Alias which
>> is itself indeterministic, the attribute will still be considered
>> deterministic.
>> 2) In CheckAnalysis there is code which checks whether each operator
>> supports indeterministic values or not. Join is not included in the list
>> of supported operators, but it passes even if the joining key points to
>> an indeterministic alias. (When I tried fixing it, I found a plethora of
>> operators failing, like DeserializedObject, LocalRelation, etc., which
>> are not supposed to contain indeterministic attributes because they are
>> not in the list of supporting operators.)
>> 3) The ShuffleDependency does not check for the indeterministic nature of
>> the partitioner (I fixed it locally and then realized that bug #1 needs
>> to be fixed too).
>>
>> The code in DagScheduler / TaskSet, TaskScheduler, etc. seems to have been
>> written keeping in mind the indeterministic nature of the previous and
>> current stages, so as to re-execute previous stages as a whole instead of
>> just the missing tasks, but the above 3 points do not seem to support the
>> code of DagScheduler / TaskScheduler.
>>
>> Regards
>> Asif
>>
>>
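For readers following the thread, here is a small, Spark-free simulation of the kind of failure described in the original report. It is only a sketch under stated assumptions: two map tasks, two reducers, and a "salt" modelled as `(row + attempt) % NUM_REDUCERS` so that the run is reproducible (a real job would use a random value). All names are made up for illustration; none of this is Spark code.

```python
from collections import Counter

NUM_REDUCERS = 2


def run_map_task(rows, attempt):
    """Bucket rows by a salted key whose value changes between attempts.

    (row + attempt) is a reproducible stand-in for a random salt.
    """
    buckets = {p: [] for p in range(NUM_REDUCERS)}
    for row in rows:
        buckets[(row + attempt) % NUM_REDUCERS].append(row)
    return buckets


map_inputs = [[0, 1, 2, 3], [4, 5, 6, 7]]
expected = Counter(row for rows in map_inputs for row in rows)

# First attempt of both map tasks.
outputs = [run_map_task(rows, attempt=0) for rows in map_inputs]

# Reducer 0 completes before the failure and keeps what it fetched.
reducer0 = outputs[0][0] + outputs[1][0]

# The node holding map task 1's output is lost; only that task is retried,
# and the retry buckets its rows differently.
outputs[1] = run_map_task(map_inputs[1], attempt=1)

# Reducer 1 runs after the retry and sees the new bucketing.
reducer1 = outputs[0][1] + outputs[1][1]

partial_retry = Counter(reducer0 + reducer1)
duplicated = sorted(r for r, n in partial_retry.items() if n > 1)
lost = sorted(r for r in expected if r not in partial_retry)

# Re-running every reducer against one consistent set of map outputs
# yields each row exactly once again.
full_rerun = Counter(
    row for out in outputs for p in range(NUM_REDUCERS) for row in out[p]
)
```

With these inputs, `duplicated` is `[4, 6]` and `lost` is `[5, 7]`, while `full_rerun` matches `expected`. That is the reason a scheduler has to know that a shuffle key is indeterminate: retrying only the missing task is safe only when the retried task reproduces the same bucketing.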
