Hi,

On further thought, I concur that leaf expressions like AttributeRefs can always be considered deterministic: like a Java variable, the value an AttributeRef holds is invariant within an iteration (except when changed by some deterministic logic). In that sense, what I reported as an issue in my previous mail is incorrect.

However, I think AttributeRef should have a boolean method that tells whether the value it represents comes from an indeterminate source or not.

Regards,
Asif

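The distinction drawn above can be sketched with a toy expression tree. This is only an illustration under stated assumptions: the classes below are hypothetical stand-ins written for this example, not Spark's Catalyst classes, and `from_indeterminate_source` is the proposed boolean method, not an existing API.

```python
class Expr:
    """Toy expression node (hypothetical; not Spark's Expression)."""
    children = ()

    @property
    def deterministic(self):
        # A node is deterministic when all of its children are.
        return all(c.deterministic for c in self.children)

    @property
    def from_indeterminate_source(self):
        # True if this node, or anything it is built from, is indeterministic.
        return (not self.deterministic) or any(
            c.from_indeterminate_source for c in self.children
        )


class Rand(Expr):
    """Leaf that yields a different value on each evaluation."""

    @property
    def deterministic(self):
        return False


class Literal(Expr):
    def __init__(self, value):
        self.value = value


class Alias(Expr):
    def __init__(self, child, name):
        self.children = (child,)
        self.name = name


class AttributeRef(Expr):
    """Leaf that refers to an alias by name.

    It has no children, so `deterministic` is always True, but it
    remembers whether the aliased expression was indeterministic.
    """

    def __init__(self, alias):
        self.name = alias.name
        self._source_indeterminate = alias.from_indeterminate_source

    @property
    def from_indeterminate_source(self):
        return self._source_indeterminate


salted_key = AttributeRef(Alias(Rand(), "salted_key"))
plain_key = AttributeRef(Alias(Literal(1), "id"))

print(salted_key.deterministic, salted_key.from_indeterminate_source)
print(plain_key.deterministic, plain_key.from_indeterminate_source)
```

In this sketch both attributes report `deterministic` as true, while only the one derived from `Rand` reports an indeterminate source, which is the extra signal an operator such as a join or a shuffle could check.
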
On Fri, Jan 24, 2025 at 5:18 PM Asif Shahid <asif.sha...@gmail.com> wrote:

> Hi,
> While testing a use case where the query had an outer join such that the
> joining key of the left outer table had either a valid value or a random
> value (salting to avoid skew).
> The case was reported to have incorrect results in case of node failure,
> with retry.
> On debugging the code, I have found the following, which has left me
> confused as to what Spark's strategy for indeterministic fields is.
> Some serious issues are:
> 1) All the leaf expressions, like AttributeReference, are always
> considered deterministic. This means that if an attribute points to an
> Alias which is itself indeterministic, the attribute will still be
> considered deterministic.
> 2) In CheckAnalysis there is code which checks whether each operator
> supports indeterministic values or not. Join is not included in the
> list of supporting operators, but it passes even if the joining key
> points to an indeterministic alias. (When I tried fixing it, I found a
> plethora of operators failing, like DeserializedObject, LocalRelation
> etc., which are not supposed to contain indeterministic attributes
> because they are not in the list of supporting operators.)
> 3) The ShuffleDependency does not check for the indeterministic nature of
> the partitioner (I fixed it locally and then realized that there is
> bug #1, which needs to be fixed too).
>
> The code in DagScheduler / TaskSet, TaskScheduler etc. seems to have been
> written keeping in mind the indeterministic nature of the previous and
> current stages, so as to re-execute previous stages as a whole instead of
> just the missing tasks, but the above 3 points do not seem to support the
> code of DagScheduler / TaskScheduler.
>
> Regards
> Asif
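
Why a random (salted) partitioning key can give wrong results when only the missing map task is re-run can be shown with a small simulation. This is a toy model in plain Python, not Spark code: `map_task`, `rows_seen_after_partial_retry` and `salted` are hypothetical names made up for this sketch, and the two reducers stand in for the tasks of the downstream stage.

```python
import random

NUM_REDUCERS = 2


def map_task(rows, key_fn):
    """Partition rows by key, as a single shuffle map task would."""
    out = {r: [] for r in range(NUM_REDUCERS)}
    for row in rows:
        out[key_fn(row) % NUM_REDUCERS].append(row)
    return out


def rows_seen_after_partial_retry(rows, key_fn_attempt1, key_fn_attempt2):
    """Reducer 0 reads the output of attempt 1. The map output is then
    lost (node failure), only the map task is re-run, and reducer 1
    reads the output of attempt 2."""
    first = map_task(rows, key_fn_attempt1)
    second = map_task(rows, key_fn_attempt2)
    return sorted(first[0] + second[1])


def salted(seed):
    """Key function that ignores the row and returns a random salt."""
    rng = random.Random(seed)
    return lambda row: rng.randrange(NUM_REDUCERS)


rows = list(range(8))

# Deterministic key: both attempts partition identically, nothing is lost.
stable_ok = rows_seen_after_partial_retry(rows, lambda r: r, lambda r: r) == rows

# Random key: each attempt partitions differently, so rows can be dropped
# or duplicated across the two reducers.
corrupted_runs = sum(
    rows_seen_after_partial_retry(rows, salted(2 * s), salted(2 * s + 1)) != rows
    for s in range(50)
)
print(stable_ok, corrupted_runs)
```

With the deterministic key the two reducers together always see every row exactly once. With the random key that only holds if both attempts happen to assign every row to the same reducer, which is why the whole upstream stage, and not just the missing task, has to be re-executed in that case.
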