Hi Asif,

Could you provide an example (code + dataset) so we can analyze this? It looks interesting...
Regards,
Ángel

On Sun, 26 Jan 2025 at 20:58, Asif Shahid (<asif.sha...@gmail.com>) wrote:

> Hi,
> On further thought, I concur that leaf expressions like AttributeRefs can
> always be considered deterministic: like a Java variable, the value an
> AttributeRef holds within an iteration is invariant (except when changed by
> some deterministic logic). So in that sense, what I described as an issue
> in my earlier mail is incorrect.
> But I think that AttributeRef should have a boolean method which tells
> whether the value it represents comes from an indeterminate source or not.
> Regards
> Asif
>
> On Fri, Jan 24, 2025 at 5:18 PM Asif Shahid <asif.sha...@gmail.com> wrote:
>
>> Hi,
>> I was testing a use case where the query had an outer join in which the
>> joining key of the left outer table had either a valid value or a random
>> value (salting to avoid skew).
>> The case was reported to produce incorrect results when a node fails and
>> tasks are retried.
>> On debugging the code, I found the following, which has left me confused
>> as to what Spark's strategy for indeterministic fields is.
>> Some serious issues are:
>> 1) All leaf expressions, like AttributeReference, are always
>> considered deterministic. This means that if an attribute points to an
>> Alias which is itself indeterministic, the attribute will still be
>> considered deterministic.
>> 2) In CheckAnalysis there is code which checks whether each operator
>> supports indeterministic values or not. Join is not included in the
>> list of supported operators, but it passes even if the joining key points
>> to an indeterministic alias. (When I tried fixing it, I found a plethora of
>> operators failing, like DeserializedObject, LocalRelation etc., which are
>> not supposed to contain indeterministic attributes because they are not in
>> the list of supporting operators.)
>> 3) The ShuffleDependency does not check for the indeterministic nature of
>> the partitioner (I fixed it locally and then realized that bug #1 needs
>> to be fixed too).
>>
>> The code in DagScheduler / TaskSet, TaskScheduler etc. seems to have been
>> written with the indeterministic nature of the previous and current stages
>> in mind, so as to re-execute previous stages as a whole instead of just
>> the missing tasks, but the above 3 points do not seem to support the
>> code of DagScheduler / TaskScheduler.
>>
>> Regards
>> Asif
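To make point 1 and the proposed boolean method concrete, here is a toy model in plain Python. It is not Spark code: every class, field and method name below (including the `source` back-link and `from_indeterminate_source`) is hypothetical and only sketches the idea discussed in the thread, namely that a leaf reference reports itself as deterministic while an extra flag could expose where its value came from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rand:
    """Leaf that yields a different value on every evaluation."""
    deterministic: bool = False


@dataclass(frozen=True)
class Literal:
    """Leaf holding a fixed value."""
    value: int
    deterministic: bool = True


@dataclass(frozen=True)
class Alias:
    """Named expression; its determinism follows its child."""
    name: str
    child: object

    @property
    def deterministic(self) -> bool:
        return self.child.deterministic


@dataclass(frozen=True)
class AttributeRef:
    """Reference to a named column produced upstream."""
    name: str
    source: Optional[Alias] = None  # hypothetical back-link, for illustration only

    @property
    def deterministic(self) -> bool:
        # Mirrors point 1: a leaf reference always reports deterministic.
        return True

    @property
    def from_indeterminate_source(self) -> bool:
        # Sketch of the proposed flag: does the referenced value originate
        # from an indeterministic expression?
        return self.source is not None and not self.source.deterministic


# A salted join key (random) versus a plain one (fixed value).
salted = Alias("join_key", Rand())
plain = Alias("join_key", Literal(42))

salted_ref = AttributeRef("join_key", salted)
plain_ref = AttributeRef("join_key", plain)
```

In this sketch both references report `deterministic` as true, so a check based only on that property cannot tell the salted key from the plain one; the hypothetical `from_indeterminate_source` flag is what distinguishes them.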