Hi Wenchen,
Apologies... I thought it was Santosh asking for the PR. And about the race
condition... yes, the race condition causes only some partitions to be
retried. Maybe a checksum can be used as a fallback mechanism? I suppose the
checksum would be computed on the executor side? Then there might be a
condition where shuffle files could be lost before the checksum is
communicated between the driver and the executors?
Regards
Asif
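A minimal Scala sketch of the checksum-fallback idea being discussed, under
stated assumptions: the hook points, object, and method names here are
invented for illustration and are not Spark's actual shuffle / MapStatus API.

import java.util.zip.CRC32
import scala.collection.concurrent.TrieMap

// Illustrative sketch only; not the actual Spark internals.
object ChecksumFallbackSketch {

  // Driver side: last reported checksum per (shuffleId, mapPartitionId).
  private val previousChecksums = TrieMap.empty[(Int, Int), Long]

  // Executor side: checksum the serialized shuffle blocks before reporting
  // the map output. If the shuffle files (and the checksum with them) are
  // lost before being reported, the driver simply has nothing to compare
  // against and falls back to the current behavior.
  def checksumOf(blocks: Iterator[Array[Byte]]): Long = {
    val crc = new CRC32()
    blocks.foreach(b => crc.update(b))
    crc.getValue
  }

  // Driver side: when a map task is re-run, compare the new checksum with
  // the previous attempt's. A mismatch means the stage output changed, i.e.
  // it is indeterminate, so every partition of the downstream stage should
  // be retried, not just the ones that are still missing.
  def mustRetryAllPartitions(shuffleId: Int, mapId: Int, newChecksum: Long): Boolean = {
    previousChecksums.put((shuffleId, mapId), newChecksum) match {
      case Some(previous) => previous != newChecksum
      case None           => false // first attempt: nothing to compare yet
    }
  }
}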
On Thu, Feb 13, 2025 at 7:39 PM Asif Shahid <asif.sha...@gmail.com> wrote:

> The bug-repro patch, when applied on current master, will show failure
> with incorrect results.
> While on the PR branch, it will pass.
> The number of iterations in the test is 100.
>
> Regards
> Asif
>
> On Thu, Feb 13, 2025 at 7:35 PM Asif Shahid <asif.sha...@gmail.com> wrote:
>
>> Hi,
>> Following up on this issue.
>> The PR opened is:
>> 1) https://github.com/apache/spark/pull/49708
>> The PR, IMO, fixes the 3 issues which I have described below.
>> The race condition fix requires the use of a read-write lock.
>>
>> 2) Jira is:
>> https://issues.apache.org/jira/browse/SPARK-51016
>>
>> A checksum approach would result in a greater number of retries, I
>> believe; there is already a mechanism in Expression to identify
>> indeterminism.
>>
>> Anyway, the issues, which I think are multiple and which the opened PR
>> addresses, are:
>> 1) Stage.isIndeterminate is buggy, as it is not able to identify
>> indeterminacy; the fix spans files such as ShuffleDependency, RDD, etc.
>> 2) The race condition in DAGScheduler:
>>    a) Say task1, corresponding to partition P1 of an indeterminate
>>       result stage, fails with a fetch failure.
>>    b) The code rightly checks that no previous partitions of the result
>>       stage are successful yet; instead of throwing an exception and
>>       aborting, it decides to retry.
>>    c) During this retry window, that is, between checking dependencies
>>       and making the new attempt initiated by task1, task2, corresponding
>>       to partition P2 of the result stage, becomes successful.
>>    d) The new attempt now sees n-1 missing partitions instead of all of
>>       them, and that creates data corruption, as it is no longer retrying
>>       all partitions (a minimal sketch of this window appears below).
>>
>> 3) The DAGSchedulerSuite test
>>    a) test("SPARK-25341: retry all the succeeding stages when the
>>       map stage is indeterminate")
>>    has the following assertion:
>>
>>    assert(failedStages.collect {
>>      case stage: ShuffleMapStage if stage.shuffleDep.shuffleId == shuffleId2 => stage
>>    }.head.findMissingPartitions() == Seq(0))
>>
>>    If my understanding is right, and as seen from the outcome of the
>>    opened PR, since the ShuffleMapStage turns out to be indeterminate,
>>    all tasks should be retried, so the assertion's expected value should
>>    be Seq(0, 1).
>>
>>    b) If the above #a is valid, then the second test,
>>    test("SPARK-32923: handle stage failure for indeterminate map stage
>>    with push-based shuffle"),
>>    which has the assertion:
>>
>>    assert(failedStages.collect {
>>      case stage: ShuffleMapStage if stage.shuffleDep.shuffleId == shuffleId2 => stage
>>    }.head.findMissingPartitions() == Seq(0))
>>
>>    should also instead expect Seq(0, 1).
>>
>> I am also attaching a patch which I was using as a basis to debug and
>> fix. The patch modifies the source code, using inline code interception
>> and tweaks, to reproduce the data corruption issue in a unit test.
>>
>> And I am also attaching the BugTest which I used.
>>
>> The test is supposed to pass either if it encounters an exception from
>> the Spark layer with the message
>> "Please eliminate the indeterminacy by checkpointing the RDD before
>> repartition and try again",
>> or if the results of the retry are as expected (12 rows, along with a
>> data correctness check).
>>
>> Meanwhile I will try to productize the bug test.
>>
>> Would appreciate clarification on test issue #3. And a review of the PR,
>> if done, would be appreciated much more.
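To make the window in 2(c) and 2(d) above concrete, here is a minimal,
self-contained Scala sketch. The class and method names are invented for
illustration and this is not the actual DAGScheduler code; it only shows why
making the "no partition finished yet" check and the retry decision one
atomic step (for example with a read-write lock) closes the race.

import java.util.concurrent.locks.ReentrantReadWriteLock

// Sketch of the race window for an indeterminate result stage.
class IndeterminateResultStageSketch(numPartitions: Int) {
  private val finished = Array.fill(numPartitions)(false)
  private val lock = new ReentrantReadWriteLock()

  // Task-completion path: independent partitions may be marked concurrently,
  // so the read lock is enough here.
  def markPartitionFinished(p: Int): Unit = {
    lock.readLock().lock()
    try finished(p) = true
    finally lock.readLock().unlock()
  }

  // Fetch-failure path: the "no partition has finished yet" check and the
  // computation of the partitions to retry must happen atomically, hence the
  // write lock. Without it, a completion can slip in between the check and
  // findMissingPartitions(), so the retry covers only n-1 partitions and the
  // indeterminate stage can produce corrupt results.
  def handleFetchFailure(retryAll: Seq[Int] => Unit, abortQuery: () => Unit): Unit = {
    lock.writeLock().lock()
    try {
      if (finished.forall(done => !done)) retryAll(0 until numPartitions)
      else abortQuery() // some output may already be visible downstream
    } finally {
      lock.writeLock().unlock()
    }
  }
}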
>>
>> Santosh,
>> I believe my PR would do a full partition retry (if the fetch failure
>> happened prior to any other final task being successful; otherwise it
>> will abort the query by throwing an exception). No wrong results, I
>> believe.
>> Your review will be much appreciated.
>> Regards
>> Asif
>>
>>
>> On Thu, Feb 13, 2025 at 4:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > Have identified a race condition in the current DAGScheduler code base
>>>
>>> Does it lead to the failure of full-stage retry? Can you also share your
>>> PR so that we can take a closer look?
>>>
>>> BTW, I'm working with my colleagues on this runtime checksum idea. We
>>> have something similar in our internal infra and I think we can create a
>>> working solution shortly.
>>>
>>> On Thu, Feb 13, 2025 at 12:53 PM Santosh Pingale <
>>> santosh.ping...@adyen.com> wrote:
>>>
>>>> > Maybe we should do it at runtime: if Spark retries a shuffle stage
>>>> but the data becomes different (e.g. use checksum to check it), then
>>>> Spark should retry all the partitions of this stage.
>>>>
>>>> Having been bitten hard by this behavior in Spark recently, and after
>>>> having gone down the rabbit hole to investigate, we should definitely
>>>> try to refine and implement this approach. The non-determinism depends
>>>> on several factors, including the datatypes of the columns and the use
>>>> of repartition; it is not easy for users to reason about this behavior.
>>>>
>>>> PS: The JIRA that I found, which helped me confirm my understanding of
>>>> the problem and educate internal users and the platform team:
>>>> https://issues.apache.org/jira/browse/SPARK-38388. The checksum
>>>> approach was brought up in that JIRA too, and I feel that is the
>>>> balanced way to look at this problem.
>>>>
>>>> Kind regards
>>>> Santosh
>>>>
>>>> On Thu, Feb 13, 2025, 12:41 AM Asif Shahid <asif.sha...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> Following up on the issue with some information:
>>>>> 1) Have opened a PR to fix the isIndeterminate method in Stage, RDD,
>>>>> etc.
>>>>> 2) After tweaking the source code, I have been able to reproduce the
>>>>> data loss issue reliably, but cannot productize the test, as it
>>>>> requires a lot of inline interception and coaxing to induce the test
>>>>> failure in a single-VM unit test. Will post a diff of the source code
>>>>> patch and the bug test.
>>>>> 3) Have identified a race condition in the current DAGScheduler code
>>>>> base:
>>>>> In the current DAGScheduler code, in the function
>>>>> *handleTaskCompletion*, when a task fails, while resubmitting the
>>>>> failure as ResubmitFailedStages, prior to submitting, it checks
>>>>> whether the shuffle is indeterminate. If it is, then the stage is
>>>>> submitted ONLY IF the ResultStage has ZERO partitions processed.
>>>>> Which is right, but it is not an atomic operation up to the point
>>>>> where findMissingPartitions() is invoked on the stage. As a result it
>>>>> is possible that one of the tasks from the current attempt marks one
>>>>> of the partitions as successful, and when the new attempt is made,
>>>>> instead of treating all partitions as missing, it fetches only some
>>>>> of them.
>>>>> Will be modifying the PR to fix the race.
>>>>> Regards
>>>>> Asif
>>>>>
>>>>> On Sun, Jan 26, 2025 at 11:19 PM Asif Shahid <asif.sha...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Shouldn't it be possible to determine statically whether the output
>>>>>> will be deterministic? Expressions already have a deterministic flag.
So >>>>>> when an attribute is created from alias, it will be possible to know if >>>>>> attribute is pointing to an inDeterminate component. >>>>>> >>>>>> On Sun, Jan 26, 2025, 11:09 PM Wenchen Fan <cloud0...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> It looks like a hard problem to statically analyze the query plan >>>>>>> and decide whether a Spark stage is deterministic or not. When I added >>>>>>> RDD DeterministicLevel, I thought it was not a hard problem for the >>>>>>> callers >>>>>>> to specify it, but seems I was wrong. >>>>>>> >>>>>>> Maybe we should do it at runtime: if Spark retries a shuffle stage >>>>>>> but the data becomes different (e.g. use checksum to check it), then >>>>>>> Spark >>>>>>> should retry all the partitions of this stage. I'll look into this repro >>>>>>> after I'm back from the national holiday. >>>>>>> >>>>>>> On Mon, Jan 27, 2025 at 2:16 PM Asif Shahid <asif.sha...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> I am using below test. as a unit test though , it will pass, as >>>>>>>> to simulate executor lost in a single vm is difficult, but there is >>>>>>>> definitely a bug. >>>>>>>> Using debugger if you check, the ShuffleStage.isDeterminate is >>>>>>>> turning out to be true, though it clearly should not be. >>>>>>>> As result if you look at DagScheduler, TaskSchedulerImpl and >>>>>>>> TaskSet code, it will not retry the ShuffleStage fully, instead would >>>>>>>> only >>>>>>>> retry missing task. >>>>>>>> Which means on retry if the join used a random value which puts it >>>>>>>> in already completed shuffle task, the results will be missed. >>>>>>>> >>>>>>>>> package org.apache.spark.sql.vectorized >>>>>>>>> >>>>>>>>> import org.apache.spark.rdd.ZippedPartitionsRDD2 >>>>>>>>> import org.apache.spark.sql.{DataFrame, Encoders, QueryTest} >>>>>>>>> import org.apache.spark.sql.catalyst.expressions.Literal >>>>>>>>> import org.apache.spark.sql.execution.datasources.FileScanRDD >>>>>>>>> import org.apache.spark.sql.functions._ >>>>>>>>> import org.apache.spark.sql.internal.SQLConf >>>>>>>>> import org.apache.spark.sql.test.SharedSparkSession >>>>>>>>> import org.apache.spark.sql.types.LongType >>>>>>>>> >>>>>>>>> class BugTest extends QueryTest with SharedSparkSession { >>>>>>>>> import testImplicits._ >>>>>>>>> /* test("no retries") { >>>>>>>>> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { >>>>>>>>> val baseDf = spark.createDataset( >>>>>>>>> Seq((1L, "a"), (2L, "b"), (3L, "c"), (null, "w"), (null, >>>>>>>>> "x"), (null, "y"), (null, "z")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkLeft", "strleft") >>>>>>>>> >>>>>>>>> val leftOuter = baseDf.select( >>>>>>>>> $"strleft", when(isnull($"pkLeft"), >>>>>>>>> monotonically_increasing_id() + Literal(100)). 
>>>>>>>>> otherwise($"pkLeft").as("pkLeft")) >>>>>>>>> leftOuter.show(10000) >>>>>>>>> >>>>>>>>> val innerRight = spark.createDataset( >>>>>>>>> Seq((1L, "11"), (2L, "22"), (3L, "33")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkRight", "strright") >>>>>>>>> >>>>>>>>> val innerjoin = leftOuter.join(innerRight, $"pkLeft" === >>>>>>>>> $"pkRight", "inner") >>>>>>>>> >>>>>>>>> innerjoin.show(1000) >>>>>>>>> >>>>>>>>> val outerjoin = leftOuter.join(innerRight, $"pkLeft" === >>>>>>>>> $"pkRight", "left_outer") >>>>>>>>> >>>>>>>>> outerjoin.show(1000) >>>>>>>>> } >>>>>>>>> } */ >>>>>>>>> >>>>>>>>> test("with retries") { >>>>>>>>> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { >>>>>>>>> withTable("outer", "inner") { >>>>>>>>> createBaseTables() >>>>>>>>> val outerjoin: DataFrame = getOuterJoinDF >>>>>>>>> >>>>>>>>> println("Initial data") >>>>>>>>> outerjoin.show(1000) >>>>>>>>> val correctRows = outerjoin.collect() >>>>>>>>> for( i <- 0 until 100) { >>>>>>>>> FileScanRDD.throwException = false >>>>>>>>> ZippedPartitionsRDD2.throwException = true >>>>>>>>> val rowsAfterRetry = getOuterJoinDF.collect() >>>>>>>>> import scala.jdk.CollectionConverters._ >>>>>>>>> val temp = >>>>>>>>> spark.createDataFrame(rowsAfterRetry.toSeq.asJava, outerjoin.schema) >>>>>>>>> println("after retry data") >>>>>>>>> temp.show(1000) >>>>>>>>> assert(correctRows.length == rowsAfterRetry.length) >>>>>>>>> val retriedResults = rowsAfterRetry.toBuffer >>>>>>>>> correctRows.foreach(r => { >>>>>>>>> val index = retriedResults.indexWhere(x => >>>>>>>>> r.getString(0) == x.getString(0) && >>>>>>>>> (r.getLong(1) == x.getLong(1) || (x.isNullAt(2) && >>>>>>>>> r.isNullAt(2) && >>>>>>>>> x.isNullAt(3) && r.isNullAt(3))) && >>>>>>>>> ((r.isNullAt(2) && x.isNullAt(2)) || r.getLong(2) == >>>>>>>>> x.getLong(2)) && >>>>>>>>> ((r.isNullAt(3) && x.isNullAt(3)) || r.getString(3) >>>>>>>>> == x.getString(3))) >>>>>>>>> assert(index >= 0) >>>>>>>>> retriedResults.remove(index) >>>>>>>>> } >>>>>>>>> ) >>>>>>>>> assert(retriedResults.isEmpty) >>>>>>>>> } >>>>>>>>> >>>>>>>>> // Thread.sleep(10000000) >>>>>>>>> } >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> private def createBaseTables(): Unit = { >>>>>>>>> /*val outerDf = spark.createDataset( >>>>>>>>> Seq((1L, "aa"), (null, "aa"), (2L, "bb"), (null, "bb"), (3L, >>>>>>>>> "cc"), (null, "cc")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkLeft", "strleft") >>>>>>>>> outerDf.write.format("parquet").saveAsTable("outer")*/ >>>>>>>>> >>>>>>>>> val outerDf = spark.createDataset( >>>>>>>>> Seq((1L, "aa"), (null, "aa"), (2L, "aa"), (null, "bb"), (3L, >>>>>>>>> "bb"), (null, "bb")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkLeftt", "strleft") >>>>>>>>> >>>>>>>>> outerDf.write.format("parquet").partitionBy("strleft").saveAsTable("outer") >>>>>>>>> >>>>>>>>> /*val innerDf = spark.createDataset( >>>>>>>>> Seq((1L, "11"), (2L, "22"), (3L, "33")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkRight", "strright")*/ >>>>>>>>> >>>>>>>>> val innerDf = spark.createDataset( >>>>>>>>> Seq((1L, "11"), (2L, "11"), (3L, "33")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkRight", "strright") >>>>>>>>> >>>>>>>>> >>>>>>>>> innerDf.write.format("parquet").partitionBy("strright").saveAsTable("inner") >>>>>>>>> >>>>>>>>> val innerInnerDf = spark.createDataset( >>>>>>>>> Seq((1L, "111"), (2L, "222"), (3L, 
"333")))( >>>>>>>>> Encoders.tupleEncoder(Encoders.LONG, >>>>>>>>> Encoders.STRING)).toDF("pkpkRight", "strstrright") >>>>>>>>> >>>>>>>>> // innerInnerDf.write.format("parquet").saveAsTable("innerinner") >>>>>>>>> } >>>>>>>>> >>>>>>>>> private def getOuterJoinDF = { >>>>>>>>> val leftOuter = spark.table("outer").select( >>>>>>>>> $"strleft", when(isnull($"pkLeftt"), floor(rand() * >>>>>>>>> Literal(10000000L)). >>>>>>>>> cast(LongType)). >>>>>>>>> otherwise($"pkLeftt").as("pkLeft")) >>>>>>>>> >>>>>>>>> val innerRight = spark.table("inner") >>>>>>>>> // val innerinnerRight = spark.table("innerinner") >>>>>>>>> >>>>>>>>> val outerjoin = leftOuter.hint("shuffle_hash"). >>>>>>>>> join(innerRight, $"pkLeft" === $"pkRight", "left_outer") >>>>>>>>> outerjoin >>>>>>>>> /* >>>>>>>>> val outerOuterJoin = outerjoin.hint("shuffle_hash"). >>>>>>>>> join(innerinnerRight, $"pkLeft" === $"pkpkRight", "left_outer") >>>>>>>>> outerOuterJoin >>>>>>>>> >>>>>>>>> */ >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> On Sun, Jan 26, 2025 at 10:05 PM Asif Shahid <asif.sha...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Sure. I will send prototypical query tomorrow. Though its >>>>>>>>> difficult to simulate issue using unit test , but I think the issue is >>>>>>>>> Rdd.isIndeterminate is not returning true for the query. As a >>>>>>>>> result, on retry, the shuffle stage is not reattempted fully. >>>>>>>>> And rdd is not returning inDeterminate as true , is due to >>>>>>>>> ShuffleRdd not taking into account , as the ShuffleDependency is not >>>>>>>>> taking >>>>>>>>> into account inDeterminate nature of HashPartitioner. >>>>>>>>> Moreover the attribute ref provided to HashPartitioner though is >>>>>>>>> pointing to an inDeterminate alias, is having deterministic flag as >>>>>>>>> true ( >>>>>>>>> which in itself is logically correct). >>>>>>>>> So I feel apart from ShuffleDependency code change, all >>>>>>>>> expressions should have a lazy Boolean say containsIndeterministic >>>>>>>>> component. Which will be true if the expression is indeterministic or >>>>>>>>> contains any attribute ref which has is indeterministicComponent true. >>>>>>>>> >>>>>>>>> And on personal note.. thanks for your interest.. this is very >>>>>>>>> rare attitude. >>>>>>>>> Regards >>>>>>>>> Asif >>>>>>>>> >>>>>>>>> On Sun, Jan 26, 2025, 9:45 PM Ángel < >>>>>>>>> angel.alvarez.pas...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Asif, >>>>>>>>>> >>>>>>>>>> Could you provide an example (code+dataset) to analize this? >>>>>>>>>> Looks interesting ... >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Ángel >>>>>>>>>> >>>>>>>>>> El dom, 26 ene 2025 a las 20:58, Asif Shahid (< >>>>>>>>>> asif.sha...@gmail.com>) escribió: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> On further thoughts, I concur that leaf expressions like >>>>>>>>>>> AttributeRefs can always be considered to be deterministic, as , >>>>>>>>>>> as a java >>>>>>>>>>> variable the value contained in it per iteration is invariant ( >>>>>>>>>>> except when >>>>>>>>>>> changed by some deterministic logic). So in that sense what I said >>>>>>>>>>> in the >>>>>>>>>>> above mail as that an issue is incorrect. >>>>>>>>>>> But I think that AttributeRef should have a boolean method which >>>>>>>>>>> tells, whether the value it represents is from an indeterminate >>>>>>>>>>> source or >>>>>>>>>>> not. 
>>>>>>>>>>> Regards
>>>>>>>>>>> Asif
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 24, 2025 at 5:18 PM Asif Shahid <
>>>>>>>>>>> asif.sha...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> While testing a use case where the query had an outer join
>>>>>>>>>>>> such that the joining key of the left outer table either had a
>>>>>>>>>>>> valid value or a random value (salting to avoid skew), the
>>>>>>>>>>>> case was reported to produce incorrect results on node
>>>>>>>>>>>> failure, with retry.
>>>>>>>>>>>> On debugging the code, I have found the following, which has
>>>>>>>>>>>> left me confused as to what Spark's strategy is for
>>>>>>>>>>>> indeterministic fields.
>>>>>>>>>>>> Some serious issues are:
>>>>>>>>>>>> 1) All the leaf expressions, like AttributeReference, are
>>>>>>>>>>>> always considered deterministic. This means that if an
>>>>>>>>>>>> attribute is pointing to an Alias which itself is
>>>>>>>>>>>> indeterministic, the attribute will still be considered
>>>>>>>>>>>> deterministic.
>>>>>>>>>>>> 2) In CheckAnalysis there is code which checks whether each
>>>>>>>>>>>> operator supports indeterministic values or not. Join is not
>>>>>>>>>>>> included in the list of supported operators, but it passes
>>>>>>>>>>>> even if the joining key is pointing to an indeterministic
>>>>>>>>>>>> alias. (When I tried fixing it, I found a plethora of
>>>>>>>>>>>> operators failing, like DeserializeToObject, LocalRelation,
>>>>>>>>>>>> etc., which are not supposed to contain indeterministic
>>>>>>>>>>>> attributes, because they are not in the list of supporting
>>>>>>>>>>>> operators.)
>>>>>>>>>>>> 3) The ShuffleDependency does not check for the
>>>>>>>>>>>> indeterministic nature of the partitioner (I fixed it locally
>>>>>>>>>>>> and then realized that there is bug #1 which needs to be fixed
>>>>>>>>>>>> too).
>>>>>>>>>>>>
>>>>>>>>>>>> The code in DAGScheduler / TaskSet / TaskScheduler etc. seems
>>>>>>>>>>>> to have been written keeping in mind the indeterministic
>>>>>>>>>>>> nature of the previous and current stages, so as to re-execute
>>>>>>>>>>>> previous stages as a whole instead of just the missing tasks,
>>>>>>>>>>>> but the above 3 points do not seem to support the code of
>>>>>>>>>>>> DAGScheduler / TaskScheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Asif
>>>>>>>>>>>>
>>>>>>>>>>>>
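To illustrate the kind of propagation proposed earlier in the thread (an
attribute stays deterministic as a per-row value holder, yet remembers that
its source alias was indeterminate, via something like a lazy
containsIndeterministicComponent flag), here is a small self-contained Scala
sketch. The trait and class names are made up for illustration and are not
Spark's actual Expression API.

// Conceptual sketch only; not Spark's catalyst Expression hierarchy.
sealed trait Expr {
  def children: Seq[Expr]
  def deterministic: Boolean = true
  // True if this node, or anything it was derived from, is indeterminate.
  lazy val containsIndeterministicComponent: Boolean =
    !deterministic || children.exists(_.containsIndeterministicComponent)
}

// A non-deterministic leaf, standing in for rand()-based salting of a join key.
case class RandExpr() extends Expr {
  val children: Seq[Expr] = Nil
  override def deterministic: Boolean = false
}

case class Alias(child: Expr, name: String) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

// An attribute reference stays deterministic as a value holder, but carries a
// marker saying whether the alias it points to had an indeterminate
// component, so a partitioner built on it can flag the shuffle as
// indeterminate.
case class AttrRef(name: String, fromIndeterminateSource: Boolean) extends Expr {
  val children: Seq[Expr] = Nil
  override lazy val containsIndeterministicComponent: Boolean = fromIndeterminateSource
}

object IndeterminacyPropagationSketch extends App {
  val saltedKey = Alias(RandExpr(), "pkLeft")
  val ref = AttrRef("pkLeft", saltedKey.containsIndeterministicComponent)
  assert(ref.deterministic)                    // the value itself is stable once computed
  assert(ref.containsIndeterministicComponent) // but its source is not reproducible on retry
}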