Sorry I did't describe clearly, RDD id itself is thread-safe, how about cached data?
See codes from BlockManager def getOrElseUpdate(...) = { get[T](blockId)(classTag) match { case ... case _ => // 1. no data is cached. // Need to compute the block } // Initially we hold no locks on this block doPutIterator(...) match{..} } Considering two DAGs (contain the same cached RDD ) runs simultaneously, if both returns none when they get same block from BlockManager(i.e. #1 above), then I guess the same data would be cached twice. If the later cache could override the previous data, and no memory is waste, then this is OK Thanks Chang Weichen Xu <weichen...@databricks.com> 于2019年11月25日周一 上午11:52写道: > Rdd id is immutable and when rdd object created, the rdd id is generated. > So why there is race condition in "rdd id" ? > > On Mon, Nov 25, 2019 at 11:31 AM Chang Chen <baibaic...@gmail.com> wrote: > >> I am wonder the concurrent semantics for reason about the correctness. If >> the two query simultaneously run the DAGs which use the same cached >> DF\RDD,but before cache data actually happen, what will happen? >> >> By looking into code a litter, I suspect they have different BlockID for >> same Dataset which is unexpected behavior, but there is no race condition. >> >> However RDD id is not lazy, so there is race condition. >> >> Thanks >> Chang >> >> >> Weichen Xu <weichen...@databricks.com> 于2019年11月12日周二 下午1:22写道: >> >>> Hi Chang, >>> >>> RDD/Dataframe is immutable and lazy computed. They are thread safe. >>> >>> Thanks! >>> >>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen <baibaic...@gmail.com> >>> wrote: >>> >>>> Hi all >>>> >>>> I meet a case where I need cache a source RDD, and then create >>>> different DataFrame from it in different threads to accelerate query. >>>> >>>> I know that SparkSession is thread safe( >>>> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure >>>> whether RDD is thread safe or not >>>> >>>> Thanks >>>> >>>