Re: Is RDD thread safe?
Sorry I did't describe clearly, RDD id itself is thread-safe, how about cached data? See codes from BlockManager def getOrElseUpdate(...) = { get[T](blockId)(classTag) match { case ... case _ => // 1. no data is cached. // Need to compute the block } // Initially we hold no locks on this block doPutIterator(...) match{..} } Considering two DAGs (contain the same cached RDD ) runs simultaneously, if both returns none when they get same block from BlockManager(i.e. #1 above), then I guess the same data would be cached twice. If the later cache could override the previous data, and no memory is waste, then this is OK Thanks Chang Weichen Xu 于2019年11月25日周一 上午11:52写道: > Rdd id is immutable and when rdd object created, the rdd id is generated. > So why there is race condition in "rdd id" ? > > On Mon, Nov 25, 2019 at 11:31 AM Chang Chen wrote: > >> I am wonder the concurrent semantics for reason about the correctness. If >> the two query simultaneously run the DAGs which use the same cached >> DF\RDD,but before cache data actually happen, what will happen? >> >> By looking into code a litter, I suspect they have different BlockID for >> same Dataset which is unexpected behavior, but there is no race condition. >> >> However RDD id is not lazy, so there is race condition. >> >> Thanks >> Chang >> >> >> Weichen Xu 于2019年11月12日周二 下午1:22写道: >> >>> Hi Chang, >>> >>> RDD/Dataframe is immutable and lazy computed. They are thread safe. >>> >>> Thanks! >>> >>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen >>> wrote: >>> Hi all I meet a case where I need cache a source RDD, and then create different DataFrame from it in different threads to accelerate query. I know that SparkSession is thread safe( https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure whether RDD is thread safe or not Thanks >>>
Re: Is RDD thread safe?
Rdd id is immutable and when rdd object created, the rdd id is generated. So why there is race condition in "rdd id" ? On Mon, Nov 25, 2019 at 11:31 AM Chang Chen wrote: > I am wonder the concurrent semantics for reason about the correctness. If > the two query simultaneously run the DAGs which use the same cached > DF\RDD,but before cache data actually happen, what will happen? > > By looking into code a litter, I suspect they have different BlockID for > same Dataset which is unexpected behavior, but there is no race condition. > > However RDD id is not lazy, so there is race condition. > > Thanks > Chang > > > Weichen Xu 于2019年11月12日周二 下午1:22写道: > >> Hi Chang, >> >> RDD/Dataframe is immutable and lazy computed. They are thread safe. >> >> Thanks! >> >> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen wrote: >> >>> Hi all >>> >>> I meet a case where I need cache a source RDD, and then create different >>> DataFrame from it in different threads to accelerate query. >>> >>> I know that SparkSession is thread safe( >>> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure >>> whether RDD is thread safe or not >>> >>> Thanks >>> >>
Re: Is RDD thread safe?
I am wonder the concurrent semantics for reason about the correctness. If the two query simultaneously run the DAGs which use the same cached DF\RDD,but before cache data actually happen, what will happen? By looking into code a litter, I suspect they have different BlockID for same Dataset which is unexpected behavior, but there is no race condition. However RDD id is not lazy, so there is race condition. Thanks Chang Weichen Xu 于2019年11月12日周二 下午1:22写道: > Hi Chang, > > RDD/Dataframe is immutable and lazy computed. They are thread safe. > > Thanks! > > On Tue, Nov 12, 2019 at 12:31 PM Chang Chen wrote: > >> Hi all >> >> I meet a case where I need cache a source RDD, and then create different >> DataFrame from it in different threads to accelerate query. >> >> I know that SparkSession is thread safe( >> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure >> whether RDD is thread safe or not >> >> Thanks >> >