Re: Is RDD thread safe?

2019-11-24 Thread Chang Chen
Sorry I did't describe clearly,  RDD id itself is thread-safe, how about
cached data?

See codes from BlockManager

def getOrElseUpdate(...)   = {
  get[T](blockId)(classTag) match {
   case ...
   case _ =>  // 1. no data is cached.
// Need to compute the block
 }
 // Initially we hold no locks on this block
 doPutIterator(...) match{..}
}

Considering  two DAGs (contain the same cached RDD ) runs simultaneously,
if both returns none  when they get same block from BlockManager(i.e. #1
above), then I guess the same data would be cached twice.

If the later cache could override the previous data, and no memory is
waste, then this is OK

Thanks
Chang


Weichen Xu  于2019年11月25日周一 上午11:52写道:

> Rdd id is immutable and when rdd object created, the rdd id is generated.
> So why there is race condition in "rdd id" ?
>
> On Mon, Nov 25, 2019 at 11:31 AM Chang Chen  wrote:
>
>> I am wonder the concurrent semantics for reason about the correctness. If
>> the two query simultaneously run the DAGs which use the same cached
>> DF\RDD,but before cache data actually happen, what will happen?
>>
>> By looking into code a litter, I suspect they have different BlockID for
>> same Dataset which is unexpected behavior, but there is no race condition.
>>
>> However RDD id is not lazy, so there is race condition.
>>
>> Thanks
>> Chang
>>
>>
>> Weichen Xu  于2019年11月12日周二 下午1:22写道:
>>
>>> Hi Chang,
>>>
>>> RDD/Dataframe is immutable and lazy computed. They are thread safe.
>>>
>>> Thanks!
>>>
>>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen 
>>> wrote:
>>>
 Hi all

 I meet a case where I need cache a source RDD, and then create
 different DataFrame from it in different threads to accelerate query.

 I know that SparkSession is thread safe(
 https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure
 whether RDD  is thread safe or not

 Thanks

>>>


Re: Is RDD thread safe?

2019-11-24 Thread Weichen Xu
Rdd id is immutable and when rdd object created, the rdd id is generated.
So why there is race condition in "rdd id" ?

On Mon, Nov 25, 2019 at 11:31 AM Chang Chen  wrote:

> I am wonder the concurrent semantics for reason about the correctness. If
> the two query simultaneously run the DAGs which use the same cached
> DF\RDD,but before cache data actually happen, what will happen?
>
> By looking into code a litter, I suspect they have different BlockID for
> same Dataset which is unexpected behavior, but there is no race condition.
>
> However RDD id is not lazy, so there is race condition.
>
> Thanks
> Chang
>
>
> Weichen Xu  于2019年11月12日周二 下午1:22写道:
>
>> Hi Chang,
>>
>> RDD/Dataframe is immutable and lazy computed. They are thread safe.
>>
>> Thanks!
>>
>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen  wrote:
>>
>>> Hi all
>>>
>>> I meet a case where I need cache a source RDD, and then create different
>>> DataFrame from it in different threads to accelerate query.
>>>
>>> I know that SparkSession is thread safe(
>>> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure
>>> whether RDD  is thread safe or not
>>>
>>> Thanks
>>>
>>


Re: Is RDD thread safe?

2019-11-24 Thread Chang Chen
I am wonder the concurrent semantics for reason about the correctness. If
the two query simultaneously run the DAGs which use the same cached
DF\RDD,but before cache data actually happen, what will happen?

By looking into code a litter, I suspect they have different BlockID for
same Dataset which is unexpected behavior, but there is no race condition.

However RDD id is not lazy, so there is race condition.

Thanks
Chang


Weichen Xu  于2019年11月12日周二 下午1:22写道:

> Hi Chang,
>
> RDD/Dataframe is immutable and lazy computed. They are thread safe.
>
> Thanks!
>
> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen  wrote:
>
>> Hi all
>>
>> I meet a case where I need cache a source RDD, and then create different
>> DataFrame from it in different threads to accelerate query.
>>
>> I know that SparkSession is thread safe(
>> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure
>> whether RDD  is thread safe or not
>>
>> Thanks
>>
>