Re: Spark streaming and ThreadLocal

2016-02-01 Thread N B
Is each partition guaranteed to execute in a single thread in a worker?

Thanks
N B


On Fri, Jan 29, 2016 at 6:53 PM, Shixiong(Ryan) Zhu  wrote:

> I see. Then you should use `mapPartitions` rather than using ThreadLocal.
> E.g.,
>
> dstream.mapPartitions( iter ->
> val d = new SomeClass();
> return iter.map { p =>
>somefunc(p, d.get())
> };
> }; );
>
>
> On Fri, Jan 29, 2016 at 5:29 PM, N B  wrote:
>
>> Well won't the code in lambda execute inside multiple threads in the
>> worker because it has to process many records? I would just want to have a
>> single copy of SomeClass instantiated per thread rather than once per each
>> record being processed. That was what triggered this thought anyways.
>>
>> Thanks
>> NB
>>
>>
>> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> It looks weird. Why don't you just pass "new SomeClass()" to
>>> "somefunc"? You don't need to use ThreadLocal if there are no multiple
>>> threads in your codes.
>>>
>>> On Fri, Jan 29, 2016 at 4:39 PM, N B  wrote:
>>>
 Fixed a typo in the code to avoid any confusion Please comment on
 the code below...

 dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
  public SomeClass initialValue() { return new SomeClass(); }
 };
 somefunc(p, d.get());
 d.remove();
 return p;
 }; );

 On Fri, Jan 29, 2016 at 4:32 PM, N B  wrote:

> So this use of ThreadLocal will be inside the code of a function
> executing on the workers i.e. within a call from one of the lambdas. Would
> it just look like this then:
>
> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>  public SomeClass initialValue() { return new SomeClass(); }
> };
> somefunc(p, d.get());
> d.remove();
> return p;
> }; );
>
> Will this make sure that all threads inside the worker clean up the
> ThreadLocal once they are done with processing this task?
>
> Thanks
> NB
>
>
> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Spark Streaming uses threadpools so you need to remove ThreadLocal
>> when it's not used.
>>
>> On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:
>>
>>> Thanks for the response Ryan. So I would say that it is in fact the
>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as 
>>> the
>>> thread lives. I guess my concern is around usage of threadpools and 
>>> whether
>>> Spark streaming will internally create many threads that rotate between
>>> tasks on purpose thereby holding onto ThreadLocals that may actually 
>>> never
>>> be used again.
>>>
>>> Thanks
>>>
>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Of cause. If you use a ThreadLocal in a long living thread and
 forget to remove it, it's definitely a memory leak.

 On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:

> Hello,
>
> Does anyone know if there are any potential pitfalls associated
> with using ThreadLocal variables in a Spark streaming application? One
> things I have seen mentioned in the context of app servers that use 
> thread
> pools is that ThreadLocals can leak memory. Could this happen in Spark
> streaming also?
>
> Thanks
> Nikunj
>
>

>>>
>>
>

>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Fixed a typo in the code to avoid any confusion Please comment on the
code below...

dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
 public SomeClass initialValue() { return new SomeClass(); }
};
somefunc(p, d.get());
d.remove();
return p;
}; );

On Fri, Jan 29, 2016 at 4:32 PM, N B  wrote:

> So this use of ThreadLocal will be inside the code of a function executing
> on the workers i.e. within a call from one of the lambdas. Would it just
> look like this then:
>
> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>  public SomeClass initialValue() { return new SomeClass(); }
> };
> somefunc(p, d.get());
> d.remove();
> return p;
> }; );
>
> Will this make sure that all threads inside the worker clean up the
> ThreadLocal once they are done with processing this task?
>
> Thanks
> NB
>
>
> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Spark Streaming uses threadpools so you need to remove ThreadLocal when
>> it's not used.
>>
>> On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:
>>
>>> Thanks for the response Ryan. So I would say that it is in fact the
>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as the
>>> thread lives. I guess my concern is around usage of threadpools and whether
>>> Spark streaming will internally create many threads that rotate between
>>> tasks on purpose thereby holding onto ThreadLocals that may actually never
>>> be used again.
>>>
>>> Thanks
>>>
>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Of cause. If you use a ThreadLocal in a long living thread and forget
 to remove it, it's definitely a memory leak.

 On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:

> Hello,
>
> Does anyone know if there are any potential pitfalls associated with
> using ThreadLocal variables in a Spark streaming application? One things I
> have seen mentioned in the context of app servers that use thread pools is
> that ThreadLocals can leak memory. Could this happen in Spark streaming
> also?
>
> Thanks
> Nikunj
>
>

>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
So this use of ThreadLocal will be inside the code of a function executing
on the workers i.e. within a call from one of the lambdas. Would it just
look like this then:

dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
 public SomeClass initialValue() { return new SomeClass(); }
};
somefunc(p, d.get());
d.remove();
return p;
}; );

Will this make sure that all threads inside the worker clean up the
ThreadLocal once they are done with processing this task?

Thanks
NB


On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu  wrote:

> Spark Streaming uses threadpools so you need to remove ThreadLocal when
> it's not used.
>
> On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:
>
>> Thanks for the response Ryan. So I would say that it is in fact the
>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as the
>> thread lives. I guess my concern is around usage of threadpools and whether
>> Spark streaming will internally create many threads that rotate between
>> tasks on purpose thereby holding onto ThreadLocals that may actually never
>> be used again.
>>
>> Thanks
>>
>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Of cause. If you use a ThreadLocal in a long living thread and forget to
>>> remove it, it's definitely a memory leak.
>>>
>>> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>>>
 Hello,

 Does anyone know if there are any potential pitfalls associated with
 using ThreadLocal variables in a Spark streaming application? One things I
 have seen mentioned in the context of app servers that use thread pools is
 that ThreadLocals can leak memory. Could this happen in Spark streaming
 also?

 Thanks
 Nikunj


>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Well won't the code in lambda execute inside multiple threads in the worker
because it has to process many records? I would just want to have a single
copy of SomeClass instantiated per thread rather than once per each record
being processed. That was what triggered this thought anyways.

Thanks
NB


On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu  wrote:

> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"?
> You don't need to use ThreadLocal if there are no multiple threads in your
> codes.
>
> On Fri, Jan 29, 2016 at 4:39 PM, N B  wrote:
>
>> Fixed a typo in the code to avoid any confusion Please comment on the
>> code below...
>>
>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>>  public SomeClass initialValue() { return new SomeClass(); }
>> };
>> somefunc(p, d.get());
>> d.remove();
>> return p;
>> }; );
>>
>> On Fri, Jan 29, 2016 at 4:32 PM, N B  wrote:
>>
>>> So this use of ThreadLocal will be inside the code of a function
>>> executing on the workers i.e. within a call from one of the lambdas. Would
>>> it just look like this then:
>>>
>>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>>>  public SomeClass initialValue() { return new SomeClass(); }
>>> };
>>> somefunc(p, d.get());
>>> d.remove();
>>> return p;
>>> }; );
>>>
>>> Will this make sure that all threads inside the worker clean up the
>>> ThreadLocal once they are done with processing this task?
>>>
>>> Thanks
>>> NB
>>>
>>>
>>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Spark Streaming uses threadpools so you need to remove ThreadLocal when
 it's not used.

 On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:

> Thanks for the response Ryan. So I would say that it is in fact the
> purpose of a ThreadLocal i.e. to have a copy of the variable as long as 
> the
> thread lives. I guess my concern is around usage of threadpools and 
> whether
> Spark streaming will internally create many threads that rotate between
> tasks on purpose thereby holding onto ThreadLocals that may actually never
> be used again.
>
> Thanks
>
> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Of cause. If you use a ThreadLocal in a long living thread and forget
>> to remove it, it's definitely a memory leak.
>>
>> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>>
>>> Hello,
>>>
>>> Does anyone know if there are any potential pitfalls associated with
>>> using ThreadLocal variables in a Spark streaming application? One 
>>> things I
>>> have seen mentioned in the context of app servers that use thread pools 
>>> is
>>> that ThreadLocals can leak memory. Could this happen in Spark streaming
>>> also?
>>>
>>> Thanks
>>> Nikunj
>>>
>>>
>>
>

>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"?
You don't need to use ThreadLocal if there are no multiple threads in your
codes.

On Fri, Jan 29, 2016 at 4:39 PM, N B  wrote:

> Fixed a typo in the code to avoid any confusion Please comment on the
> code below...
>
> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>  public SomeClass initialValue() { return new SomeClass(); }
> };
> somefunc(p, d.get());
> d.remove();
> return p;
> }; );
>
> On Fri, Jan 29, 2016 at 4:32 PM, N B  wrote:
>
>> So this use of ThreadLocal will be inside the code of a function
>> executing on the workers i.e. within a call from one of the lambdas. Would
>> it just look like this then:
>>
>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>>  public SomeClass initialValue() { return new SomeClass(); }
>> };
>> somefunc(p, d.get());
>> d.remove();
>> return p;
>> }; );
>>
>> Will this make sure that all threads inside the worker clean up the
>> ThreadLocal once they are done with processing this task?
>>
>> Thanks
>> NB
>>
>>
>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Spark Streaming uses threadpools so you need to remove ThreadLocal when
>>> it's not used.
>>>
>>> On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:
>>>
 Thanks for the response Ryan. So I would say that it is in fact the
 purpose of a ThreadLocal i.e. to have a copy of the variable as long as the
 thread lives. I guess my concern is around usage of threadpools and whether
 Spark streaming will internally create many threads that rotate between
 tasks on purpose thereby holding onto ThreadLocals that may actually never
 be used again.

 Thanks

 On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> Of cause. If you use a ThreadLocal in a long living thread and forget
> to remove it, it's definitely a memory leak.
>
> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>
>> Hello,
>>
>> Does anyone know if there are any potential pitfalls associated with
>> using ThreadLocal variables in a Spark streaming application? One things 
>> I
>> have seen mentioned in the context of app servers that use thread pools 
>> is
>> that ThreadLocals can leak memory. Could this happen in Spark streaming
>> also?
>>
>> Thanks
>> Nikunj
>>
>>
>

>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
I see. Then you should use `mapPartitions` rather than using ThreadLocal.
E.g.,

dstream.mapPartitions( iter ->
val d = new SomeClass();
return iter.map { p =>
   somefunc(p, d.get())
};
}; );


On Fri, Jan 29, 2016 at 5:29 PM, N B  wrote:

> Well won't the code in lambda execute inside multiple threads in the
> worker because it has to process many records? I would just want to have a
> single copy of SomeClass instantiated per thread rather than once per each
> record being processed. That was what triggered this thought anyways.
>
> Thanks
> NB
>
>
> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"?
>> You don't need to use ThreadLocal if there are no multiple threads in your
>> codes.
>>
>> On Fri, Jan 29, 2016 at 4:39 PM, N B  wrote:
>>
>>> Fixed a typo in the code to avoid any confusion Please comment on
>>> the code below...
>>>
>>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
>>>  public SomeClass initialValue() { return new SomeClass(); }
>>> };
>>> somefunc(p, d.get());
>>> d.remove();
>>> return p;
>>> }; );
>>>
>>> On Fri, Jan 29, 2016 at 4:32 PM, N B  wrote:
>>>
 So this use of ThreadLocal will be inside the code of a function
 executing on the workers i.e. within a call from one of the lambdas. Would
 it just look like this then:

 dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() {
  public SomeClass initialValue() { return new SomeClass(); }
 };
 somefunc(p, d.get());
 d.remove();
 return p;
 }; );

 Will this make sure that all threads inside the worker clean up the
 ThreadLocal once they are done with processing this task?

 Thanks
 NB


 On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> Spark Streaming uses threadpools so you need to remove ThreadLocal
> when it's not used.
>
> On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:
>
>> Thanks for the response Ryan. So I would say that it is in fact the
>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as 
>> the
>> thread lives. I guess my concern is around usage of threadpools and 
>> whether
>> Spark streaming will internally create many threads that rotate between
>> tasks on purpose thereby holding onto ThreadLocals that may actually 
>> never
>> be used again.
>>
>> Thanks
>>
>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Of cause. If you use a ThreadLocal in a long living thread and
>>> forget to remove it, it's definitely a memory leak.
>>>
>>> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>>>
 Hello,

 Does anyone know if there are any potential pitfalls associated
 with using ThreadLocal variables in a Spark streaming application? One
 things I have seen mentioned in the context of app servers that use 
 thread
 pools is that ThreadLocals can leak memory. Could this happen in Spark
 streaming also?

 Thanks
 Nikunj


>>>
>>
>

>>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
Of cause. If you use a ThreadLocal in a long living thread and forget to
remove it, it's definitely a memory leak.

On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:

> Hello,
>
> Does anyone know if there are any potential pitfalls associated with using
> ThreadLocal variables in a Spark streaming application? One things I have
> seen mentioned in the context of app servers that use thread pools is that
> ThreadLocals can leak memory. Could this happen in Spark streaming also?
>
> Thanks
> Nikunj
>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread N B
Thanks for the response Ryan. So I would say that it is in fact the purpose
of a ThreadLocal i.e. to have a copy of the variable as long as the thread
lives. I guess my concern is around usage of threadpools and whether Spark
streaming will internally create many threads that rotate between tasks on
purpose thereby holding onto ThreadLocals that may actually never be used
again.

Thanks

On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> Of cause. If you use a ThreadLocal in a long living thread and forget to
> remove it, it's definitely a memory leak.
>
> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>
>> Hello,
>>
>> Does anyone know if there are any potential pitfalls associated with
>> using ThreadLocal variables in a Spark streaming application? One things I
>> have seen mentioned in the context of app servers that use thread pools is
>> that ThreadLocals can leak memory. Could this happen in Spark streaming
>> also?
>>
>> Thanks
>> Nikunj
>>
>>
>


Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
Spark Streaming uses threadpools so you need to remove ThreadLocal when
it's not used.

On Fri, Jan 29, 2016 at 12:55 PM, N B  wrote:

> Thanks for the response Ryan. So I would say that it is in fact the
> purpose of a ThreadLocal i.e. to have a copy of the variable as long as the
> thread lives. I guess my concern is around usage of threadpools and whether
> Spark streaming will internally create many threads that rotate between
> tasks on purpose thereby holding onto ThreadLocals that may actually never
> be used again.
>
> Thanks
>
> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Of cause. If you use a ThreadLocal in a long living thread and forget to
>> remove it, it's definitely a memory leak.
>>
>> On Thu, Jan 28, 2016 at 9:31 PM, N B  wrote:
>>
>>> Hello,
>>>
>>> Does anyone know if there are any potential pitfalls associated with
>>> using ThreadLocal variables in a Spark streaming application? One things I
>>> have seen mentioned in the context of app servers that use thread pools is
>>> that ThreadLocals can leak memory. Could this happen in Spark streaming
>>> also?
>>>
>>> Thanks
>>> Nikunj
>>>
>>>
>>
>


Spark streaming and ThreadLocal

2016-01-28 Thread N B
Hello,

Does anyone know if there are any potential pitfalls associated with using
ThreadLocal variables in a Spark streaming application? One things I have
seen mentioned in the context of app servers that use thread pools is that
ThreadLocals can leak memory. Could this happen in Spark streaming also?

Thanks
Nikunj