Re: Spark streaming and ThreadLocal
Is each partition guaranteed to execute in a single thread in a worker? Thanks N B On Fri, Jan 29, 2016 at 6:53 PM, Shixiong(Ryan) Zhuwrote: > I see. Then you should use `mapPartitions` rather than using ThreadLocal. > E.g., > > dstream.mapPartitions( iter -> > val d = new SomeClass(); > return iter.map { p => >somefunc(p, d.get()) > }; > }; ); > > > On Fri, Jan 29, 2016 at 5:29 PM, N B wrote: > >> Well won't the code in lambda execute inside multiple threads in the >> worker because it has to process many records? I would just want to have a >> single copy of SomeClass instantiated per thread rather than once per each >> record being processed. That was what triggered this thought anyways. >> >> Thanks >> NB >> >> >> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> It looks weird. Why don't you just pass "new SomeClass()" to >>> "somefunc"? You don't need to use ThreadLocal if there are no multiple >>> threads in your codes. >>> >>> On Fri, Jan 29, 2016 at 4:39 PM, N B wrote: >>> Fixed a typo in the code to avoid any confusion Please comment on the code below... dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); } }; somefunc(p, d.get()); d.remove(); return p; }; ); On Fri, Jan 29, 2016 at 4:32 PM, N B wrote: > So this use of ThreadLocal will be inside the code of a function > executing on the workers i.e. within a call from one of the lambdas. Would > it just look like this then: > > dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { > public SomeClass initialValue() { return new SomeClass(); } > }; > somefunc(p, d.get()); > d.remove(); > return p; > }; ); > > Will this make sure that all threads inside the worker clean up the > ThreadLocal once they are done with processing this task? > > Thanks > NB > > > On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Spark Streaming uses threadpools so you need to remove ThreadLocal >> when it's not used. >> >> On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: >> >>> Thanks for the response Ryan. So I would say that it is in fact the >>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as >>> the >>> thread lives. I guess my concern is around usage of threadpools and >>> whether >>> Spark streaming will internally create many threads that rotate between >>> tasks on purpose thereby holding onto ThreadLocals that may actually >>> never >>> be used again. >>> >>> Thanks >>> >>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >>> shixi...@databricks.com> wrote: >>> Of cause. If you use a ThreadLocal in a long living thread and forget to remove it, it's definitely a memory leak. On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: > Hello, > > Does anyone know if there are any potential pitfalls associated > with using ThreadLocal variables in a Spark streaming application? One > things I have seen mentioned in the context of app servers that use > thread > pools is that ThreadLocals can leak memory. Could this happen in Spark > streaming also? > > Thanks > Nikunj > > >>> >> > >>> >> >
Re: Spark streaming and ThreadLocal
Fixed a typo in the code to avoid any confusion Please comment on the code below... dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); } }; somefunc(p, d.get()); d.remove(); return p; }; ); On Fri, Jan 29, 2016 at 4:32 PM, N Bwrote: > So this use of ThreadLocal will be inside the code of a function executing > on the workers i.e. within a call from one of the lambdas. Would it just > look like this then: > > dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { > public SomeClass initialValue() { return new SomeClass(); } > }; > somefunc(p, d.get()); > d.remove(); > return p; > }; ); > > Will this make sure that all threads inside the worker clean up the > ThreadLocal once they are done with processing this task? > > Thanks > NB > > > On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Spark Streaming uses threadpools so you need to remove ThreadLocal when >> it's not used. >> >> On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: >> >>> Thanks for the response Ryan. So I would say that it is in fact the >>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as the >>> thread lives. I guess my concern is around usage of threadpools and whether >>> Spark streaming will internally create many threads that rotate between >>> tasks on purpose thereby holding onto ThreadLocals that may actually never >>> be used again. >>> >>> Thanks >>> >>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >>> shixi...@databricks.com> wrote: >>> Of cause. If you use a ThreadLocal in a long living thread and forget to remove it, it's definitely a memory leak. On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: > Hello, > > Does anyone know if there are any potential pitfalls associated with > using ThreadLocal variables in a Spark streaming application? One things I > have seen mentioned in the context of app servers that use thread pools is > that ThreadLocals can leak memory. Could this happen in Spark streaming > also? > > Thanks > Nikunj > > >>> >> >
Re: Spark streaming and ThreadLocal
So this use of ThreadLocal will be inside the code of a function executing on the workers i.e. within a call from one of the lambdas. Would it just look like this then: dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); } }; somefunc(p, d.get()); d.remove(); return p; }; ); Will this make sure that all threads inside the worker clean up the ThreadLocal once they are done with processing this task? Thanks NB On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhuwrote: > Spark Streaming uses threadpools so you need to remove ThreadLocal when > it's not used. > > On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: > >> Thanks for the response Ryan. So I would say that it is in fact the >> purpose of a ThreadLocal i.e. to have a copy of the variable as long as the >> thread lives. I guess my concern is around usage of threadpools and whether >> Spark streaming will internally create many threads that rotate between >> tasks on purpose thereby holding onto ThreadLocals that may actually never >> be used again. >> >> Thanks >> >> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> Of cause. If you use a ThreadLocal in a long living thread and forget to >>> remove it, it's definitely a memory leak. >>> >>> On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: >>> Hello, Does anyone know if there are any potential pitfalls associated with using ThreadLocal variables in a Spark streaming application? One things I have seen mentioned in the context of app servers that use thread pools is that ThreadLocals can leak memory. Could this happen in Spark streaming also? Thanks Nikunj >>> >> >
Re: Spark streaming and ThreadLocal
Well won't the code in lambda execute inside multiple threads in the worker because it has to process many records? I would just want to have a single copy of SomeClass instantiated per thread rather than once per each record being processed. That was what triggered this thought anyways. Thanks NB On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhuwrote: > It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? > You don't need to use ThreadLocal if there are no multiple threads in your > codes. > > On Fri, Jan 29, 2016 at 4:39 PM, N B wrote: > >> Fixed a typo in the code to avoid any confusion Please comment on the >> code below... >> >> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { >> public SomeClass initialValue() { return new SomeClass(); } >> }; >> somefunc(p, d.get()); >> d.remove(); >> return p; >> }; ); >> >> On Fri, Jan 29, 2016 at 4:32 PM, N B wrote: >> >>> So this use of ThreadLocal will be inside the code of a function >>> executing on the workers i.e. within a call from one of the lambdas. Would >>> it just look like this then: >>> >>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { >>> public SomeClass initialValue() { return new SomeClass(); } >>> }; >>> somefunc(p, d.get()); >>> d.remove(); >>> return p; >>> }; ); >>> >>> Will this make sure that all threads inside the worker clean up the >>> ThreadLocal once they are done with processing this task? >>> >>> Thanks >>> NB >>> >>> >>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < >>> shixi...@databricks.com> wrote: >>> Spark Streaming uses threadpools so you need to remove ThreadLocal when it's not used. On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: > Thanks for the response Ryan. So I would say that it is in fact the > purpose of a ThreadLocal i.e. to have a copy of the variable as long as > the > thread lives. I guess my concern is around usage of threadpools and > whether > Spark streaming will internally create many threads that rotate between > tasks on purpose thereby holding onto ThreadLocals that may actually never > be used again. > > Thanks > > On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Of cause. If you use a ThreadLocal in a long living thread and forget >> to remove it, it's definitely a memory leak. >> >> On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: >> >>> Hello, >>> >>> Does anyone know if there are any potential pitfalls associated with >>> using ThreadLocal variables in a Spark streaming application? One >>> things I >>> have seen mentioned in the context of app servers that use thread pools >>> is >>> that ThreadLocals can leak memory. Could this happen in Spark streaming >>> also? >>> >>> Thanks >>> Nikunj >>> >>> >> > >>> >> >
Re: Spark streaming and ThreadLocal
It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? You don't need to use ThreadLocal if there are no multiple threads in your codes. On Fri, Jan 29, 2016 at 4:39 PM, N Bwrote: > Fixed a typo in the code to avoid any confusion Please comment on the > code below... > > dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { > public SomeClass initialValue() { return new SomeClass(); } > }; > somefunc(p, d.get()); > d.remove(); > return p; > }; ); > > On Fri, Jan 29, 2016 at 4:32 PM, N B wrote: > >> So this use of ThreadLocal will be inside the code of a function >> executing on the workers i.e. within a call from one of the lambdas. Would >> it just look like this then: >> >> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { >> public SomeClass initialValue() { return new SomeClass(); } >> }; >> somefunc(p, d.get()); >> d.remove(); >> return p; >> }; ); >> >> Will this make sure that all threads inside the worker clean up the >> ThreadLocal once they are done with processing this task? >> >> Thanks >> NB >> >> >> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> Spark Streaming uses threadpools so you need to remove ThreadLocal when >>> it's not used. >>> >>> On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: >>> Thanks for the response Ryan. So I would say that it is in fact the purpose of a ThreadLocal i.e. to have a copy of the variable as long as the thread lives. I guess my concern is around usage of threadpools and whether Spark streaming will internally create many threads that rotate between tasks on purpose thereby holding onto ThreadLocals that may actually never be used again. Thanks On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > Of cause. If you use a ThreadLocal in a long living thread and forget > to remove it, it's definitely a memory leak. > > On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: > >> Hello, >> >> Does anyone know if there are any potential pitfalls associated with >> using ThreadLocal variables in a Spark streaming application? One things >> I >> have seen mentioned in the context of app servers that use thread pools >> is >> that ThreadLocals can leak memory. Could this happen in Spark streaming >> also? >> >> Thanks >> Nikunj >> >> > >>> >> >
Re: Spark streaming and ThreadLocal
I see. Then you should use `mapPartitions` rather than using ThreadLocal. E.g., dstream.mapPartitions( iter -> val d = new SomeClass(); return iter.map { p => somefunc(p, d.get()) }; }; ); On Fri, Jan 29, 2016 at 5:29 PM, N Bwrote: > Well won't the code in lambda execute inside multiple threads in the > worker because it has to process many records? I would just want to have a > single copy of SomeClass instantiated per thread rather than once per each > record being processed. That was what triggered this thought anyways. > > Thanks > NB > > > On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? >> You don't need to use ThreadLocal if there are no multiple threads in your >> codes. >> >> On Fri, Jan 29, 2016 at 4:39 PM, N B wrote: >> >>> Fixed a typo in the code to avoid any confusion Please comment on >>> the code below... >>> >>> dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { >>> public SomeClass initialValue() { return new SomeClass(); } >>> }; >>> somefunc(p, d.get()); >>> d.remove(); >>> return p; >>> }; ); >>> >>> On Fri, Jan 29, 2016 at 4:32 PM, N B wrote: >>> So this use of ThreadLocal will be inside the code of a function executing on the workers i.e. within a call from one of the lambdas. Would it just look like this then: dstream.map( p -> { ThreadLocal d = new ThreadLocal<>() { public SomeClass initialValue() { return new SomeClass(); } }; somefunc(p, d.get()); d.remove(); return p; }; ); Will this make sure that all threads inside the worker clean up the ThreadLocal once they are done with processing this task? Thanks NB On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > Spark Streaming uses threadpools so you need to remove ThreadLocal > when it's not used. > > On Fri, Jan 29, 2016 at 12:55 PM, N B wrote: > >> Thanks for the response Ryan. So I would say that it is in fact the >> purpose of a ThreadLocal i.e. to have a copy of the variable as long as >> the >> thread lives. I guess my concern is around usage of threadpools and >> whether >> Spark streaming will internally create many threads that rotate between >> tasks on purpose thereby holding onto ThreadLocals that may actually >> never >> be used again. >> >> Thanks >> >> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> Of cause. If you use a ThreadLocal in a long living thread and >>> forget to remove it, it's definitely a memory leak. >>> >>> On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: >>> Hello, Does anyone know if there are any potential pitfalls associated with using ThreadLocal variables in a Spark streaming application? One things I have seen mentioned in the context of app servers that use thread pools is that ThreadLocals can leak memory. Could this happen in Spark streaming also? Thanks Nikunj >>> >> > >>> >> >
Re: Spark streaming and ThreadLocal
Of cause. If you use a ThreadLocal in a long living thread and forget to remove it, it's definitely a memory leak. On Thu, Jan 28, 2016 at 9:31 PM, N Bwrote: > Hello, > > Does anyone know if there are any potential pitfalls associated with using > ThreadLocal variables in a Spark streaming application? One things I have > seen mentioned in the context of app servers that use thread pools is that > ThreadLocals can leak memory. Could this happen in Spark streaming also? > > Thanks > Nikunj > >
Re: Spark streaming and ThreadLocal
Thanks for the response Ryan. So I would say that it is in fact the purpose of a ThreadLocal i.e. to have a copy of the variable as long as the thread lives. I guess my concern is around usage of threadpools and whether Spark streaming will internally create many threads that rotate between tasks on purpose thereby holding onto ThreadLocals that may actually never be used again. Thanks On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > Of cause. If you use a ThreadLocal in a long living thread and forget to > remove it, it's definitely a memory leak. > > On Thu, Jan 28, 2016 at 9:31 PM, N Bwrote: > >> Hello, >> >> Does anyone know if there are any potential pitfalls associated with >> using ThreadLocal variables in a Spark streaming application? One things I >> have seen mentioned in the context of app servers that use thread pools is >> that ThreadLocals can leak memory. Could this happen in Spark streaming >> also? >> >> Thanks >> Nikunj >> >> >
Re: Spark streaming and ThreadLocal
Spark Streaming uses threadpools so you need to remove ThreadLocal when it's not used. On Fri, Jan 29, 2016 at 12:55 PM, N Bwrote: > Thanks for the response Ryan. So I would say that it is in fact the > purpose of a ThreadLocal i.e. to have a copy of the variable as long as the > thread lives. I guess my concern is around usage of threadpools and whether > Spark streaming will internally create many threads that rotate between > tasks on purpose thereby holding onto ThreadLocals that may actually never > be used again. > > Thanks > > On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Of cause. If you use a ThreadLocal in a long living thread and forget to >> remove it, it's definitely a memory leak. >> >> On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: >> >>> Hello, >>> >>> Does anyone know if there are any potential pitfalls associated with >>> using ThreadLocal variables in a Spark streaming application? One things I >>> have seen mentioned in the context of app servers that use thread pools is >>> that ThreadLocals can leak memory. Could this happen in Spark streaming >>> also? >>> >>> Thanks >>> Nikunj >>> >>> >> >
Spark streaming and ThreadLocal
Hello, Does anyone know if there are any potential pitfalls associated with using ThreadLocal variables in a Spark streaming application? One things I have seen mentioned in the context of app servers that use thread pools is that ThreadLocals can leak memory. Could this happen in Spark streaming also? Thanks Nikunj