Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Mich Talebzadeh
Indeed. Apologies for going on a tangent.



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 5 Feb 2022 at 01:46, Maksim Grinman  wrote:

> Not that this discussion is not interesting (it is), but this has strayed
> pretty far from my original question. Which was: How do I prevent spark
> from dumping huge Java Full Thread dumps when an executor appears to not be
> doing anything (in my case, there's a loop where it sleeps waiting for a
> service to come up). The service happens to be set up using an auto-scaling
> group, a coincidental and unimportant detail that seems to have derailed
> the conversation.
>
> On Fri, Feb 4, 2022 at 7:18 PM Mich Talebzadeh 
> wrote:
>
>> OK basically, do we have a scenario where Spark or for that matter any
>> cluster manager can deploy a new node (after the loss of  an existing node)
>> with the view of running the failed tasks on the new executor(s) deployed
>> on that newly spun node?
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 5 Feb 2022 at 00:00, Holden Karau  wrote:
>>
>>> We don’t block scaling up after node failure in classic Spark if that’s
>>> the question.
>>>
>>> On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 From what I can see in auto scaling setup, you will always need a min
 of two worker nodes as primary. It also states and I quote "Scaling
 primary workers is not recommended due to HDFS limitations which result in
 instability while scaling. These limitations do not exist for secondary
 workers". So the scaling comes with the secondary workers specifying the
 min and max instances. It also defaults to 2 minutes for the so-called auto
 scaling cooldown duration hence that delay observed. I presume task
 allocation to the new executors is FIFO for new tasks. This link
 
 does some explanation on autoscaling.

 Handling Spot Node Loss and Spot Blocks in Spark Clusters
 "When the Spark AM receives the spot loss (Spot Node Loss or Spot
 Blocks) notification from the RM, it notifies the Spark driver. The driver
 then performs the following actions:

   1. Identifies all the executors affected by the upcoming node loss.
   2. Moves all of the affected executors to a decommissioning state, and
   no new tasks are scheduled on these executors.
   3. Kills all the executors after reaching 50% of the termination time.
   4. *Starts the failed tasks (if any) on other executors.*
   5. For these nodes, it removes all the entries of the shuffle data from
   the map output tracker on driver after reaching 90% of the termination
   time. This helps in preventing the shuffle-fetch failures due to spot loss.
   6. Recomputes the shuffle data from the lost node by stage resubmission
   and at the time shuffles data of spot node if required."

 So basically when a node fails classic spark comes into play and no new
 nodes are added etc (no rescaling) and tasks are redistributed among the
 existing executors as I read it?

view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:

> I have not seen stack traces under autoscaling, so not even sure what
> the error in question is.
> There is always delay in acquiring a whole new executor in the cloud
> as it usually means a new VM is provisioned.

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Maksim Grinman
Not that this discussion is not interesting (it is), but it has strayed
pretty far from my original question, which was: how do I prevent Spark
from dumping huge Java full thread dumps when an executor appears not to be
doing anything (in my case, there's a loop where it sleeps waiting for a
service to come up)? The service happens to be set up using an auto-scaling
group, a coincidental and unimportant detail that seems to have derailed
the conversation.
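
Purely as an illustration of the wait pattern being described, a minimal sketch
(all names here are hypothetical, not taken from the actual job):

def waitForService(isServiceUp: () => Boolean, pollMillis: Long = 30000L): Unit = {
  // The thread simply sleeps between polls, so it looks idle to Spark even
  // though the job is behaving exactly as intended.
  while (!isServiceUp()) {
    Thread.sleep(pollMillis)
  }
}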

On Fri, Feb 4, 2022 at 7:18 PM Mich Talebzadeh 
wrote:

> OK basically, do we have a scenario where Spark or for that matter any
> cluster manager can deploy a new node (after the loss of  an existing node)
> with the view of running the failed tasks on the new executor(s) deployed
> on that newly spun node?
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 5 Feb 2022 at 00:00, Holden Karau  wrote:
>
>> We don’t block scaling up after node failure in classic Spark if that’s
>> the question.
>>
>> On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh 
>> wrote:
>>
>>> From what I can see in auto scaling setup, you will always need a min of
>>> two worker nodes as primary. It also states and I quote "Scaling
>>> primary workers is not recommended due to HDFS limitations which result in
>>> instability while scaling. These limitations do not exist for secondary
>>> workers". So the scaling comes with the secondary workers specifying the
>>> min and max instances. It also defaults to 2 minutes for the so-called auto
>>> scaling cooldown duration hence that delay observed. I presume task
>>> allocation to the new executors is FIFO for new tasks. This link
>>> 
>>> does some explanation on autoscaling.
>>>
>>> Handling Spot Node Loss and Spot Blocks in Spark Clusters
>>> "When the Spark AM receives the spot loss (Spot Node Loss or Spot
>>> Blocks) notification from the RM, it notifies the Spark driver. The driver
>>> then performs the following actions:
>>>
>>>    1. Identifies all the executors affected by the upcoming node loss.
>>>    2. Moves all of the affected executors to a decommissioning state, and
>>>    no new tasks are scheduled on these executors.
>>>    3. Kills all the executors after reaching 50% of the termination time.
>>>    4. *Starts the failed tasks (if any) on other executors.*
>>>    5. For these nodes, it removes all the entries of the shuffle data from
>>>    the map output tracker on driver after reaching 90% of the termination
>>>    time. This helps in preventing the shuffle-fetch failures due to spot loss.
>>>    6. Recomputes the shuffle data from the lost node by stage resubmission
>>>    and at the time shuffles data of spot node if required."
>>>
>>> So basically when a node fails classic spark comes into play and no new
>>> nodes are added etc (no rescaling) and tasks are redistributed among the
>>> existing executors as I read it?
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:
>>>
 I have not seen stack traces under autoscaling, so not even sure what
 the error in question is.
 There is always delay in acquiring a whole new executor in the cloud as
 it usually means a new VM is provisioned.
 Spark treats the new executor like any other, available for executing
 tasks.

 On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Thanks for the info.
>
> My concern has always been on how Spark handles autoscaling (adding
> new executors) when the load pattern changes.I have tried to test this 
> with
> setting the following parameters (Spark 3.1.2 on GCP)
>
> spark-submit --verbose \
> ...
>   --conf spark.dynamicAllocation.enabled="true" \
>--conf 

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Mich Talebzadeh
OK, basically: do we have a scenario where Spark, or for that matter any
cluster manager, can deploy a new node (after the loss of an existing node)
with a view to running the failed tasks on the new executor(s) deployed on
that newly spun-up node?



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 5 Feb 2022 at 00:00, Holden Karau  wrote:

> We don’t block scaling up after node failure in classic Spark if that’s
> the question.
>
> On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh 
> wrote:
>
>> From what I can see in auto scaling setup, you will always need a min of
>> two worker nodes as primary. It also states and I quote "Scaling primary
>> workers is not recommended due to HDFS limitations which result in
>> instability while scaling. These limitations do not exist for secondary
>> workers". So the scaling comes with the secondary workers specifying the
>> min and max instances. It also defaults to 2 minutes for the so-called auto
>> scaling cooldown duration hence that delay observed. I presume task
>> allocation to the new executors is FIFO for new tasks. This link
>> 
>> does some explanation on autoscaling.
>>
>> Handling Spot Node Loss and Spot Blocks in Spark Clusters
>> "When the Spark AM receives the spot loss (Spot Node Loss or Spot
>> Blocks) notification from the RM, it notifies the Spark driver. The driver
>> then performs the following actions:
>>
>>    1. Identifies all the executors affected by the upcoming node loss.
>>    2. Moves all of the affected executors to a decommissioning state, and
>>    no new tasks are scheduled on these executors.
>>    3. Kills all the executors after reaching 50% of the termination time.
>>    4. *Starts the failed tasks (if any) on other executors.*
>>    5. For these nodes, it removes all the entries of the shuffle data from
>>    the map output tracker on driver after reaching 90% of the termination
>>    time. This helps in preventing the shuffle-fetch failures due to spot loss.
>>    6. Recomputes the shuffle data from the lost node by stage resubmission
>>    and at the time shuffles data of spot node if required."
>>
>> So basically when a node fails classic spark comes into play and no new
>> nodes are added etc (no rescaling) and tasks are redistributed among the
>> existing executors as I read it?
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:
>>
>>> I have not seen stack traces under autoscaling, so not even sure what
>>> the error in question is.
>>> There is always delay in acquiring a whole new executor in the cloud as
>>> it usually means a new VM is provisioned.
>>> Spark treats the new executor like any other, available for executing
>>> tasks.
>>>
>>> On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks for the info.

 My concern has always been on how Spark handles autoscaling (adding new
 executors) when the load pattern changes.I have tried to test this with
 setting the following parameters (Spark 3.1.2 on GCP)

 spark-submit --verbose \
 ...
   --conf spark.dynamicAllocation.enabled="true" \
--conf spark.shuffle.service.enabled="true" \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=10 \
--conf spark.dynamicAllocation.initialExecutors=4 \

 It is not very clear to me how Spark distributes tasks on the added
 executors and the source of delay. As you have observed there is a delay in
 adding new resources and allocating tasks. If that process is efficient?

 Thanks

view my Linkedin profile
 




Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Holden Karau
We don’t block scaling up after node failure in classic Spark if that’s the
question.

On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh 
wrote:

> From what I can see in auto scaling setup, you will always need a min of
> two worker nodes as primary. It also states and I quote "Scaling primary
> workers is not recommended due to HDFS limitations which result in
> instability while scaling. These limitations do not exist for secondary
> workers". So the scaling comes with the secondary workers specifying the
> min and max instances. It also defaults to 2 minutes for the so-called auto
> scaling cooldown duration hence that delay observed. I presume task
> allocation to the new executors is FIFO for new tasks. This link
> 
> does some explanation on autoscaling.
>
> Handling Spot Node Loss and Spot Blocks in Spark Clusters
> "When the Spark AM receives the spot loss (Spot Node Loss or Spot Blocks)
> notification from the RM, it notifies the Spark driver. The driver then
> performs the following actions:
>
>    1. Identifies all the executors affected by the upcoming node loss.
>    2. Moves all of the affected executors to a decommissioning state, and
>    no new tasks are scheduled on these executors.
>    3. Kills all the executors after reaching 50% of the termination time.
>    4. *Starts the failed tasks (if any) on other executors.*
>    5. For these nodes, it removes all the entries of the shuffle data from
>    the map output tracker on driver after reaching 90% of the termination
>    time. This helps in preventing the shuffle-fetch failures due to spot loss.
>    6. Recomputes the shuffle data from the lost node by stage resubmission
>    and at the time shuffles data of spot node if required."
>
> So basically when a node fails classic spark comes into play and no new
> nodes are added etc (no rescaling) and tasks are redistributed among the
> existing executors as I read it?
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:
>
>> I have not seen stack traces under autoscaling, so not even sure what the
>> error in question is.
>> There is always delay in acquiring a whole new executor in the cloud as
>> it usually means a new VM is provisioned.
>> Spark treats the new executor like any other, available for executing
>> tasks.
>>
>> On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh 
>> wrote:
>>
>>> Thanks for the info.
>>>
>>> My concern has always been on how Spark handles autoscaling (adding new
>>> executors) when the load pattern changes.I have tried to test this with
>>> setting the following parameters (Spark 3.1.2 on GCP)
>>>
>>> spark-submit --verbose \
>>> ...
>>>   --conf spark.dynamicAllocation.enabled="true" \
>>>--conf spark.shuffle.service.enabled="true" \
>>>--conf spark.dynamicAllocation.minExecutors=2 \
>>>--conf spark.dynamicAllocation.maxExecutors=10 \
>>>--conf spark.dynamicAllocation.initialExecutors=4 \
>>>
>>> It is not very clear to me how Spark distributes tasks on the added
>>> executors and the source of delay. As you have observed there is a delay in
>>> adding new resources and allocating tasks. If that process is efficient?
>>>
>>> Thanks
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 4 Feb 2022 at 03:04, Maksim Grinman  wrote:
>>>
 It's actually on AWS EMR. The job bootstraps and runs fine -- the
 autoscaling group is to bring up a service that spark will be calling. Some
 code waits for the autoscaling group to come up before continuing
 processing in Spark, since the Spark cluster will need to make requests to
 the service in the autoscaling group. It takes several minutes for the
 service to come up, and during the wait, Spark starts to show these thread
 dumps, as presumably it thinks something is wrong since the executor is
 busy waiting and not doing anything.

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Mich Talebzadeh
From what I can see in the autoscaling setup, you will always need a minimum
of two primary worker nodes. It also states, and I quote, "Scaling primary
workers is not recommended due to HDFS limitations which result in
instability while scaling. These limitations do not exist for secondary
workers". So scaling is done through the secondary workers, for which you
specify the min and max number of instances. The autoscaling cooldown duration
also defaults to 2 minutes, hence the delay observed. I presume task allocation
to the new executors is FIFO for new tasks. This link

does some explanation on autoscaling.

Handling Spot Node Loss and Spot Blocks in Spark Clusters
"When the Spark AM receives the spot loss (Spot Node Loss or Spot Blocks)
notification from the RM, it notifies the Spark driver. The driver then
performs the following actions:

   1. Identifies all the executors affected by the upcoming node loss.
   2. Moves all of the affected executors to a decommissioning state, and
   no new tasks are scheduled on these executors.
   3. Kills all the executors after reaching 50% of the termination time.
   4. *Starts the failed tasks (if any) on other executors.*
   5. For these nodes, it removes all the entries of the shuffle data from
   the map output tracker on driver after reaching 90% of the termination
   time. This helps in preventing the shuffle-fetch failures due to spot loss.
   6. Recomputes the shuffle data from the lost node by stage resubmission
   and at the time shuffles data of spot node if required."

So basically, when a node fails, classic Spark comes into play: no new nodes
are added (no rescaling) and tasks are redistributed among the existing
executors, as I read it?
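
As an illustrative aside (not from the thread): a minimal sketch of the Spark
3.1+ settings that correspond to the graceful-decommissioning behaviour quoted
above. The config names below are as documented for Spark 3.1+ to the best of
my understanding; whether the cluster manager actually delivers the decommission
signal depends on the platform (YARN, Kubernetes, Dataproc, EMR).

import org.apache.spark.sql.SparkSession

// Sketch: executors on a node that is about to be lost stop taking new tasks
// and try to migrate their cached and shuffle blocks before the node goes away.
val spark = SparkSession.builder()
  .appName("decommission-sketch")
  // Decommission executors gracefully instead of losing them abruptly.
  .config("spark.decommission.enabled", "true")
  // Migrate block data off the node being decommissioned.
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .getOrCreate()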

   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:

> I have not seen stack traces under autoscaling, so not even sure what the
> error in question is.
> There is always delay in acquiring a whole new executor in the cloud as it
> usually means a new VM is provisioned.
> Spark treats the new executor like any other, available for executing
> tasks.
>
> On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh 
> wrote:
>
>> Thanks for the info.
>>
>> My concern has always been on how Spark handles autoscaling (adding new
>> executors) when the load pattern changes.I have tried to test this with
>> setting the following parameters (Spark 3.1.2 on GCP)
>>
>> spark-submit --verbose \
>> ...
>>   --conf spark.dynamicAllocation.enabled="true" \
>>--conf spark.shuffle.service.enabled="true" \
>>--conf spark.dynamicAllocation.minExecutors=2 \
>>--conf spark.dynamicAllocation.maxExecutors=10 \
>>--conf spark.dynamicAllocation.initialExecutors=4 \
>>
>> It is not very clear to me how Spark distributes tasks on the added
>> executors and the source of delay. As you have observed there is a delay in
>> adding new resources and allocating tasks. If that process is efficient?
>>
>> Thanks
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 4 Feb 2022 at 03:04, Maksim Grinman  wrote:
>>
>>> It's actually on AWS EMR. The job bootstraps and runs fine -- the
>>> autoscaling group is to bring up a service that spark will be calling. Some
>>> code waits for the autoscaling group to come up before continuing
>>> processing in Spark, since the Spark cluster will need to make requests to
>>> the service in the autoscaling group. It takes several minutes for the
>>> service to come up, and during the wait, Spark starts to show these thread
>>> dumps, as presumably it thinks something is wrong since the executor is
>>> busy waiting and not doing anything. The previous version of Spark did not
>>> do this (2.4.4).
>>>
>>> On Thu, Feb 3, 2022 at 6:59 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Sounds like you are running this on a Google Dataproc cluster (Spark 3.1.2)
 with an auto scaling policy?

Re: Spark 3.1 Json4s-native jar compatibility

2022-02-04 Thread Amit Sharma
Thanks Sean/Martin. My bad, the Spark version was actually 3.0.1, so using
json4s 3.6.6 fixed the issue.


Thanks
Amit
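
For reference, a build.sbt sketch of one way to keep json4s aligned with the
Spark distribution; the artifact list is assumed (only the Spark 3.0.1 / json4s
3.6.6 pairing comes from this thread):

// Let the Spark runtime provide its own json4s rather than shipping a second copy.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % "3.0.1" % Provided,
  "org.apache.spark" %% "spark-streaming" % "3.0.1" % Provided,
  // Only if a direct json4s dependency is really needed: pin the version that
  // matches the Spark release in use (3.6.6 against Spark 3.0.1 here).
  "org.json4s"       %% "json4s-native"   % "3.6.6" % Provided
)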

On Fri, Feb 4, 2022 at 3:37 PM Sean Owen  wrote:

> My guess is that something else you depend on is actually bringing in a
> different json4s, or you're otherwise mixing library/Spark versions. Use
> mvn dependency:tree or equivalent on your build to see what you actually
> build in. You probably do not need to include json4s at all as it is in
> Spark anway
>
> On Fri, Feb 4, 2022 at 2:35 PM Amit Sharma  wrote:
>
>> Martin Sean, changed it to  3.7.0-MS still getting the below error.
>> I am still getting the same issue
>> Exception in thread "streaming-job-executor-0"
>> java.lang.NoSuchMethodError:
>> org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;
>>
>>
>> Thanks
>> Amit
>>
>> On Fri, Feb 4, 2022 at 9:03 AM Martin Grigorov 
>> wrote:
>>
>>> Hi,
>>>
>>> Amit said that he uses Spark 3.1, so the link should be
>>> https://github.com/apache/spark/blob/branch-3.1/pom.xml#L879 (3.7.0-M5)
>>>
>>> @Amit: check your classpath. Maybe there are more jars of this
>>> dependency.
>>>
>>> On Thu, Feb 3, 2022 at 10:53 PM Sean Owen  wrote:
>>>
 You can look it up:
 https://github.com/apache/spark/blob/branch-3.2/pom.xml#L916
 3.7.0-M11

 On Thu, Feb 3, 2022 at 1:57 PM Amit Sharma 
 wrote:

> Hello, everyone. I am migrating my spark stream to spark version 3.1.
> I also upgraded  json version  as below
>
> libraryDependencies += "org.json4s" %% "json4s-native" % "3.7.0-M5"
>
>
> While running the job I getting an error for the below code where I am
> serializing the given inputs.
>
> implicit val formats = 
> Serialization.formats(ShortTypeHints(List(classOf[ForecastResponse], 
> classOf[OverlayRequest],
>   classOf[FTEResponseFromSpark], classOf[QuotaResponse], 
> classOf[CloneResponse]
>
> )))
>
>
> Exception in thread "streaming-job-executor-4" 
> java.lang.NoSuchMethodError: 
> org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;
>
> It seems to me jar issue, not sure which version of json4s-native should 
> I use with spark 3.1.
>
>


Re: Spark 3.1 Json4s-native jar compatibility

2022-02-04 Thread Sean Owen
My guess is that something else you depend on is actually bringing in a
different json4s, or you're otherwise mixing library/Spark versions. Use
mvn dependency:tree or equivalent on your build to see what you actually
pull in. You probably do not need to include json4s at all, as it is already
in Spark anyway.
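
For an sbt build, a sketch of the equivalent check (assuming sbt 1.4+, where the
dependency-tree plugin ships with sbt; the json4s module name is an example):

// project/plugins.sbt
addDependencyTreePlugin

// Then, from the sbt shell, something like:
//   dependencyTree
//   whatDependsOn org.json4s json4s-native_2.12
// shows which dependency is pulling in a conflicting json4s version.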

On Fri, Feb 4, 2022 at 2:35 PM Amit Sharma  wrote:

> Martin Sean, changed it to  3.7.0-MS still getting the below error.
> I am still getting the same issue
> Exception in thread "streaming-job-executor-0"
> java.lang.NoSuchMethodError:
> org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;
>
>
> Thanks
> Amit
>
> On Fri, Feb 4, 2022 at 9:03 AM Martin Grigorov 
> wrote:
>
>> Hi,
>>
>> Amit said that he uses Spark 3.1, so the link should be
>> https://github.com/apache/spark/blob/branch-3.1/pom.xml#L879 (3.7.0-M5)
>>
>> @Amit: check your classpath. Maybe there are more jars of this dependency.
>>
>> On Thu, Feb 3, 2022 at 10:53 PM Sean Owen  wrote:
>>
>>> You can look it up:
>>> https://github.com/apache/spark/blob/branch-3.2/pom.xml#L916
>>> 3.7.0-M11
>>>
>>> On Thu, Feb 3, 2022 at 1:57 PM Amit Sharma  wrote:
>>>
 Hello, everyone. I am migrating my spark stream to spark version 3.1. I
 also upgraded  json version  as below

 libraryDependencies += "org.json4s" %% "json4s-native" % "3.7.0-M5"


 While running the job I getting an error for the below code where I am
 serializing the given inputs.

 implicit val formats = 
 Serialization.formats(ShortTypeHints(List(classOf[ForecastResponse], 
 classOf[OverlayRequest],
   classOf[FTEResponseFromSpark], classOf[QuotaResponse], 
 classOf[CloneResponse]

 )))


 Exception in thread "streaming-job-executor-4" 
 java.lang.NoSuchMethodError: 
 org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;

 It seems to me jar issue, not sure which version of json4s-native should I 
 use with spark 3.1.




Re: Spark 3.1 Json4s-native jar compatibility

2022-02-04 Thread Amit Sharma
Martin, Sean: I changed it to 3.7.0-M5 and I am still getting the same error:
Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError:
org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;


Thanks
Amit

On Fri, Feb 4, 2022 at 9:03 AM Martin Grigorov  wrote:

> Hi,
>
> Amit said that he uses Spark 3.1, so the link should be
> https://github.com/apache/spark/blob/branch-3.1/pom.xml#L879 (3.7.0-M5)
>
> @Amit: check your classpath. Maybe there are more jars of this dependency.
>
> On Thu, Feb 3, 2022 at 10:53 PM Sean Owen  wrote:
>
>> You can look it up:
>> https://github.com/apache/spark/blob/branch-3.2/pom.xml#L916
>> 3.7.0-M11
>>
>> On Thu, Feb 3, 2022 at 1:57 PM Amit Sharma  wrote:
>>
>>> Hello, everyone. I am migrating my spark stream to spark version 3.1. I
>>> also upgraded  json version  as below
>>>
>>> libraryDependencies += "org.json4s" %% "json4s-native" % "3.7.0-M5"
>>>
>>>
>>> While running the job I getting an error for the below code where I am
>>> serializing the given inputs.
>>>
>>> implicit val formats = 
>>> Serialization.formats(ShortTypeHints(List(classOf[ForecastResponse], 
>>> classOf[OverlayRequest],
>>>   classOf[FTEResponseFromSpark], classOf[QuotaResponse], 
>>> classOf[CloneResponse]
>>>
>>> )))
>>>
>>>
>>> Exception in thread "streaming-job-executor-4" java.lang.NoSuchMethodError: 
>>> org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;
>>>
>>> It seems to me jar issue, not sure which version of json4s-native should I 
>>> use with spark 3.1.
>>>
>>>


Re: how can I remove the warning message

2022-02-04 Thread Martin Grigorov
Hi,

This is a JVM warning, as Sean explained. You cannot control it via loggers.
You can disable it by passing --illegal-access=permit to java.
Read more about it at
https://softwaregarden.dev/en/posts/new-java/illegal-access-in-java-16/


On Sun, Jan 30, 2022 at 4:32 PM Sean Owen  wrote:

> This one you can ignore. It's from the JVM so you might be able to disable
> it by configuring the right JVM logger as well, but it also tells you right
> in the message how to turn it off!
>
> But this is saying that some reflective operations are discouraged in Java
> 9+. They still work and Spark needs them, but they cause a warning now. You
> can however ignore it.
>
> On Sun, Jan 30, 2022 at 2:56 AM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> I have often found that logging in the warnings is extremely useful, they
>> are just logs, and provide a lot of insights during upgrades, external
>> package loading, deprecation, debugging, etc.
>>
>> Do you have any particular reason to disable the warnings in a submitted
>> job?
>>
>> I used to disable warnings in spark-shell  using the
>> Logger.getLogger("akka").setLevel(Level.OFF) in case I have not completely
>> forgotten. Other details are mentioned here:
>> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setLogLevel.html
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Jan 28, 2022 at 11:14 AM  wrote:
>>
>>> When I submitted the job from scala client, I got the warning messages:
>>>
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor
>>> java.nio.DirectByteBuffer(long,int)
>>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.spark.unsafe.Platform
>>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>>> reflective access operations
>>> WARNING: All illegal access operations will be denied in a future
>>> release
>>>
>>> How can I just remove those messages?
>>>
>>> spark: 3.2.0
>>> scala: 2.13.7
>>>
>>> Thank you.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Spark 3.1 Json4s-native jar compatibility

2022-02-04 Thread Martin Grigorov
Hi,

Amit said that he uses Spark 3.1, so the link should be
https://github.com/apache/spark/blob/branch-3.1/pom.xml#L879 (3.7.0-M5)

@Amit: check your classpath. Maybe there are more jars of this dependency.

On Thu, Feb 3, 2022 at 10:53 PM Sean Owen  wrote:

> You can look it up:
> https://github.com/apache/spark/blob/branch-3.2/pom.xml#L916
> 3.7.0-M11
>
> On Thu, Feb 3, 2022 at 1:57 PM Amit Sharma  wrote:
>
>> Hello, everyone. I am migrating my spark stream to spark version 3.1. I
>> also upgraded  json version  as below
>>
>> libraryDependencies += "org.json4s" %% "json4s-native" % "3.7.0-M5"
>>
>>
>> While running the job I getting an error for the below code where I am
>> serializing the given inputs.
>>
>> implicit val formats = 
>> Serialization.formats(ShortTypeHints(List(classOf[ForecastResponse], 
>> classOf[OverlayRequest],
>>   classOf[FTEResponseFromSpark], classOf[QuotaResponse], 
>> classOf[CloneResponse]
>>
>> )))
>>
>>
>> Exception in thread "streaming-job-executor-4" java.lang.NoSuchMethodError: 
>> org.json4s.ShortTypeHints$.apply$default$2()Ljava/lang/String;
>>
>> It seems to me jar issue, not sure which version of json4s-native should I 
>> use with spark 3.1.
>>
>>


Re: Python performance

2022-02-04 Thread Sean Owen
Yes, in the sense that any transformation that can be expressed in the
SQL-like DataFrame API will push down to the JVM and take advantage of
other optimizations, avoiding the data movement to/from Python and more.
But you can't do this if you're expressing operations that are not in the
DataFrame API, i.e. custom logic. They are not always alternatives.

There, pandas UDFs are a better choice in Python, as you can take advantage
of Arrow for data movement, and that is also a reason to use DataFrames in
a case like this. It still has to execute code in Python though.
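
As a rough illustration of that distinction, a sketch in Scala with made-up
data: the first aggregation stays in the DataFrame API (from PySpark it would be
evaluated entirely in the JVM), while the second uses RDD-level custom logic,
which is opaque to the optimizer (and from PySpark would also ship rows through
Python workers):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameVsCustomLogic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("a", 10.0), ("b", 5.0), ("a", 7.5)).toDF("key", "amount")

    // Expressible in the DataFrame API: optimized by Catalyst.
    val totals = sales.groupBy("key").agg(sum("amount").as("total"))

    // Custom logic on the RDD: no Catalyst optimization.
    val totalsRdd = sales.rdd
      .map(r => (r.getString(0), r.getDouble(1)))
      .reduceByKey(_ + _)

    totals.show()
    totalsRdd.collect().foreach(println)
    spark.stop()
  }
}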

On Fri, Feb 4, 2022 at 3:20 AM Bitfox  wrote:

> Please see my this test:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> Don’t use Python RDD, using dataframe instead.
>
> Regards
>
> On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar 
> wrote:
>
>> I'm looking into using Python interface with Spark and came across this
>> [1] chart showing some performance hit when going with Python RDD. Data is
>> ~ 7 years and for older version of Spark. Is this still the case with more
>> recent Spark releases?
>>
>> I'm trying to understand what to expect from Python and Spark and under
>> what conditions.
>>
>> [1]
>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>
>> Thanks,
>> //hinko
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Sean Owen
I have not seen stack traces under autoscaling, so not even sure what the
error in question is.
There is always delay in acquiring a whole new executor in the cloud as it
usually means a new VM is provisioned.
Spark treats the new executor like any other, available for executing tasks.

On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh 
wrote:

> Thanks for the info.
>
> My concern has always been on how Spark handles autoscaling (adding new
> executors) when the load pattern changes.I have tried to test this with
> setting the following parameters (Spark 3.1.2 on GCP)
>
> spark-submit --verbose \
> ...
>   --conf spark.dynamicAllocation.enabled="true" \
>--conf spark.shuffle.service.enabled="true" \
>--conf spark.dynamicAllocation.minExecutors=2 \
>--conf spark.dynamicAllocation.maxExecutors=10 \
>--conf spark.dynamicAllocation.initialExecutors=4 \
>
> It is not very clear to me how Spark distributes tasks on the added
> executors and the source of delay. As you have observed there is a delay in
> adding new resources and allocating tasks. If that process is efficient?
>
> Thanks
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 4 Feb 2022 at 03:04, Maksim Grinman  wrote:
>
>> It's actually on AWS EMR. The job bootstraps and runs fine -- the
>> autoscaling group is to bring up a service that spark will be calling. Some
>> code waits for the autoscaling group to come up before continuing
>> processing in Spark, since the Spark cluster will need to make requests to
>> the service in the autoscaling group. It takes several minutes for the
>> service to come up, and during the wait, Spark starts to show these thread
>> dumps, as presumably it thinks something is wrong since the executor is
>> busy waiting and not doing anything. The previous version of Spark did not
>> do this (2.4.4).
>>
>> On Thu, Feb 3, 2022 at 6:59 PM Mich Talebzadeh 
>> wrote:
>>
>>> Sounds like you are running this on Google Dataproc cluster (spark
>>> 3.1.2)  with auto scaling policy?
>>>
>>>  Can you describe if this happens before Spark starts a new job on the
>>> cluster or somehow half way through processing an existing job?
>>>
>>> Also is the job involved doing Spark Structured Streaming?
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 3 Feb 2022 at 21:29, Maksim Grinman  wrote:
>>>
 We've got a spark task that, after some processing, starts an
 autoscaling group and waits for it to be up before continuing processing.
 While waiting for the autoscaling group, spark starts throwing full thread
 dumps, presumably at the spark.executor.heartbeat interval. Is there a way
 to prevent the thread dumps?

 --
 Maksim Grinman
 VP Engineering
 Resolute AI

>>>
>>
>> --
>> Maksim Grinman
>> VP Engineering
>> Resolute AI
>>
>


Re: Spark on K8s : property simillar to yarn.max.application.attempt

2022-02-04 Thread Mich Talebzadeh
Not as far as I know. If your driver pod fails, then you need to resubmit
the job. I cannot see what else can be done.


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 4 Feb 2022 at 10:22, Pralabh Kumar  wrote:

> Hi Spark Team
>
> I am running spark on K8s and looking for a
> property/mechanism similar to  yarn.max.application.attempt . I know this
> is not really a spark question , but i thought if anyone have faced the
> similar issue,
>
> Basically I want if my driver pod fails , it should be retried on a
> different machine . Is there a way to do the same .
>
> Regards
> Pralabh Kumar
>


Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Mich Talebzadeh
Thanks for the info.

My concern has always been with how Spark handles autoscaling (adding new
executors) when the load pattern changes. I have tried to test this by setting
the following parameters (Spark 3.1.2 on GCP):

spark-submit --verbose \
...
  --conf spark.dynamicAllocation.enabled="true" \
  --conf spark.shuffle.service.enabled="true" \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.initialExecutors=4 \

It is not very clear to me how Spark distributes tasks to the added executors,
or what the source of the delay is. As you have observed, there is a delay in
adding new resources and allocating tasks. Is that process efficient?
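
A sketch of the dynamic-allocation knobs involved (the values shown are the
documented defaults, not taken from this job): executors are requested once
tasks have been pending for the backlog timeouts below, and a newly registered
executor is treated like any other, so pending tasks can be scheduled on it
straight away, subject to the locality wait.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  // Request executors once tasks have been pending this long...
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  // ...and keep requesting more at this interval while the backlog persists.
  .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
  // Release executors that have been idle this long.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // How long the scheduler holds a task for a data-local slot before handing
  // it to any free executor, including a newly added one.
  .config("spark.locality.wait", "3s")
  .getOrCreate()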

Thanks

   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 4 Feb 2022 at 03:04, Maksim Grinman  wrote:

> It's actually on AWS EMR. The job bootstraps and runs fine -- the
> autoscaling group is to bring up a service that spark will be calling. Some
> code waits for the autoscaling group to come up before continuing
> processing in Spark, since the Spark cluster will need to make requests to
> the service in the autoscaling group. It takes several minutes for the
> service to come up, and during the wait, Spark starts to show these thread
> dumps, as presumably it thinks something is wrong since the executor is
> busy waiting and not doing anything. The previous version of Spark did not
> do this (2.4.4).
>
> On Thu, Feb 3, 2022 at 6:59 PM Mich Talebzadeh 
> wrote:
>
>> Sounds like you are running this on Google Dataproc cluster (spark
>> 3.1.2)  with auto scaling policy?
>>
>>  Can you describe if this happens before Spark starts a new job on the
>> cluster or somehow half way through processing an existing job?
>>
>> Also is the job involved doing Spark Structured Streaming?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 3 Feb 2022 at 21:29, Maksim Grinman  wrote:
>>
>>> We've got a spark task that, after some processing, starts an
>>> autoscaling group and waits for it to be up before continuing processing.
>>> While waiting for the autoscaling group, spark starts throwing full thread
>>> dumps, presumably at the spark.executor.heartbeat interval. Is there a way
>>> to prevent the thread dumps?
>>>
>>> --
>>> Maksim Grinman
>>> VP Engineering
>>> Resolute AI
>>>
>>
>
> --
> Maksim Grinman
> VP Engineering
> Resolute AI
>


Spark on K8s : property simillar to yarn.max.application.attempt

2022-02-04 Thread Pralabh Kumar
Hi Spark Team

I am running Spark on K8s and am looking for a property/mechanism similar to
yarn.max.application.attempt. I know this is not really a Spark question, but
I thought I would ask in case anyone has faced a similar issue.

Basically, if my driver pod fails, I want it to be retried on a different
machine. Is there a way to do the same?

Regards
Pralabh Kumar


Re: Python performance

2022-02-04 Thread Bitfox
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don't use Python RDDs; use DataFrames instead.

Regards

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar 
wrote:

> I'm looking into using Python interface with Spark and came across this
> [1] chart showing some performance hit when going with Python RDD. Data is
> ~ 7 years and for older version of Spark. Is this still the case with more
> recent Spark releases?
>
> I'm trying to understand what to expect from Python and Spark and under
> what conditions.
>
> [1]
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> Thanks,
> //hinko
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Python performance

2022-02-04 Thread Hinko Kocevar
I'm looking into using the Python interface with Spark and came across this [1]
chart showing some performance hit when going with Python RDDs. The data is ~7
years old and for an older version of Spark. Is this still the case with more
recent Spark releases?

I'm trying to understand what to expect from Python and Spark and under what 
conditions.

[1] 
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Thanks,
//hinko
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org