[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ]
Stavros Kontopoulos edited comment on SPARK-24815 at 6/2/19 12:01 PM:
----------------------------------------------------------------------
Yes, it's 1-1. What I am trying to say is that if you have N Spark
tasks/partitions processed in parallel and you want to do dynamic
re-partitioning on the Spark side only (_repartitioning_ on the Kafka side is
an _expensive_ operation for sure, afaik; on the Spark side it depends) because
these N tasks got "fatter", then you need the processing/batch_duration ratio
metric: your backlog task count is zero anyway, so otherwise you have no idea
what is happening. After the re-partitioning is done on the Spark side you
could fall back to the backlog metric (since now you will have tasks queued),
but there is no reason to; you could just keep the processing/batch_duration
ratio.
The alternative is not to have a 1-1 relationship, as with the other sources,
and then the backlog metric is good enough.
Regarding references, there was a discussion last year:
[http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html#a25942]
There is also this project, which shows how the APIs can be extended:
[https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state]
In general, the source of truth is the code, as you know. The related Jira for
the state backend is here: https://issues.apache.org/jira/browse/SPARK-13809
and the related doc is here:
[https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254]
> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Karthik Palaniappan
> Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing
> containers to match the actual workload. On multi-tenant clusters, it ensures
> that a Spark job is taking no more resources than necessary. In cloud
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured
> streaming job, the batch dynamic allocation algorithm kicks in. It requests
> more executors if the task backlog is a certain size, and removes executors
> if they are idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a
> particular implementation in SparkContext.scala (this should be a separate
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the
> batch algorithm. Eventually, continuous processing might need its own
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when
> Core's dynamic allocation is enabled (a minimal sketch follows below).
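Regarding point 3 above, a hedged sketch of what such a warning could look like, shown here from the application side as a StreamingQueryListener rather than as the Spark-internal change the ticket proposes (the listener class name is hypothetical):
{code:scala}
// Hedged sketch for point 3: log a warning when a streaming query starts while
// spark.dynamicAllocation.enabled=true. Illustration only, not an actual Spark patch.
import org.slf4j.LoggerFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class DynamicAllocationWarningListener(spark: SparkSession) extends StreamingQueryListener {
  private val log = LoggerFactory.getLogger(classOf[DynamicAllocationWarningListener])

  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    val batchDynamicAllocation =
      spark.sparkContext.getConf.getBoolean("spark.dynamicAllocation.enabled", false)
    if (batchDynamicAllocation) {
      log.warn(s"Streaming query ${event.id} started with spark.dynamicAllocation.enabled=true; " +
        "the batch dynamic allocation heuristic is not designed for Structured Streaming workloads.")
    }
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Usage (assuming an existing SparkSession named spark):
// spark.streams.addListener(new DynamicAllocationWarningListener(spark))
{code}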