[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ]
Stavros Kontopoulos edited comment on SPARK-24815 at 6/2/19 12:01 PM:
----------------------------------------------------------------------
Yes, it's 1-1. What I am trying to say is that if you have N Spark
tasks/partitions processed in parallel and you want to do dynamic
re-partitioning on the Spark side only (_repartitioning_ on the Kafka side is
an _expensive_ operation for sure, afaik; on the Spark side it depends) because
these N tasks got "fatter", then you need the processing/batch_duration ratio
metric: your backlog task count is zero anyway, so otherwise you have no idea
what is happening. After the re-partitioning is done on the Spark side you
could fall back to the backlog metric (since now you will have tasks queued),
but there is no reason to; you could just keep the processing/batch_duration
ratio.
The alternative is not to have a 1-1 relationship, as with the other sources,
and then the backlog metric is good enough.
Regarding references, there was a discussion last year:
[http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html#a25942]
There is also this project, which shows how the APIs can be extended:
[https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state]
In general, the source of truth is the code, as you know. The related Jira for
the state backend is here: https://issues.apache.org/jira/browse/SPARK-13809
and the related doc is here:
[https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254]
> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Karthik Palaniappan
> Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing
> containers to match the actual workload. On multi-tenant clusters, it ensures
> that a Spark job is taking no more resources than necessary. In cloud
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured
> streaming job, the batch dynamic allocation algorithm kicks in. It requests
> more executors if the task backlog is a certain size, and removes executors
> if they are idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a
> particular implementation in SparkContext.scala (this should be a separate
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the
> batch algorithm. Eventually, continuous processing might need its own
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when
> Core's dynamic allocation is enabled (a minimal sketch follows below).
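Regarding point 3 above, a hedged sketch of what such a warning could look like, shown here from the application side as a StreamingQueryListener rather than as the Spark-internal change the ticket proposes (the listener class name is hypothetical):
{code:scala}
// Hedged sketch for point 3: log a warning when a streaming query starts while
// spark.dynamicAllocation.enabled=true. Illustration only, not an actual Spark patch.
import org.slf4j.LoggerFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class DynamicAllocationWarningListener(spark: SparkSession) extends StreamingQueryListener {
  private val log = LoggerFactory.getLogger(classOf[DynamicAllocationWarningListener])

  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    val batchDynamicAllocation =
      spark.sparkContext.getConf.getBoolean("spark.dynamicAllocation.enabled", false)
    if (batchDynamicAllocation) {
      log.warn(s"Streaming query ${event.id} started with spark.dynamicAllocation.enabled=true; " +
        "the batch dynamic allocation heuristic is not designed for Structured Streaming workloads.")
    }
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Usage (assuming an existing SparkSession named spark):
// spark.streams.addListener(new DynamicAllocationWarningListener(spark))
{code}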