Michael Armbrust commented on SPARK-17813:

I think it's okay to ignore compacted topics, at least initially.  You would 
still respect the "maximum" nature of the configuration, though you would 
waste some effort scheduling tasks smaller than the max.

I would probably start simple and just have a global {{maxOffsetsPerTrigger}} 
that bounds the total number of records in each batch and is distributed 
among the topic partitions.  Topic partitions that are skewed too small will 
not have enough offsets available, and we can spill that budget over to the 
ones that are skewed large.  We can always add something more complicated in 
the future.
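A minimal sketch of that spill-over idea (illustrative only, not Spark's actual implementation; the function name and dict-based partition representation are hypothetical): split the global cap evenly, then let partitions whose backlog is smaller than their share free up budget for the larger ones.

```python
def spill_over_limit(max_offsets, available):
    """Distribute a global offset cap across topic partitions.

    available: dict mapping topic partition -> number of unread offsets.
    Returns a dict of per-partition limits summing to at most max_offsets.
    """
    limits = {tp: 0 for tp in available}
    remaining = max_offsets
    # Repeatedly grant each still-hungry partition an even share; partitions
    # with less backlog than their share leave budget for the skewed-large ones.
    while remaining > 0 and any(limits[tp] < available[tp] for tp in available):
        pending = [tp for tp in available if limits[tp] < available[tp]]
        share = max(remaining // len(pending), 1)
        for tp in pending:
            grant = min(share, available[tp] - limits[tp], remaining)
            limits[tp] += grant
            remaining -= grant
            if remaining == 0:
                break
    return limits
```

For example, with a cap of 100 and one small, one large partition, the small partition takes only what it has and the rest spills to the large one.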

An alternative proposal would be to spread out the max to each partition 
proportional to the total number of offsets available when planning.
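The proportional alternative could look like this (again a hypothetical sketch, not the committed implementation): each partition gets a slice of the cap proportional to its share of the total available offsets at planning time.

```python
def proportional_limit(max_offsets, available):
    """Split a global offset cap proportionally to each partition's backlog.

    available: dict mapping topic partition -> number of unread offsets.
    """
    total = sum(available.values())
    if total <= max_offsets:
        # Everything fits in one batch; no limiting needed.
        return dict(available)
    # Proportional share, rounded down; a real implementation would also
    # need to account for the offsets lost to rounding.
    return {tp: max_offsets * n // total for tp, n in available.items()}
```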

Regarding [SPARK-17510], I would make this configuration an option to the 
DataStreamReader, so you'd be able to configure it per stream instead of 
globally.  So, I think we are good.

> Maximum data per trigger
> ------------------------
>                 Key: SPARK-17813
>                 URL: https://issues.apache.org/jira/browse/SPARK-17813
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.
