[jira] [Updated] (FLINK-36553) Local Global aggregation in DataStream API

Jacob Jona Fahlenkamp (Jira) Wed, 16 Oct 2024 04:46:30 -0700


     [ https://issues.apache.org/jira/browse/FLINK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jacob Jona Fahlenkamp updated FLINK-36553:
------------------------------------------
    Description: 
This would be useful for the same reasons as in the Table API. AggregateFunctions 
already have a merge method, which could be used to merge the local aggregates.

Batch jobs that read a large amount of raw data -> key by -> aggregate 
currently write all the raw data to disk before the shuffle. The resulting disk 
requirements make it infeasible to run a batch job over a large span of data. 
With local-global aggregation, the data could be pre-aggregated before being 
written to disk.

For streaming jobs, it would help with network load and data skew.
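
To illustrate the idea described above, here is a minimal plain-Java sketch, not a 
Flink implementation and not a proposed API. The nested Agg interface is a 
hypothetical stand-in that only mirrors the four methods of Flink's 
AggregateFunction (createAccumulator, add, getResult, merge); the names 
LocalGlobalSketch, SumAgg, localAggregate, globalAggregate and demo are assumptions 
made up for this example. Each partition folds its raw records into one partial 
accumulator per key, and only those partials would cross the shuffle, where merge 
combines them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalGlobalSketch {

    // Hypothetical stand-in that mirrors the method set of Flink's AggregateFunction.
    interface Agg<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    // Example aggregate: a running sum with a boxed Long as the accumulator.
    static class SumAgg implements Agg<Long, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(Long value, Long accumulator) { return accumulator + value; }
        public Long getResult(Long accumulator) { return accumulator; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    // Local phase: fold one partition's raw records into one accumulator per key.
    // Only these partial accumulators would need to be written out and shuffled.
    static <K, IN, ACC> Map<K, ACC> localAggregate(
            List<Map.Entry<K, IN>> partition, Agg<IN, ACC, ?> agg) {
        Map<K, ACC> partials = new HashMap<>();
        for (Map.Entry<K, IN> record : partition) {
            K key = record.getKey();
            ACC acc = partials.containsKey(key) ? partials.get(key) : agg.createAccumulator();
            partials.put(key, agg.add(record.getValue(), acc));
        }
        return partials;
    }

    // Global phase: combine the partial accumulators per key via merge(),
    // then turn each merged accumulator into the final result.
    static <K, IN, ACC, OUT> Map<K, OUT> globalAggregate(
            List<Map<K, ACC>> shuffledPartials, Agg<IN, ACC, OUT> agg) {
        Map<K, ACC> merged = new HashMap<>();
        for (Map<K, ACC> partials : shuffledPartials) {
            for (Map.Entry<K, ACC> e : partials.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), agg::merge);
            }
        }
        Map<K, OUT> results = new HashMap<>();
        for (Map.Entry<K, ACC> e : merged.entrySet()) {
            results.put(e.getKey(), agg.getResult(e.getValue()));
        }
        return results;
    }

    // Two partitions with six raw records in total; only four partials are shuffled.
    static Map<String, Long> demo() {
        SumAgg sum = new SumAgg();
        List<Map.Entry<String, Long>> p1 =
                List.of(Map.entry("a", 1L), Map.entry("a", 2L), Map.entry("b", 5L));
        List<Map.Entry<String, Long>> p2 =
                List.of(Map.entry("a", 10L), Map.entry("b", 1L), Map.entry("b", 1L));
        Map<String, Long> local1 = localAggregate(p1, sum);
        Map<String, Long> local2 = localAggregate(p2, sum);
        List<Map<String, Long>> shuffled = List.of(local1, local2);
        return globalAggregate(shuffled, sum);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy input, key "a" ends up with 13 and key "b" with 7, while the number of 
records crossing the shuffle drops from six raw records to four partial 
accumulators. The reduction grows with the number of records per key in each 
partition, which is also why a hot key would put less load on a single downstream 
task.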

  was:
This would be useful for the same reasons as in the Table API. AggregateFunctions 
already have a merge method, which could be used to merge the local aggregates.

Batch jobs that read a large amount of raw data -> key by -> aggregate 
currently write all the raw data to disk before the shuffle. The resulting disk 
requirements make it infeasible to run a batch job over a large span of data. 
With local-global aggregation, the data could be pre-aggregated before being 
written to disk.

For streaming jobs, it would help with network throughput and data skew.


> Local Global aggregation in DataStream API
> ------------------------------------------
>
>                 Key: FLINK-36553
>                 URL: https://issues.apache.org/jira/browse/FLINK-36553
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream
>            Reporter: Jacob Jona Fahlenkamp
>            Priority: Major
>
> This would be useful for the same reasons as in the Table API. 
> AggregateFunctions already have a merge method, which could be used to merge 
> the local aggregates.
> Batch jobs that read a large amount of raw data -> key by -> aggregate 
> currently write all the raw data to disk before the shuffle. The resulting 
> disk requirements make it infeasible to run a batch job over a large span of 
> data. With local-global aggregation, the data could be pre-aggregated before 
> being written to disk.
> For streaming jobs, it would help with network load and data skew.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
