[
https://issues.apache.org/jira/browse/FLINK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacob Jona Fahlenkamp updated FLINK-36553:
------------------------------------------
Description:
This would be useful for the same reasons as in Table API. AggregateFunctions
already have a merge method, which could be used to merge the local aggregates.
Batch jobs where you read a large amount of raw data -> key by -> aggregate
currently write all the raw data to disk before the shuffle. This makes it
infeasible to run a batch job over a large span of data, because of large disk
requirements. If we had local global-aggregation the data could be
preaggregated already, before being written to disk.
For stream jobs it would help with network load and data skew.
was:
This would be useful for the same reasons as in Table API. AggregateFunctions
already have a merge method, which could be used to merge the local aggregates.
Batch jobs where you read a large amount of raw data -> key by -> aggregate
currently write all the raw data to disk before the shuffle. This makes it
infeasible to run a batch job over a large span of data, because of large disk
requirements. If we had local global-aggregation the data could be
preaggregated already, before being written to disk.
For stream jobs it would help with network throughput and data skew.
> Local Global aggregation in Datasream API
> -----------------------------------------
>
> Key: FLINK-36553
> URL: https://issues.apache.org/jira/browse/FLINK-36553
> Project: Flink
> Issue Type: New Feature
> Components: API / DataStream
> Reporter: Jacob Jona Fahlenkamp
> Priority: Major
>
> This would be useful for the same reasons as in Table API. AggregateFunctions
> already have a merge method, which could be used to merge the local
> aggregates.
> Batch jobs where you read a large amount of raw data -> key by -> aggregate
> currently write all the raw data to disk before the shuffle. This makes it
> infeasible to run a batch job over a large span of data, because of large
> disk requirements. If we had local global-aggregation the data could be
> preaggregated already, before being written to disk.
> For stream jobs it would help with network load and data skew.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)