[
https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046015#comment-17046015
]
Vinoth Chandar commented on HUDI-315:
-------------------------------------
Great! I actually spent some time on this today. In short, using accumulators
may not be possible :( since their values won't be available until
countByKey() runs (which defeats the whole point). Take a stab; would love to
hear your thoughts as well.
> Reimplement statistics/workload profile collected during writes using Spark
> 2.x custom accumulators
> ---------------------------------------------------------------------------------------------------
>
> Key: HUDI-315
> URL: https://issues.apache.org/jira/browse/HUDI-315
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Performance, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Yanjia Gary Li
> Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>
> In Hudi, there are two places where we need to obtain statistics on the input
> data
> - HoodieBloomIndex: for knowing which partitions need to be loaded and
> checked against (whether this is still needed with the timeline server
> enabled is a separate question)
> - Workload profile to get a sense of number of updates, inserts to each
> partition/file group
> Both of them issue their own groupBy or shuffle computation today. This
> could be avoided using an accumulator.
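The laziness problem Vinoth's comment points at can be sketched in plain Python (this is illustrative only, not Spark or Hudi code; the `Accumulator` and `lazy_map` names are made up for the example). A Spark transformation that updates an accumulator doesn't actually run until an action forces it, so the accumulated statistics are empty at exactly the moment the writer would want to read them:

```python
# Hypothetical sketch: a generator stands in for a lazy Spark transformation,
# and a dict-backed counter stands in for a custom accumulator.
class Accumulator:
    def __init__(self):
        self.value = {}

    def add(self, key):
        # Count one record per partition key.
        self.value[key] = self.value.get(key, 0) + 1


def lazy_map(records, acc):
    # Like a Spark map(): nothing executes until the result is consumed.
    for r in records:
        acc.add(r["partition"])
        yield r


acc = Accumulator()
pipeline = lazy_map(
    [{"partition": "2020/01"}, {"partition": "2020/02"}], acc
)

print(acc.value)  # {} -- empty: the "transformation" has not executed yet
list(pipeline)    # consuming the pipeline plays the role of an action
print(acc.value)  # {'2020/01': 1, '2020/02': 1}
```

Since Hudi needs the workload profile *before* writing, it would have to trigger an action (e.g. countByKey) anyway just to populate the accumulator, which is the shuffle the proposal was trying to eliminate.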
--
This message was sent by Atlassian Jira
(v8.3.4#803005)