[
https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360549#comment-17360549
]
Thirumalai Raj R commented on HUDI-1628:
----------------------------------------
Hi [~vinoth] / [~satishkotha], is anyone working on this feature? When we
tried to insert data into a Hudi COW table with drop-duplicates enabled using
Spark Streaming (DStreams), the pipeline wasn't scaling: min/max pruning in
HoodieBloomIndex wasn't effective, and the exploded RDD grew to more than 5x
the input size, which caused a bottleneck in the shuffle stage.
If no one has started working on this, I would like to understand the
requirements better and contribute to it.
> Improve data locality during ingestion
> --------------------------------------
>
> Key: HUDI-1628
> URL: https://issues.apache.org/jira/browse/HUDI-1628
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Writer Core
> Reporter: satish
> Priority: Major
>
> Today, the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We could abstract all of this into strategies and some kind of pipeline
> abstraction, and have it also consider "affinity" to an existing file
> group based on, say, information stored in the metadata table?
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
> for more details
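One way to read the proposal above: make bucket assignment for inserts pluggable, with an affinity-aware strategy layered on top of today's size-based bin-packing. A minimal sketch, assuming the affinity hint comes from a metadata-table lookup; all names here (BucketAssignmentStrategy, AffinityFirstStrategy, IncomingRecord) are hypothetical, not actual Hudi APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A record key with an optional affinity hint (e.g. looked up from the metadata table).
record IncomingRecord(String key, Optional<String> affineFileGroupId) {}

// Pluggable strategy: decide which file group an insert should be routed to,
// given the current sizes of the candidate file groups.
interface BucketAssignmentStrategy {
    String assignFileGroup(IncomingRecord record,
                           Map<String, Long> fileGroupSizes,
                           long maxFileSize);
}

// Affinity-first strategy: honor the hint while the target file group still has
// room, otherwise fall back to simple bin-packing into the smallest group.
class AffinityFirstStrategy implements BucketAssignmentStrategy {
    @Override
    public String assignFileGroup(IncomingRecord record,
                                  Map<String, Long> fileGroupSizes,
                                  long maxFileSize) {
        if (record.affineFileGroupId().isPresent()) {
            String fg = record.affineFileGroupId().get();
            Long size = fileGroupSizes.get(fg);
            if (size != null && size < maxFileSize) {
                return fg;  // affinity target still under the size budget
            }
        }
        // Fallback: smallest existing file group (naive bin-packing).
        return fileGroupSizes.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}
```

The point of the interface is that the existing size-based partitioning and the affinity-aware variant become interchangeable strategies rather than hard-coded logic in the upsert partitioner.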
--
This message was sent by Atlassian Jira
(v8.3.4#803005)