[ 
https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360549#comment-17360549
 ] 

Thirumalai Raj R commented on HUDI-1628:
----------------------------------------

Hi [~vinoth] / [~satishkotha], is anyone working on this feature? When we 
tried to insert data into a Hudi COW table with drop-duplicates enabled using 
Spark Streaming (DStreams), the pipeline did not scale: min/max pruning in 
HoodieBloomIndex was not effective, so the exploded RDD grew to more than 5x 
the input size and became a bottleneck in the shuffle stage. 

If no one has started working on this, I would like to understand the 
requirements better and contribute to it. 
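
To make the fan-out concrete, here is a minimal sketch (in Python, purely illustrative; Hudi's actual index is implemented in Java/Scala and these names are not Hudi's API) of min/max key-range pruning: each record key is matched against every file whose [min_key, max_key] range could contain it, and when file key ranges overlap heavily, the pruning step eliminates almost nothing and each record fans out into many (record, file) candidate pairs before the bloom-filter check.

```python
# Hypothetical sketch of min/max key-range pruning. With heavily overlapping
# file key ranges, each record key matches many candidate files, so the
# exploded (record_key, file) RDD is several times the input size.

def candidate_files(record_key, file_ranges):
    """Return files whose [min_key, max_key] range could contain record_key."""
    return [f for f, (lo, hi) in file_ranges.items() if lo <= record_key <= hi]

# Illustrative file statistics: five files whose key ranges all overlap.
file_ranges = {
    "f1": ("a", "m"),
    "f2": ("c", "p"),
    "f3": ("e", "t"),
    "f4": ("g", "z"),
    "f5": ("a", "z"),
}
record_keys = ["h", "i", "j", "k"]
pairs = [(k, f) for k in record_keys for f in candidate_files(k, file_ranges)]
# Every key here falls inside all five ranges: 4 keys explode into 20 pairs,
# a 5x fan-out that must all be shuffled before the bloom-filter lookups.
```

Better key-to-file-group locality would tighten per-file key ranges, letting min/max pruning discard most candidates up front.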

> Improve data locality during ingestion
> --------------------------------------
>
>                 Key: HUDI-1628
>                 URL: https://issues.apache.org/jira/browse/HUDI-1628
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Writer Core
>            Reporter: satish
>            Priority: Major
>
> Today the upsert partitioner does the file sizing/bin-packing etc for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions, and have it also consider "affinity" to an existing file group
> based on, say, information stored in the metadata table.
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
>  for more details
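
The routing the issue describes can be sketched roughly as follows (a minimal Python illustration under assumed names and thresholds; Hudi's real upsert partitioner is Java and its sizing logic is more involved): inserts are greedily packed into existing file groups that are below the target file size, and any remainder spills into a new file group. An affinity-aware strategy would additionally weight the choice of existing group by key locality.

```python
# Hypothetical sketch of insert bin-packing to maintain target file size.
# TARGET_FILE_SIZE, group ids, and sizes are illustrative, not Hudi's API.

TARGET_FILE_SIZE = 100  # illustrative size units

def assign_inserts(num_records, record_size, file_groups):
    """Greedy bin-packing; file_groups maps group_id -> current size."""
    assignments = {}          # group_id -> number of records routed there
    remaining = num_records
    # Fill the smallest (most under-sized) file groups first.
    for gid, size in sorted(file_groups.items(), key=lambda kv: kv[1]):
        room = max(0, (TARGET_FILE_SIZE - size) // record_size)
        take = min(room, remaining)
        if take:
            assignments[gid] = take
            remaining -= take
        if remaining == 0:
            break
    if remaining:
        assignments["new-group"] = remaining  # spill into a fresh file group
    return assignments

plan = assign_inserts(num_records=30, record_size=5,
                      file_groups={"g1": 90, "g2": 40})
# g2 has room for (100-40)//5 = 12 records, g1 for 2; the remaining 16 go to
# a new file group: {"g2": 12, "g1": 2, "new-group": 16}
```

Abstracting this into a pluggable strategy, as the issue suggests, would let an affinity-aware variant replace the pure size-based ordering without touching the rest of the write pipeline.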



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
