[ https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478029#comment-17478029 ]

Ethan Guo commented on HUDI-1628:
---------------------------------

My approach to this:

- For new file writes (insert, upsert, etc.), sorting is handled at the write
handle level:
  - For upsert, HoodieMergeHandle performs the file write, and all records are
known before writing, so we can sort the records on one or more columns
(e.g., via a space-filling curve) before the actual write.  This adds memory
pressure.

  - For insert, HoodieCreateHandle performs the file write and does dynamic file
sizing, so it is not known in advance when a file will be closed.  In this case,
we need to sort the records within each Spark RDD partition beforehand.

- For the partitioner, we need to abstract a cleaner Partitioner interface so
that sorting logic does not leak into the core write path.
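To make the space-curve idea concrete, here is a minimal, self-contained sketch of sorting rows by a Z-order key interleaved from two integer columns before writing. The class and method names (ZOrderSort, zOrderKey, sortByZOrder) are illustrative only, not Hudi APIs; a real implementation would operate on HoodieRecord column values and handle more types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order records along a Z-order (space-filling) curve
// over two integer columns, so that rows close in the multi-column space
// end up close together in the written file.
public class ZOrderSort {

    // Interleave the low 32 bits of x and y into a 64-bit Z-order key:
    // bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    static long zOrderKey(int x, int y) {
        long key = 0L;
        for (int i = 0; i < 32; i++) {
            key |= ((long) (x >>> i) & 1L) << (2 * i);
            key |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return key;
    }

    // Sort rows (each an int[2] of column values) in place by Z-order key.
    static List<int[]> sortByZOrder(List<int[]> rows) {
        rows.sort(Comparator.comparingLong(r -> zOrderKey(r[0], r[1])));
        return rows;
    }

    public static void main(String[] args) {
        List<int[]> rows = new ArrayList<>(List.of(
                new int[]{3, 1}, new int[]{0, 0},
                new int[]{1, 2}, new int[]{2, 3}));
        sortByZOrder(rows);
        for (int[] r : rows) {
            System.out.println(r[0] + "," + r[1]);
        }
    }
}
```

In the HoodieMergeHandle case this sort would run over the fully collected record set before the write; in the HoodieCreateHandle case the same comparator would instead be applied per Spark RDD partition upstream of the handle.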

> [Umbrella] Improve data locality during ingestion
> -------------------------------------------------
>
>                 Key: HUDI-1628
>                 URL: https://issues.apache.org/jira/browse/HUDI-1628
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: writer-core
>            Reporter: satish
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 0.11.0
>
>
> Today the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions, and have it also consider "affinity" to an existing file
> group based on, say, information stored in the metadata table.
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
> for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
