[GitHub] [hudi] nsivabalan commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

GitBox Mon, 08 Mar 2021 09:13:36 -0800


nsivabalan commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-792912918



   Hi @codejoyan: a few clarifying questions on your use case and record 
keys. 
   - What constitutes your record key? Is it completely random, or does it have 
any ordering property? 
   - For example, if your record key includes a timestamp, we could leverage file 
pruning by the min and max key ranges of each data file. But if it's completely 
random, the pruning step would be pure overhead, since we may not filter out any 
data file when every file's min/max range spans most of the key space. 
   - Does your ingestion batch contain just inserts, or updates as well? 
   - If it contains updates, do they touch only the latest partitions, or are 
they spread equally across all partitions? 
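
   To illustrate why key ordering matters for the min/max pruning mentioned 
above, here is a minimal sketch in plain Python. It is not Hudi's 
implementation; `build_ranges` and `candidate_files` are hypothetical helpers 
that only mimic the idea of comparing a key against per-file min/max ranges. 

```python
import random


def build_ranges(keys, files=4):
    """Split keys into equal chunks in arrival order; record each chunk's (min, max)."""
    size = len(keys) // files
    ranges = {}
    for i in range(files):
        chunk = keys[i * size:(i + 1) * size]
        ranges[f"file_{i}"] = (min(chunk), max(chunk))
    return ranges


def candidate_files(key, file_ranges):
    """Return the files whose [min, max] range could contain the key."""
    return [name for name, (lo, hi) in file_ranges.items() if lo <= key <= hi]


# Ordered keys (timestamp-like): files written in order cover disjoint ranges.
ordered_keys = [f"2021-03-08T09:{m:02d}:00" for m in range(40)]
ordered_ranges = build_ranges(ordered_keys)

# Random keys: each file's range tends to span most of the key space.
rng = random.Random(0)
random_keys = [f"{rng.getrandbits(128):032x}" for _ in range(40)]
random_ranges = build_ranges(random_keys)

ordered_hits = candidate_files(ordered_keys[5], ordered_ranges)
random_hits = candidate_files(random_keys[5], random_ranges)

print("ordered key must be checked in:", ordered_hits)
print("random key must be checked in:", random_hits)
```

   With ordered keys the lookup narrows to a single file, whereas with random 
keys the ranges typically overlap so much that few or no files are excluded. 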
   
   If your record keys are completely random, then using the SIMPLE index makes 
sense, since range-based filtering is unlikely to exclude any files. The default 
BLOOM index, by contrast, filters by min/max ranges, which may be unnecessary 
work here (this step reads the parquet footers to obtain the min/max ranges). 
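
   As a rough sketch of how that choice could be expressed in write options, 
assuming a PySpark writer: `hudi_index_options` is a hypothetical helper, 
while `hoodie.index.type` and `hoodie.bloom.index.prune.by.ranges` are Hudi 
write configs. Check the defaults against the Hudi version you are running. 

```python
def hudi_index_options(keys_are_random: bool) -> dict:
    """Pick index-related Hudi write options based on record key ordering.

    Hypothetical helper for illustration only, not part of Hudi.
    """
    if keys_are_random:
        # Random keys: range pruning is unlikely to exclude files, so use SIMPLE.
        return {"hoodie.index.type": "SIMPLE"}
    # Ordered keys: keep BLOOM and let it prune files by min/max key ranges.
    return {
        "hoodie.index.type": "BLOOM",
        "hoodie.bloom.index.prune.by.ranges": "true",
    }


opts = hudi_index_options(keys_are_random=True)
print(opts)

# Assumed usage with an existing DataFrame `df` and table path `base_path`
# (both hypothetical here), alongside your other Hudi write options:
# df.write.format("hudi").options(**opts).mode("append").save(base_path)
```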
   
   Once you clarify these details, I can look into it further. 
   


