[ 
https://issues.apache.org/jira/browse/HUDI-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZiyueGuan closed HUDI-1795.
---------------------------
    Resolution: Duplicate

> allow ExternalSpillMap use accurate payload size rather than estimated
> ----------------------------------------------------------------------
>
>                 Key: HUDI-1795
>                 URL: https://issues.apache.org/jira/browse/HUDI-1795
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: ZiyueGuan
>            Priority: Major
>
> Situation: In ExternalSpillMap, we need to control the amount of data in the 
> in-memory map to avoid OOM. Currently, we evaluate this by estimating the 
> average size of each payload twice, and compute the total memory use by 
> multiplying the average payload size by the number of payloads. The first 
> sample is taken when the first payload is inserted; the second when there are 
> 100 payloads stored in memory. 
> Problem: If the second estimate is too low, the size is underestimated and an 
> OOM can happen.
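A minimal sketch of the sampling scheme described above, under the assumption that the class, method, and field names below are hypothetical (not Hudi's actual implementation): the per-payload size is sampled at insert #1 and re-sampled at insert #100, and total memory use is approximated as average size times payload count, which can underestimate when later payloads are larger than the samples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of sampling-based size tracking: sample the payload
// size at inserts #1 and #100, then approximate total memory use as
// avgPayloadSize * payloadCount.
class SampledSizeMap<K, V> {
    private final Map<K, V> inMemory = new HashMap<>();
    private final ToLongFunction<V> sizeEstimator; // hypothetical estimator
    private long avgPayloadSize = 0;

    SampledSizeMap(ToLongFunction<V> sizeEstimator) {
        this.sizeEstimator = sizeEstimator;
    }

    void put(K key, V value) {
        inMemory.put(key, value);
        int n = inMemory.size();
        if (n == 1 || n == 100) { // the two sampling points
            avgPayloadSize = sizeEstimator.applyAsLong(value);
        }
    }

    // May under-estimate if payloads after the second sample are larger.
    long estimatedMemoryUse() {
        return avgPayloadSize * inMemory.size();
    }
}
```

If the payloads inserted after the 100th are much larger than the two sampled ones, `estimatedMemoryUse()` stays low while real heap use grows, which is exactly the OOM scenario above.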
> Plan: Could we add a flag to control whether we want an accurate evaluation?
> Currently, I have several ideas, but I am not sure which one is best or 
> whether there is a better approach.
>  # Estimate each payload and store the payload's length alongside its value. 
> When an update or remove happens, subtract the old length and add the new one 
> if needed, so that we keep the sum of all payload sizes precise. This is the 
> method I currently use in prod.
>  # Do not store the length, but evaluate the old payload again when it is 
> popped. Compared to method one, it trades time for space. It may perform 
> better when updates and removals are rare. I didn't adopt this because when I 
> profiled the ingestion process with Arthas, the flame graph showed that size 
> estimation can be time-consuming. But I'm not sure whether that holds for 
> compaction; intuitively, HoodieRecordPayload has a quite simple structure.
>  # I also have a more accurate estimation method: evaluate the whole map when 
> its size reaches 1, 100, 10,000, and 1,000,000 entries. With such large 
> sample sizes, underestimation is less likely.
> Looking forward to any advice, suggestions, or discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
