[
https://issues.apache.org/jira/browse/HUDI-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ZiyueGuan closed HUDI-1795.
---------------------------
Resolution: Duplicate
> allow ExternalSpillMap use accurate payload size rather than estimated
> ----------------------------------------------------------------------
>
> Key: HUDI-1795
> URL: https://issues.apache.org/jira/browse/HUDI-1795
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Compaction
> Reporter: ZiyueGuan
> Priority: Major
>
> Situation: In ExternalSpillMap, we need to bound the amount of data held in
> the in-memory map to avoid an OOM. Currently we estimate the average size of
> a payload twice, and approximate total memory use as the average payload
> size multiplied by the number of payloads. The first estimate is taken when
> the first payload is inserted; the second when 100 payloads are stored in
> memory.
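For context, the two-point sampling described above can be sketched roughly as follows. This is a minimal illustration, not Hudi's actual code; the class and method names are hypothetical, and the per-payload estimator is a crude placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the average payload size is (re)sampled at the 1st and
// 100th insert, and total memory use is approximated as avg * count.
class SampledSizeTracker<K, V> {
    private final Map<K, V> inMemoryMap = new HashMap<>();
    private long avgPayloadSize = 0;

    // Placeholder for a real object-size estimator.
    private long estimateSize(V payload) {
        return payload.toString().length();
    }

    void put(K key, V value) {
        inMemoryMap.put(key, value);
        int n = inMemoryMap.size();
        // Re-sample the average only at the 1st and 100th entry.
        if (n == 1 || n == 100) {
            avgPayloadSize = estimateSize(value);
        }
    }

    long estimatedMemoryUse() {
        return avgPayloadSize * inMemoryMap.size();
    }
}
```

If the payload sampled at insert #100 happens to be smaller than average, every later payload is under-counted, which is the OOM risk described below.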
> Problem: if the second estimate is too low, an OOM can still occur.
> Plan: could we add a flag that switches to an accurate evaluation? I have
> several ideas, but I am not sure which is best, or whether there is a better
> one:
> # Estimate each payload and store its length alongside its value. On an
> update or removal, subtract the old length and add the new one if needed, so
> the sum of all payload sizes stays precise. This is the method I currently
> use in production.
> # Do not store the length, but re-estimate the old payload when it is
> popped. This trades space for time compared to method one; it may perform
> better when updates and removals are rare. I did not adopt it because, when
> I profiled the ingestion process with arthas, size estimation showed up as
> time-consuming in the flame graph. I am not sure whether the same holds for
> compaction; intuitively, HoodieRecordPayload has a quite simple structure.
> # A more accurate alternative: re-estimate the whole map when its size
> reaches 1, 100, 10000, and one million. With samples taken at these larger
> sizes, underestimation is less likely.
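Idea #1 (the precise running sum) could look roughly like the sketch below. All names are hypothetical and the size estimator is a placeholder; this is not the actual ExternalSpillMap implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of idea #1: store each payload's estimated length next to its value
// and keep a running total, diffing out the old length on update or removal.
class PreciseSizeMap<K, V> {
    private static final class Entry<V> {
        final V value;
        final long size;
        Entry(V value, long size) { this.value = value; this.size = size; }
    }

    private final Map<K, Entry<V>> map = new HashMap<>();
    private long totalPayloadSize = 0;

    // Placeholder per-payload estimator.
    private long sizeOf(V payload) {
        return payload.toString().length();
    }

    void put(K key, V value) {
        long newSize = sizeOf(value);
        Entry<V> old = map.put(key, new Entry<>(value, newSize));
        if (old != null) {
            totalPayloadSize -= old.size; // update: diff out the old length
        }
        totalPayloadSize += newSize;
    }

    void remove(K key) {
        Entry<V> old = map.remove(key);
        if (old != null) {
            totalPayloadSize -= old.size;
        }
    }

    long totalPayloadSize() {
        return totalPayloadSize;
    }
}
```

The cost is one extra long per entry; in exchange, the tracked total is exact with respect to the per-payload estimator.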
> Looking forward to any advice, suggestions, or discussion.
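Idea #3 (re-estimating at size thresholds) might be sketched like this, again with illustrative names and a placeholder estimator rather than the real Hudi API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of idea #3: re-run a whole-map average estimate whenever the map
// size crosses one of a few thresholds (1, 100, 10000, 1e6), so an early
// underestimate is corrected as the map grows.
class ThresholdedEstimator<K, V> {
    private static final long[] THRESHOLDS = {1, 100, 10_000, 1_000_000};
    private final Map<K, V> map = new HashMap<>();
    private long avgPayloadSize = 0;

    // Average of placeholder per-payload estimates over the whole map.
    private long estimateWholeMap() {
        long sum = 0;
        for (V v : map.values()) {
            sum += v.toString().length();
        }
        return map.isEmpty() ? 0 : sum / map.size();
    }

    void put(K key, V value) {
        map.put(key, value);
        // Re-sample only when the size hits one of the thresholds.
        if (Arrays.binarySearch(THRESHOLDS, map.size()) >= 0) {
            avgPayloadSize = estimateWholeMap();
        }
    }

    long estimatedMemoryUse() {
        return avgPayloadSize * map.size();
    }
}
```

Compared with idea #1, this keeps the per-insert cost near zero but still tolerates some drift between re-estimation points.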
--
This message was sent by Atlassian Jira
(v8.3.4#803005)