ZiyueGuan created HUDI-1795:
-------------------------------

             Summary: allow ExternalSpillMap use accurate payload size rather 
than estimated
                 Key: HUDI-1795
                 URL: https://issues.apache.org/jira/browse/HUDI-1795
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Compaction
            Reporter: ZiyueGuan


Situation: In ExternalSpillMap, we need to limit the amount of data held in the 
in-memory map to avoid OOM. Currently, we do this by estimating the average 
payload size twice and computing total memory use as the average payload size 
multiplied by the number of payloads. The first estimate is taken when the first 
payload is inserted, and the second when 100 payloads are stored in memory. 
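
To make the failure mode concrete, here is a minimal sketch of the sampling 
scheme described above. It is not Hudi's actual code: the class name 
SampledSizeTracker, the caller-supplied sizeOf function, and the choice to 
measure only the payload being inserted at each sample point are assumptions 
made for illustration.

```java
import java.util.function.ToLongFunction;

/** Hypothetical illustration of sample-based size tracking; not Hudi's code. */
final class SampledSizeTracker<V> {
    // Assumed size estimator supplied by the caller.
    private final ToLongFunction<V> sizeOf;
    private long averageSize = 0;
    private long count = 0;

    SampledSizeTracker(ToLongFunction<V> sizeOf) {
        this.sizeOf = sizeOf;
    }

    void onInsert(V payload) {
        count++;
        // Sample only at the 1st and the 100th payload, as in the description.
        if (count == 1 || count == 100) {
            averageSize = sizeOf.applyAsLong(payload);
        }
    }

    /** Total memory use is taken as average payload size times payload count. */
    long estimatedBytes() {
        return averageSize * count;
    }
}
```

If the first 100 payloads are small and later ones are much larger, 
estimatedBytes() stays far below the real total, so the map would keep 
accepting payloads past its memory budget.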

Problem: If the second estimate is too low, the map holds more data than its 
memory budget allows and an OOM can occur.

Plan: Could we add a flag that controls whether an accurate size calculation is 
used instead of the estimate?

I currently have several ideas, but I am not sure which is best or whether 
there is a better one.
 # Estimate the size of each payload and store it alongside the value. When an 
update or remove happens, subtract the old size and add the new size if needed, 
so that the sum of all payload sizes is tracked precisely. This is the method I 
currently use in prod.
 # Do not store the size, but estimate the old payload again when it is removed. 
Compared to method one, it saves space at the cost of extra time. It may 
perform better when updates and removes are rare. I didn't adopt this because I 
profiled the ingestion process with Arthas, and the flame graph suggested that 
size estimation may be time consuming there. I'm not sure whether the same 
holds for compaction. My intuition is that HoodieRecordPayload has a quite 
simple structure.
 # A more accurate estimation method: evaluate the whole map when its size 
reaches 1, 100, 10,000 and one million entries. Underestimation is less likely 
with such a large amount of data.
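
The first idea in the list can be sketched roughly as follows. This is an 
illustration only, not the code running in prod and not Hudi's API: 
SizeTrackingMap, its Entry holder and the caller-supplied sizeOf function are 
hypothetical names, and spilling to disk is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical map that keeps a running sum of per-payload sizes. */
final class SizeTrackingMap<K, V> {
    /** Holds a value together with the size measured when it was stored. */
    private static final class Entry<V> {
        final V value;
        final long size;

        Entry(V value, long size) {
            this.value = value;
            this.size = size;
        }
    }

    private final Map<K, Entry<V>> inMemory = new HashMap<>();
    // Assumed size estimator supplied by the caller.
    private final ToLongFunction<V> sizeOf;
    private long totalBytes = 0;

    SizeTrackingMap(ToLongFunction<V> sizeOf) {
        this.sizeOf = sizeOf;
    }

    void put(K key, V value) {
        long newSize = sizeOf.applyAsLong(value);
        Entry<V> old = inMemory.put(key, new Entry<>(value, newSize));
        if (old != null) {
            totalBytes -= old.size; // update: drop the stored old size
        }
        totalBytes += newSize;
    }

    V remove(K key) {
        Entry<V> old = inMemory.remove(key);
        if (old == null) {
            return null;
        }
        totalBytes -= old.size; // no re-estimation needed on removal
        return old.value;
    }

    long totalBytes() {
        return totalBytes;
    }
}
```

The cost is one extra long per entry. The second idea would drop the size 
field and call sizeOf again on the old value inside put and remove.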

I look forward to any advice, suggestions or discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
