ZiyueGuan created HUDI-1795:
-------------------------------
Summary: allow ExternalSpillMap to use accurate payload size rather
than an estimate
Key: HUDI-1795
URL: https://issues.apache.org/jira/browse/HUDI-1795
Project: Apache Hudi
Issue Type: Improvement
Components: Compaction
Reporter: ZiyueGuan
Situation: In ExternalSpillMap, we need to bound the amount of data held in the
in-memory map to avoid OOM. Currently, we do this by estimating the average size
of a payload twice, and computing total memory use as the average payload size
multiplied by the number of payloads. The first estimate is taken when the first
payload is inserted; the second when 100 payloads are stored in memory.
Problem: If the second estimate undershoots the real payload size, an OOM can
occur.
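To make the failure mode concrete, here is a minimal sketch (not Hudi code; all names are illustrative) of the average-size accounting described above: memory use is tracked as average payload size times count, with the average re-sampled at fixed record counts. If the sample underestimates the true average, the budget check passes while actual memory use exceeds the budget.

```java
// Hypothetical sketch of estimate-based memory accounting.
// avgPayloadSize comes from sampling a single payload, so a small
// sample lets the map admit more data than the budget allows.
class EstimatedMemoryTracker {
    private long avgPayloadSize; // bytes, from sampling one payload
    private long count = 0;

    EstimatedMemoryTracker(long firstSampleBytes) {
        this.avgPayloadSize = firstSampleBytes;
    }

    void onInsert() {
        count++;
    }

    // Re-sample the average, e.g. when count reaches 100.
    void resample(long sampledBytes) {
        this.avgPayloadSize = sampledBytes;
    }

    long estimatedBytes() {
        return avgPayloadSize * count;
    }

    // True if the *estimate* says we still fit in the budget;
    // the real footprint may already be larger.
    boolean fits(long budgetBytes) {
        return estimatedBytes() <= budgetBytes;
    }
}
```

For example, if the sampled payload is 100 bytes but real payloads average 200 bytes, after 10 inserts the tracker reports 1000 bytes and accepts a 1500-byte budget, while actual usage is 2000 bytes.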
Plan: Could we add a flag to switch to accurate accounting? I have several
ideas, but I'm not sure which is best or whether there is a better one.
# Estimate each payload and store its length alongside its value. On update or
remove, subtract the old length and add the new one if needed, so the sum of
all payload sizes stays exact. This is the method I currently use in
production.
# Do not store the length; instead, re-estimate the old payload when it is
evicted. Compared to method one, this trades time for space, and may perform
better when updates and removes are rare. I didn't adopt it because, when
profiling the ingestion process with Arthas, I found size estimation showing up
as time-consuming in the flame graph. I'm not sure whether that holds for
compaction; intuitively, HoodieRecordPayload has a fairly simple structure.
# A more accurate estimation scheme: evaluate the whole map when its size
reaches 1, 100, 10,000, and 1,000,000 entries. With that much data,
underestimation becomes less likely.
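Method one above can be sketched as a map wrapper that charges each payload's byte length on insert and refunds it on update or removal, keeping an exact running total. This is a hypothetical illustration, not Hudi's implementation; `SizeTrackingMap` and the caller-supplied size are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of exact byte accounting (method one): each value is
// stored with the length charged for it, and a running total is adjusted
// with a diff on every put/remove, so the sum stays precise.
class SizeTrackingMap<K, V> {
    private static final class Sized<V> {
        final V value;
        final long bytes;
        Sized(V value, long bytes) {
            this.value = value;
            this.bytes = bytes;
        }
    }

    private final Map<K, Sized<V>> map = new HashMap<>();
    private long totalBytes = 0L;

    // sizeBytes stands in for a per-payload size estimate computed once
    // at insertion time (e.g. from the serialized form).
    public void put(K key, V value, long sizeBytes) {
        Sized<V> old = map.put(key, new Sized<>(value, sizeBytes));
        if (old != null) {
            totalBytes -= old.bytes; // diff out the old length on update
        }
        totalBytes += sizeBytes;     // charge the new length
    }

    public V remove(K key) {
        Sized<V> old = map.remove(key);
        if (old == null) {
            return null;
        }
        totalBytes -= old.bytes;     // refund the removed payload's length
        return old.value;
    }

    public long totalBytes() {
        return totalBytes;
    }
}
```

The extra `long` per entry is the space cost this method pays to avoid re-estimating payloads on eviction, which is exactly the trade-off method two reverses.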
Looking forward to any advice, suggestions, or discussion.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)