shenzhu commented on pull request #24:
URL: https://github.com/apache/flink-table-store/pull/24#issuecomment-1049587025


   > Hi @shenzhu .
   > 
   > After a second thought, I believe we should still stick to your current 
implementation with a small change. There are several reasons:
   > 
   > * After merging several files into one file, the size of the resulting 
file should shrink evidently. The file size may drop below the merging 
threshold again and is eligible for another merge. I stated that after a file 
is merged we'll never touch it again, but we can see that it is not true.
   > * If we do not introduce this merging count constraint, we might merge an 
old large file and a small freshly produced file several times until the old 
large file exceeds the size threshold. This will cause the write overhead to 
grow wildly (let's say the old large file is 7MB and it is merged 10 times, 
then the write overhead will be 7 * 10MB) and this is what we'd like to avoid.
   > 
   > However your current implementation still has some problems.
   > 
   > * Let's begin with a 7MB file. After 29 snapshots (each snapshot produce 
one 1MB file) you'll merge these 30 files into a 7 + 29 = 36MB file. After 
another 29 snapshots you'll again merge these 30 files into a 36 + 29 = 65MB 
file. You can see that the first file will become larger and larger and the 
write overhead will grow unboundedly. Our current implementation (with size 
constraint only) does not have this problem.
   > * You're missing the case when we merge the last bits of files. That said, 
these files should not be merged because they do not meet any of the 
constraints. You should remove this merging and only write the new entries into 
a manifest file.
   > 
   > To avoid these problems, you should perform merging if either size or 
count constraint is met. Just like @JingsongLi suggested.
   > 
   > About your questions. For Q1 we currently do not have such a mechanism. 
For Q2 you'll also face this problem in the current implementation. It is OK 
because manifests are meaningful only if they're viewed as a whole.
   
   Hey @tsreaper , got it, thanks for your explanation!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to