tsreaper edited a comment on pull request #24:
URL: https://github.com/apache/flink-table-store/pull/24#issuecomment-1049521064


   Hi @shenzhu.
   
   On second thought, I believe we should still stick to your current 
implementation with a small change. There are several reasons:
   * After merging several files into one file, the resulting file should 
shrink noticeably. Its size may drop below the merging threshold again, which 
makes it eligible for another merge. I stated earlier that once a file is 
merged we'll never touch it again, but we can see that this is not true.
   * If we do not introduce this merging count constraint, we might merge an 
old large file with a small, freshly produced file several times until the old 
large file exceeds the size threshold. This will cause the write overhead to 
grow wildly (say the old large file is 7MB and it is merged 10 times; the 
write overhead is then 7MB * 10 = 70MB), and this is what we'd like to avoid.
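
To make the arithmetic in the second point concrete, here is a rough sketch (not code from this PR). It assumes each fresh file is 100KB, the size threshold is 8MB, and merging does not shrink the data; these numbers and all names are hypothetical and chosen only so that the 7MB file is merged 10 times.

```java
/** Illustrative only; class, method and parameter names are hypothetical. */
final class RepeatedMergeCost {

    /**
     * Merges the old file with one fresh file per round until the merged file
     * reaches the size threshold. Returns {number of merges, total KB written}.
     */
    static long[] simulate(long oldFileKb, long freshFileKb, long thresholdKb) {
        if (freshFileKb <= 0) {
            throw new IllegalArgumentException("freshFileKb must be positive");
        }
        long size = oldFileKb;
        long merges = 0;
        long writtenKb = 0;
        while (size < thresholdKb) {
            size += freshFileKb; // merged file = old file + fresh file (no shrinkage assumed)
            writtenKb += size;   // the whole merged file is rewritten on every merge
            merges++;
        }
        return new long[] {merges, writtenKb};
    }

    public static void main(String[] args) {
        long[] result = simulate(7000, 100, 8000);
        System.out.println(result[0] + " merges, " + result[1] + " KB written");
    }
}
```

Under these assumptions the old file is rewritten 10 times and about 75MB is written in total, of which 70MB is just the original 7MB copied again and again.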
   
   However, your current implementation still has some problems.
   * Let's begin with a 7MB file. After 29 snapshots (each snapshot produces 
one 1MB file) you'll merge these 30 files into a 7 + 29 = 36MB file. After 
another 29 snapshots you'll again merge these 30 files into a 36 + 29 = 65MB 
file. You can see that the first file keeps getting larger and the write 
overhead grows without bound. Our current implementation (with the size 
constraint only) does not have this problem.
   * You're missing the case when we merge the last bits of files.
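
The growth described in the first point can be reproduced with a few lines. This is only a sketch with hypothetical names; it assumes a merge is triggered purely by file count (30 files) and that merging does not shrink the data.

```java
/** Illustrative only; models a count-only merge trigger. Names are hypothetical. */
final class CountOnlyMerge {

    /**
     * Returns the size (in MB) of the merged file after each round, where a
     * round is (mergeCount - 1) new files followed by one merge of all files.
     */
    static long[] mergedSizesMb(long firstFileMb, long newFileMb, int mergeCount, int rounds) {
        long[] sizes = new long[rounds];
        long current = firstFileMb;
        for (int r = 0; r < rounds; r++) {
            // The existing file plus (mergeCount - 1) new files are merged into one.
            current += (long) (mergeCount - 1) * newFileMb;
            sizes[r] = current;
        }
        return sizes;
    }

    public static void main(String[] args) {
        for (long size : mergedSizesMb(7, 1, 30, 3)) {
            System.out.println(size + "MB");
        }
    }
}
```

With a 7MB first file and 1MB per snapshot this gives 36MB, 65MB, 94MB, and so on: every round rewrites everything written so far.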
   
   To avoid these problems, you should perform merging if either the size or 
the count constraint is met, just as @JingsongLi suggested.
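
A minimal sketch of what "either constraint" could look like, under my own assumptions and not the actual code in this PR: files already at or above the size threshold are left alone, the remaining files are accumulated in order, and a batch is emitted as soon as either threshold is reached. All names are hypothetical, and trailing files that meet neither threshold are simply left unmerged here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; names and thresholds are hypothetical. */
final class MergePlanner {

    /** A batch is merged when either the size or the count constraint is met. */
    static boolean shouldMerge(long totalSize, int fileCount, long sizeThreshold, int countThreshold) {
        return totalSize >= sizeThreshold || fileCount >= countThreshold;
    }

    /** Groups small files into merge batches; returns only the batches to merge. */
    static List<List<Long>> plan(List<Long> fileSizes, long sizeThreshold, int countThreshold) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> candidates = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            if (size >= sizeThreshold) {
                continue; // already large enough, so it is never rewritten
            }
            candidates.add(size);
            total += size;
            if (shouldMerge(total, candidates.size(), sizeThreshold, countThreshold)) {
                batches.add(candidates);
                candidates = new ArrayList<>();
                total = 0;
            }
        }
        return batches; // leftover candidates are not merged in this sketch
    }
}
```

Because a file that has reached the size threshold is skipped, the 36MB file from the example above would not be merged again, which bounds how often any byte is rewritten.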
   
   Regarding your questions: for Q1, we currently do not have such a 
mechanism. For Q2, you'll face this problem in the current implementation as 
well. That is OK because manifests are meaningful only when viewed as a whole.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


