tsreaper edited a comment on pull request #24: URL: https://github.com/apache/flink-table-store/pull/24#issuecomment-1049521064
Hi @shenzhu. On second thought, I believe we should still stick to your current implementation, with a small change. There are several reasons:

* After merging several files into one file, the size of the resulting file should shrink noticeably. The file size may drop below the merging threshold again, making the file eligible for another merge. I stated that after a file is merged we'll never touch it again, but we can see that this is not true.
* If we do not introduce this merging count constraint, we might merge an old large file with a small freshly produced file several times until the old large file exceeds the size threshold. This will cause the write overhead to grow wildly (say the old large file is 7MB and it is merged 10 times, then the write overhead is 7 * 10 = 70MB), which is what we'd like to avoid.

However, your current implementation still has some problems:

* Let's begin with a 7MB file. After 29 snapshots (each snapshot producing one 1MB file) you'll merge these 30 files into a 7 + 29 = 36MB file. After another 29 snapshots you'll again merge these 30 files into a 36 + 29 = 65MB file. You can see that the first file becomes larger and larger, and the write overhead grows unboundedly. Our current implementation (with the size constraint only) does not have this problem.
* You're missing the case where we merge the last bits of files.

To avoid these problems, you should perform merging if either the size or the count constraint is met, just like @JingsongLi suggested (see the sketch at the end of this comment).

About your questions: for Q1 we currently do not have such a mechanism. For Q2, you'll also face this problem in the current implementation. It is OK because manifests are meaningful only when viewed as a whole.
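To make the "either size or count" policy concrete, here is a minimal, hypothetical sketch in Java. It is not the actual flink-table-store code; the names `ManifestMergeSketch`, `ManifestMeta`, `targetSizeBytes` and `minMergeCount` are illustrative assumptions. The point is that only small manifest files are merge candidates (so an already-large file is never rewritten again), and a merge is triggered as soon as either the accumulated size reaches the target or the number of small files reaches the minimum merge count.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of the suggested policy, assuming illustrative names and
 * thresholds: merge the accumulated small manifest files as soon as EITHER
 * the size constraint OR the count constraint is met.
 */
public class ManifestMergeSketch {

    /** Illustrative stand-in for a manifest file's metadata. */
    static class ManifestMeta {
        final String name;
        final long sizeInBytes;

        ManifestMeta(String name, long sizeInBytes) {
            this.name = name;
            this.sizeInBytes = sizeInBytes;
        }
    }

    private final long targetSizeBytes; // e.g. 8 MB
    private final int minMergeCount;    // e.g. 30

    ManifestMergeSketch(long targetSizeBytes, int minMergeCount) {
        this.targetSizeBytes = targetSizeBytes;
        this.minMergeCount = minMergeCount;
    }

    /**
     * Scans the manifest list and returns groups of small files to merge.
     * Files already at or above the target size are skipped, so an old large
     * file is never merged again and again (avoiding unbounded write overhead).
     */
    List<List<ManifestMeta>> planMerges(List<ManifestMeta> manifests) {
        List<List<ManifestMeta>> plans = new ArrayList<>();
        List<ManifestMeta> candidates = new ArrayList<>();
        long candidateSize = 0;

        for (ManifestMeta manifest : manifests) {
            if (manifest.sizeInBytes >= targetSizeBytes) {
                // Large files are never merge inputs again.
                continue;
            }
            candidates.add(manifest);
            candidateSize += manifest.sizeInBytes;

            // Merge when EITHER constraint is met: enough accumulated bytes to
            // produce a reasonably sized file, or enough small files piled up.
            if (candidateSize >= targetSizeBytes || candidates.size() >= minMergeCount) {
                plans.add(new ArrayList<>(candidates));
                candidates.clear();
                candidateSize = 0;
            }
        }

        // Remaining small files (the "last bits") stay unmerged here until a
        // future snapshot pushes them over one of the two thresholds.
        return plans;
    }
}
```

For example, with `targetSizeBytes = 8MB` and `minMergeCount = 30`, thirty 1MB files would be merged once the size threshold is hit, and the resulting file (being larger than 8MB) would never be picked up as a merge input again.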
