kszucs commented on PR #47090: URL: https://github.com/apache/arrow/pull/47090#issuecomment-3124574602

It depends on the value of the `length limit`. The current `size limit` is 1MB, calculated after encoding, while the CDC default size range (256KB to 1MB) is calculated on the logical values before encoding. CDC chunking is applied before the size limit check, so with the default parameters a data page write should be triggered before the size limit check fires. If the size limit is set to a smaller value, there will be two data pages: a larger one cut at the size limit and a smaller one cut at the CDC boundary, because the CDC hash is not reset when the size limit is triggered.

So basically there are two cases:

a) The size limits are bigger than the CDC limits: pages are cut earlier than the size limits would cut them, i.e. `page1 [cdc-cut] page2 [cdc-cut] page3`.

b) The size limits are smaller than the CDC limits: the CDC cuts still happen, but the pages are additionally split at the size limits, i.e. `page1 [cdc-cut] page2/a [size-cut] page2/b [cdc-cut] page3`.

So in theory it shouldn't affect CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator
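
A minimal sketch of the kind of check described above, assuming a recent pyarrow build that exposes the CDC writer option as `use_content_defined_chunking` (the exact parameter name and accepted values are an assumption here, not confirmed by this PR). It writes the same data twice, once with the default ~1MB page size limit (case a) and once with a page size limit smaller than the CDC chunk range (case b); the resulting files could then be fed to the dedupe estimator linked above to compare deduplication effectiveness.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Some repetitive string data so that data pages fill up quickly.
table = pa.table({"values": [f"row-{i:08d}" * 16 for i in range(200_000)]})

# Case (a): default size limit (~1MB), larger than the 256KB-1MB CDC range,
# so pages should be cut at the CDC boundaries only.
# NOTE: `use_content_defined_chunking` is assumed to be the pyarrow option name.
pq.write_table(table, "cdc_default.parquet", use_content_defined_chunking=True)

# Case (b): size limit (64KB) smaller than the CDC range, so the CDC cuts
# still happen but pages are additionally split at the size limit.
pq.write_table(
    table,
    "cdc_small_pages.parquet",
    use_content_defined_chunking=True,
    data_page_size=64 * 1024,
)

# Quick sanity check on the outputs; page-level dedup would be measured
# separately with the dedupe estimator tool.
for path in ("cdc_default.parquet", "cdc_small_pages.parquet"):
    print(path, os.path.getsize(path), "bytes")
```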
It depends on the value of the `length limit`. The current `size limit` is 1MB calculated after encoding while CDC default size range calculated on the logical values before encoding is between 256KB and 1MB. CDC chunking is applied before the size limit check, so using the default parameters should trigger a data page write before the size limit check. If the size limit is set to a smaller value, then there will be two data pages, a larger one cut at the size limit, and a smaller one cut at the CDC boundary because the CDC hash is not being reset if size limit is triggered. So basically there are two cases: a) size limits are bigger than the cdc limits, then the pages are cut earlier than the size limits would happen: `page1 [cdc-cut] page2 [cdc-cut] page3` b) size limits are smaller than the cdc limits, then the previous cdc cuts will happen nonetheless, but the pages are going to be split according to the size limits `page1 [cdc-cut] page2/a [size-cut] page2/b [cdc-cut] page3` So in theory it shouldn't affect the CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org