kszucs commented on PR #47090:
URL: https://github.com/apache/arrow/pull/47090#issuecomment-3124574602

   It depends on the value of the `length limit`. The current `size limit` is 
1MB and is calculated after encoding, while the CDC default size range is 
calculated on the logical values before encoding and lies between 256KB and 
1MB. CDC chunking is applied before the size limit check, so with the default 
parameters a data page write should be triggered before the size limit check. 
If the size limit is set to a smaller value, then there will be two data 
pages: a larger one cut at the size limit and a smaller one cut at the CDC 
boundary, because the CDC hash is not reset when the size limit is triggered. 
So basically there are two cases (sketched in code below):
   
   a) the size limit is larger than the CDC limits, so the pages are cut 
earlier than the size limit would ever trigger:
   `page1 [cdc-cut] page2 [cdc-cut] page3`
   b) the size limit is smaller than the CDC limits, so the earlier CDC cuts 
still happen, but the pages in between are additionally split according to the 
size limit:
   `page1 [cdc-cut] page2/a [size-cut] page2/b [cdc-cut] page3`
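
   To make the ordering concrete, here is a minimal Python sketch (not the 
actual writer code) that simulates the two checks. The CDC boundaries are 
passed in precomputed instead of coming from the rolling hash, and the encoded 
size is approximated by the raw value sizes, but the check order and the fact 
that a size-cut does not reset the CDC state follow the description above:

```python
# Toy simulation of how CDC cuts and the data page size limit interact.
# NOT the Arrow writer code: CDC boundaries are precomputed here instead of
# being derived from the rolling hash over the logical values.

def cut_pages(value_sizes, cdc_boundaries, size_limit):
    """Return (number_of_values, reason) per written data page."""
    pages = []
    page_len = 0    # values buffered in the current page
    page_bytes = 0  # approximate encoded size of the current page

    for i, size in enumerate(value_sizes):
        page_len += 1
        page_bytes += size

        # The CDC check runs first: with the defaults (256KB-1MB chunks vs.
        # a 1MB size limit) it usually fires before the size limit.
        if i in cdc_boundaries:
            pages.append((page_len, "cdc-cut"))
            page_len = page_bytes = 0
        # The size limit check runs second and does not reset the CDC state,
        # so the next cdc-cut stays where it would have been anyway.
        elif page_bytes >= size_limit:
            pages.append((page_len, "size-cut"))
            page_len = page_bytes = 0

    if page_len:
        pages.append((page_len, "last-page"))
    return pages


# Case a) size limit larger than the CDC chunks -> only cdc-cuts.
print(cut_pages([100] * 30, cdc_boundaries={9, 19}, size_limit=5000))
# Case b) size limit smaller than the CDC chunks -> page2 split in two.
print(cut_pages([100] * 30, cdc_boundaries={9, 24}, size_limit=1000))
```

   The first call reproduces case a) with only cdc-cuts; the second reproduces 
case b) with an extra size-cut inside the second CDC chunk.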
   
   So in theory it shouldn't affect the CDC's effectiveness. We can also check 
this before merging using 
https://github.com/huggingface/dataset-dedupe-estimator
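
   For the estimator check, something along these lines could generate the 
files to compare; the `use_content_defined_chunking` and `data_page_size` 
option names are what I'd expect from recent pyarrow, so treat this as a rough 
sketch rather than the exact invocation:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The same logical data written twice: once with the default ~1MB data page
# size limit (case a above) and once with a much smaller limit (case b).
table = pa.table({"x": list(range(2_000_000))})

pq.write_table(table, "default_limit.parquet",
               use_content_defined_chunking=True)

pq.write_table(table, "small_limit.parquet",
               use_content_defined_chunking=True,
               data_page_size=64 * 1024)
```

   Running the dedupe estimator on these files (plus lightly edited copies of 
the same table) should show comparable deduplication ratios if the size limit 
really doesn't interfere with the CDC boundaries.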
   