eeroel commented on PR #37868: URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913672232
> Yes, I changed fake_size to -9999 and reran it and it still worked. But since I am not using S3 (only have access to GCS and ADLSgen2) perhaps it is just ignored entirely.

OK, thanks for confirming. I think it's worth checking this; I didn't add a test case for negative values in this PR. It could be a bug, or there may be validation somewhere along the chain that silently ignores the value.

> My original plan was to take a look at deltalake (delta-rs library) which already uses make_fragments(). Since the transaction log (get_add_actions()) has the actual file sizes then we could pass these to make_fragments() for some potential efficiency correct?

Yep, that was the motivation for this PR! I actually implemented this in deltalake for the filesystem implementation, so if you can use delta-rs then the optimization should already be applied: https://github.com/delta-io/delta-rs/blob/1b6c830aae4553d2a079a2bf42b024863fcbbb40/python/deltalake/table.py#L1035

To use this with the Arrow GCS implementation, I think the OpenFile method should be updated to check the size from FileInfo.
