eeroel commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913672232

   > Yes, I changed fake_size to -9999 and reran it and it still worked. But 
since I am not using S3 (only have access to GCS and ADLSgen2) perhaps it is 
just ignored entirely.
   
   OK, thanks for confirming. It's good that you checked; I didn't add a 
test case for negative values in this PR. It could be a bug, or there may be 
validation somewhere along the chain that silently ignores the value.
   
   > My original plan was to take a look at deltalake (the delta-rs library), 
which already uses make_fragments(). Since the transaction log 
(get_add_actions()) has the actual file sizes, we could pass these to 
make_fragments() for a potential efficiency gain, correct?
   
   Yep, that was the motivation for this PR! I actually implemented this in 
deltalake's filesystem implementation, so if you can use that, the 
optimization should already be applied: 
https://github.com/delta-io/delta-rs/blob/1b6c830aae4553d2a079a2bf42b024863fcbbb40/python/deltalake/table.py#L1035
   
   To use this with the Arrow GCS implementation, I think the OpenFile method 
should be updated to read the size from FileInfo instead of fetching it again.
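   For reference, here is a minimal sketch of the Python-side usage this PR 
enables (the `file_size` keyword on `make_fragment`). It uses a local 
filesystem and a file size obtained via `os.path.getsize` purely for 
illustration; in the delta-rs case the size would come from the transaction 
log's add actions instead. Assumes a pyarrow version that includes this PR.

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.fs as pafs
   import pyarrow.parquet as pq

   # Write a small Parquet file so the sketch is self-contained.
   tmpdir = tempfile.mkdtemp()
   path = os.path.join(tmpdir, "part-0.parquet")
   pq.write_table(pa.table({"x": [1, 2, 3]}), path)

   # In delta-rs this size would be read from the transaction log
   # (get_add_actions()) rather than from the filesystem.
   known_size = os.path.getsize(path)

   fmt = ds.ParquetFileFormat()
   fs = pafs.LocalFileSystem()

   # Passing file_size lets the dataset layer skip a per-file
   # metadata request against the filesystem.
   fragment = fmt.make_fragment(path, filesystem=fs, file_size=known_size)

   dataset = ds.FileSystemDataset(
       [fragment],
       schema=pa.schema([("x", pa.int64())]),
       format=fmt,
       filesystem=fs,
   )
   print(dataset.to_table().num_rows)  # -> 3
   ```

   The savings scale with the number of files: one avoided HEAD/stat call per 
fragment, which matters most on object stores like GCS or ADLS.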

