zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636850680


   Thanks for your feedback.
   
   > Do you mean other compression formats?
   
   Yes, but not only. This would also allow supporting other filesystems (like 
S3) where using `File` is not possible, and it would make much more sense to 
accept a `dyn Read` (in rusoto they use `ByteStream`, which implements `Read` 
and `AsyncRead`).
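   
   As a rough sketch of what I mean (the `JsonReader` name here is just 
illustrative, not the actual arrow type): a reader generic over `Read` accepts 
a local file and any other byte source through the same constructor.
   
   ```rust
   use std::fs::File;
   use std::io::{BufReader, Cursor, Read};
   
   // Illustrative reader that is generic over any `Read` source,
   // instead of being tied to `File`.
   struct JsonReader<R: Read> {
       reader: BufReader<R>,
   }
   
   impl<R: Read> JsonReader<R> {
       fn new(source: R) -> Self {
           JsonReader { reader: BufReader::new(source) }
       }
   }
   
   fn main() -> std::io::Result<()> {
       // Works with a local file...
       let from_file = JsonReader::new(File::open("data.json")?);
   
       // ...and equally with any other `Read` impl, e.g. an in-memory
       // buffer standing in for a network stream such as S3.
       let from_bytes = JsonReader::new(Cursor::new(b"{\"a\": 1}".to_vec()));
   
       let _ = (from_file, from_bytes);
       Ok(())
   }
   ```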
   
   > There have also been some changes on the arrow::csv side, such as allowing 
inference of multiple files, which might also be convenient to have in 
arrow::json
   
   That would be great. This would help when there are multiple files written 
without a fixed batch size (a mix of small and big files) and the data is 
scattered across them. If you have a JIRA issue for this I can take a 
look at it :)
   
   > I'm still pro returning the reader back to the start, or is there a 
performance impact in doing so? I wouldn't want to place the burden of seeking 
on the user, because I'd expect the common inference case to be getting the 
schema then reading the file.
   
   I agree that placing the burden on the user is a bad idea. However, there 
are situations where we simply can't seek back to the start (S3 is one 
example). Maybe we could have one implementation for `Seek + Read` that seeks 
back to the start, and one for `Read` only that doesn't. However... doing that 
on a single generic impl would need specialization, so more nightly 
dependencies. A sketch of a stable alternative is below.
   
   Not really sure :/
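   
   For what it's worth, here's a rough sketch of how two explicit entry points 
could avoid specialization on stable Rust (the function names are purely 
illustrative, and the `read_to_string` call stands in for actual schema 
inference):
   
   ```rust
   use std::io::{Read, Seek, SeekFrom};
   
   // Variant for seekable sources: infer, then rewind so the caller
   // can read the data without reopening the source.
   fn infer_schema_seekable<R: Read + Seek>(reader: &mut R) -> std::io::Result<()> {
       let mut buf = String::new();
       reader.read_to_string(&mut buf)?; // stand-in for real inference
       reader.seek(SeekFrom::Start(0))?; // seek back to the start
       Ok(())
   }
   
   // Variant for non-seekable sources (e.g. an S3 stream): the bytes
   // consumed during inference are gone, so no rewind is attempted.
   fn infer_schema_streaming<R: Read>(reader: &mut R) -> std::io::Result<()> {
       let mut buf = String::new();
       reader.read_to_string(&mut buf)?; // stand-in for real inference
       Ok(())
   }
   ```
   
   With that split, a caller on S3 knowingly uses the streaming variant, while 
the common file case keeps the rewind behaviour, and no specialization is 
needed because the two bounds live on separate functions.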

