zeapo commented on pull request #7309: URL: https://github.com/apache/arrow/pull/7309#issuecomment-636850680
Thanks for your feedback.

> Do you mean other compression formats?

Yes, but not only. This would also allow other filesystems (like S3) where using `File` is not possible, so it would make much more sense to take a `dyn Read` (in rusoto they use `ByteStream`, which implements `Read` and `AsyncRead`). A rough sketch of what that could look like is below.

> There have also been some changes on the arrow::csv side, such as allowing inference of multiple files, which might also be convenient to have in arrow::json

That would be great. It would help when there are multiple small files written without a fixed batch size (a mix of small and big files), so the data ends up scattered across many files. If you have a JIRA issue for this I can take a look at it :)

> I'm still pro returning the reader back to the start, or is there a performance impact in doing so? I wouldn't want to place the burden of seeking on the user, because I'd expect the common inference case to be getting the schema then reading the file.

I agree that placing the burden on the user is a bad idea. However, there are situations where we simply can't seek back to the start (S3 is one example). Maybe a specific implementation for `Seek + Read` that seeks back to the start, and one for `Read` only that doesn't? However... that would need specialization, so more nightly dependencies. Not really sure :/ (A specialization-free alternative is sketched at the end of this comment.)
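To make the `Read` point concrete, here's a minimal sketch of a reader that is generic over any `Read` source instead of taking a `File`. The `JsonReader` name and methods are illustrative, not the actual arrow::json API:

```rust
use std::io::{BufReader, Read};

/// Illustrative reader that accepts any `Read` source rather than a
/// concrete `File`; a `Box<dyn Read>` satisfies the same bound.
pub struct JsonReader<R: Read> {
    reader: BufReader<R>,
}

impl<R: Read> JsonReader<R> {
    pub fn new(source: R) -> Self {
        JsonReader {
            reader: BufReader::new(source),
        }
    }

    /// Stand-in for record decoding: read everything into a string.
    pub fn read_all(&mut self) -> std::io::Result<String> {
        let mut buf = String::new();
        self.reader.read_to_string(&mut buf)?;
        Ok(buf)
    }
}

fn main() -> std::io::Result<()> {
    // An in-memory buffer stands in for any non-file source, e.g. a rusoto
    // `ByteStream` (which implements `Read`) or a gzip decoder.
    let bytes: &[u8] = br#"{"a": 1}"#;
    let mut reader = JsonReader::new(bytes);
    println!("{}", reader.read_all()?);

    // A local file goes through the same constructor unchanged:
    // let reader = JsonReader::new(std::fs::File::open("data.json")?);
    Ok(())
}
```

Anything that implements `Read` plugs in here without touching the reader itself, whether it's a local file, a decompressing wrapper, or an S3 response body.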

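On the `Seek` question: one shape that avoids specialization entirely would be two separate entry points with different trait bounds, where the seekable variant rewinds for you and the plain `Read` variant leaves the stream where it is. A minimal sketch, with the inference body as a stand-in that just counts newline-delimited records rather than building a real `Schema`:

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

/// Stand-in for schema inference over any `Read` source: here it just
/// counts newline-delimited records instead of inferring a schema.
fn infer<R: Read>(source: &mut R) -> io::Result<usize> {
    let reader = BufReader::new(source);
    let mut records = 0;
    for line in reader.lines() {
        line?;
        records += 1;
    }
    Ok(records)
}

/// Variant for seekable sources (local files): infer, then seek back to
/// the start so the caller can read records immediately. No specialization
/// needed, just a second function with a stricter bound.
fn infer_and_rewind<R: Read + Seek>(source: &mut R) -> io::Result<usize> {
    let records = infer(&mut *source)?;
    source.seek(SeekFrom::Start(0))?;
    Ok(records)
}

fn main() -> io::Result<()> {
    // An in-memory cursor stands in for a local file; it implements `Seek`.
    let mut seekable = Cursor::new("{\"a\":1}\n{\"a\":2}\n");
    assert_eq!(infer_and_rewind(&mut seekable)?, 2);
    // `seekable` is back at the start here, ready to be read again.

    // A non-seekable stream (e.g. an S3 body) uses `infer` directly; the
    // caller has to re-open or re-request the data afterwards.
    let mut stream: &[u8] = b"{\"a\":1}\n";
    assert_eq!(infer(&mut stream)?, 1);
    Ok(())
}
```

The downside is a slightly wider API surface, but it stays on stable Rust and makes the "can this source be rewound?" question explicit at the call site.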