rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-704796994


   The need for an intermediate layer when reading a parquet file was 
discussed with @alamb on 
[JIRA](https://issues.apache.org/jira/browse/ARROW-10135).
   
   The highlights of the current implementation:
   - The public API has changed, but keeps working for `File` and `Path` thanks 
to the corresponding trait implementations. `Cursor` cannot be used any more 
because it requires copying the data whenever it is passed around with 
`clone()` (this was already the case in the implementation of `TryClone` for 
`Cursor<Vec<u8>>`).
   - I have added a custom cursor type (`SliceableCursor`) that can produce 
cursor slices without cloning the underlying data. This can be used to 
read in-memory files. I guess it could be made more generic, but this would be 
for convenience only and I find it simple and clear as is.
   - I have separated the implementations (`SerializedFileReader`, 
`SerializedRowGroupReader`...) from the traits (`FileReader`, 
`RowGroupReader`...) for more clarity. I know that this is not how the code 
base is structured in the rest of the project, but I tend to get lost in these 
huge files with millions of struct/trait/impl blocks. I'm very much open to 
suggestions on this point!
   - There is nothing about async/parallelism yet; I still have to think about 
it a little bit.
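
To illustrate the idea behind the custom cursor, here is a minimal sketch of a cursor whose slices share one buffer instead of copying it. This is not the code from the PR: the name `SharedCursor`, its fields, and the `slice(offset, length)` signature are assumptions made for the example, and it relies on `Arc<Vec<u8>>` so that cloning or slicing only bumps a reference count.

```rust
use std::io::{self, Read};
use std::sync::Arc;

/// Hypothetical stand-in for the cursor described above (not the PR's type).
/// It is a window `[start, end)` over a shared, immutable byte buffer.
#[derive(Clone)]
pub struct SharedCursor {
    data: Arc<Vec<u8>>, // shared buffer, never copied by `clone()` or `slice()`
    start: usize,       // first byte of the window (absolute index)
    end: usize,         // one past the last byte of the window (absolute index)
    pos: usize,         // current read position (absolute index)
}

impl SharedCursor {
    /// Takes ownership of the bytes once; later slices only share them.
    pub fn new(data: Vec<u8>) -> Self {
        let end = data.len();
        SharedCursor { data: Arc::new(data), start: 0, end, pos: 0 }
    }

    /// Returns a new cursor over `length` bytes starting `offset` bytes into
    /// the current window. Only the `Arc` is cloned, not the bytes.
    pub fn slice(&self, offset: usize, length: usize) -> io::Result<Self> {
        let start = self.start.checked_add(offset);
        let end = start.and_then(|s| s.checked_add(length));
        match (start, end) {
            (Some(start), Some(end)) if end <= self.end => Ok(SharedCursor {
                data: Arc::clone(&self.data),
                start,
                end,
                pos: start,
            }),
            _ => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "slice is out of bounds",
            )),
        }
    }
}

impl Read for SharedCursor {
    /// Copies bytes from the window into `buf` and advances the position.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = &self.data[self.pos..self.end];
        let n = remaining.len().min(buf.len());
        buf[..n].copy_from_slice(&remaining[..n]);
        self.pos += n;
        Ok(n)
    }
}
```

With this sketch, `SharedCursor::new(bytes).slice(2, 3)` yields a reader over bytes 2 to 4 of the buffer, whereas cloning a `Cursor<Vec<u8>>` would duplicate the whole `Vec`.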
   
   @alamb: can you take a look at the new `ChunckReader` trait and how it is 
integrated into the rest of the reader? What do you think about it? 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

