[GitHub] [arrow-datafusion] tustvold commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

GitBox Tue, 08 Mar 2022 10:52:25 -0800


tustvold commented on pull request #1905:
URL: 
https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1062096897



   I intend to pick up #1617 tomorrow and so I might have some more thoughts 
then, but some initial thoughts:
   
   * `ObjectStoreWrapper` has perhaps surprising behaviour with respect to the 
read position, a call to read (and/or seek) can side-effect on clones
   * https://doc.rust-lang.org/std/fs/struct.File.html#method.try_clone might 
provide another option, although it shares the same pain as above w.r.t shared 
interior mutability. The parquet crate uses this 
https://github.com/apache/arrow-rs/blob/master/parquet/src/util/io.rs for 
better or worse
   * I agree that query parallelism is better handled at a higher level, e.g. 
in PartitionedFile, vs introducing internal parallelism for row groups within 
ParquetExec
   * The ChunkReader abstraction used by parquet is extremely confusing, 
probably not something we want to replicate, and something I may actually 
remove in the future - https://github.com/apache/arrow-rs/issues/1163
   * The async parquet exec PR as currently written moves away from using the 
ChunkObjectReader, etc... plumbing in favour of using the tokio io traits 
directly and so will likely fix this issue - 
https://github.com/apache/arrow-datafusion/pull/1617/files#diff-4b6766123d48840a81a799938d38bffd72dde49603e0a3fd9d3c0b5b46d2d224R54
   
   In short I think this is working around a slightly funky API exposed by the 
parquet crate, and to be honest I'd rather fix this API and/or move to the 
async reader API which doesn't share the same funky.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] tustvold commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Reply via email to