tustvold commented on pull request #1905: URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1062096897
I intend to pick up #1617 tomorrow and so I might have some more thoughts then, but some initial thoughts: * `ObjectStoreWrapper` has perhaps surprising behaviour with respect to the read position, a call to read (and/or seek) can side-effect on clones * https://doc.rust-lang.org/std/fs/struct.File.html#method.try_clone might provide another option, although it shares the same pain as above w.r.t shared interior mutability. The parquet crate uses this https://github.com/apache/arrow-rs/blob/master/parquet/src/util/io.rs for better or worse * I agree that query parallelism is better handled at a higher level, e.g. in PartitionedFile, vs introducing internal parallelism for row groups within ParquetExec * The ChunkReader abstraction used by parquet is extremely confusing, probably not something we want to replicate, and something I may actually remove in the future - https://github.com/apache/arrow-rs/issues/1163 * The async parquet exec PR as currently written moves away from using the ChunkObjectReader, etc... plumbing in favour of using the tokio io traits directly and so will likely fix this issue - https://github.com/apache/arrow-datafusion/pull/1617/files#diff-4b6766123d48840a81a799938d38bffd72dde49603e0a3fd9d3c0b5b46d2d224R54 In short I think this is working around a slightly funky API exposed by the parquet crate, and to be honest I'd rather fix this API and/or move to the async reader API which doesn't share the same funky. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
