tustvold commented on code in PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#discussion_r1180807171
##########
parquet/src/file/reader.rs:
##########
@@ -44,19 +46,47 @@ pub trait Length {
}
/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+///
+/// For more information see [`File::try_clone`]
pub trait ChunkReader: Length + Send + Sync {
- type T: Read + Send;
- /// Get a serially readable slice of the current reader
- /// This should fail if the slice exceeds the current bounds
- fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+ type T: Read;
+
+ /// Get a [`Read`] starting at the provided file offset
+ ///
+ /// Subsequent or concurrent calls to [`Self::get_read`] or [`Self::get_bytes`] may
Review Comment:
FileSource provided protection against subsequent calls to `get_read` by
calling `Seek` on every read, but offered no protection against concurrent
access. I think it is less risky to clearly not support non-serial usage at
all than to break only under concurrent usage.
**TBC: there are no safety implications to leaving this access
unsynchronised.** You will just get interleaved reads, which is no worse than
reading gibberish.
One option would be to add a `Mutex` to synchronise access; however, this
solution is necessarily incomplete, as a user can simply call
`File::try_clone` to obtain an unsynchronised handle. Ultimately there is no
reliable way to synchronise file IO. I think that if no synchronisation is
fine for the standard library, it is fine for the parquet crate.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]