alamb commented on code in PR #5471:
URL: https://github.com/apache/arrow-rs/pull/5471#discussion_r1516893367
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -244,16 +244,35 @@ impl<W: Write + Send> ArrowWriter<W> {
self.writer.append_key_value_metadata(kv_metadata)
}
+ /// Returns a reference to the underlying writer.
+ pub fn inner(&self) -> &W {
+ self.writer.inner()
+ }
+
+ /// Returns a mutable reference to the underlying writer.
+ ///
+ /// It is inadvisable to directly write to the underlying writer.
+ pub fn inner_mut(&mut self) -> &mut W {
+ self.writer.inner_mut()
+ }
+
/// Flushes any outstanding data and returns the underlying writer.
pub fn into_inner(mut self) -> Result<W> {
self.flush()?;
self.writer.into_inner()
}
/// Close and finalize the underlying Parquet writer
- pub fn close(mut self) -> Result<crate::format::FileMetaData> {
+ ///
+ /// Unlike [`Self::close`] this does not consume self
+ pub fn finish(&mut self) -> Result<crate::format::FileMetaData> {
self.flush()?;
- self.writer.close()
+ self.writer.finish()
+ }
+
+ /// Close and finalize the underlying Parquet writer
Review Comment:
```suggestion
/// Close and finalize the underlying Parquet writer, consuming self
///
/// Returns the metadata for the newly created Parquet file.
```
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -244,16 +244,35 @@ impl<W: Write + Send> ArrowWriter<W> {
self.writer.append_key_value_metadata(kv_metadata)
}
+ /// Returns a reference to the underlying writer.
+ pub fn inner(&self) -> &W {
+ self.writer.inner()
+ }
+
+ /// Returns a mutable reference to the underlying writer.
+ ///
+ /// It is inadvisable to directly write to the underlying writer.
+ pub fn inner_mut(&mut self) -> &mut W {
+ self.writer.inner_mut()
+ }
+
/// Flushes any outstanding data and returns the underlying writer.
pub fn into_inner(mut self) -> Result<W> {
self.flush()?;
self.writer.into_inner()
}
/// Close and finalize the underlying Parquet writer
- pub fn close(mut self) -> Result<crate::format::FileMetaData> {
+ ///
Review Comment:
```suggestion
/// returning the completed [`crate::formatFileMetaData`] for the
written file.
```
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -244,16 +244,35 @@ impl<W: Write + Send> ArrowWriter<W> {
self.writer.append_key_value_metadata(kv_metadata)
}
+ /// Returns a reference to the underlying writer.
+ pub fn inner(&self) -> &W {
+ self.writer.inner()
+ }
+
+ /// Returns a mutable reference to the underlying writer.
+ ///
+ /// It is inadvisable to directly write to the underlying writer.
+ pub fn inner_mut(&mut self) -> &mut W {
+ self.writer.inner_mut()
+ }
+
/// Flushes any outstanding data and returns the underlying writer.
pub fn into_inner(mut self) -> Result<W> {
self.flush()?;
self.writer.into_inner()
}
/// Close and finalize the underlying Parquet writer
- pub fn close(mut self) -> Result<crate::format::FileMetaData> {
+ ///
+ /// Unlike [`Self::close`] this does not consume self
+ pub fn finish(&mut self) -> Result<crate::format::FileMetaData> {
Review Comment:
Could you add some comments here about what happens if one attempts to write
to this writer after calling `finish()`?
##########
parquet/src/file/writer.rs:
##########
@@ -1766,6 +1802,11 @@ mod tests {
assert!(row_group.columns[1].offset_index_offset.is_some());
assert!(row_group.columns[1].column_index_offset.is_none());
+ let err = file_writer.next_row_group().err().unwrap().to_string();
+ assert_eq!(err, "Parquet error: SerializedFileWriter already
finished");
+
+ drop(file_writer);
Review Comment:
why is the drop necessary?
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -244,16 +244,35 @@ impl<W: Write + Send> ArrowWriter<W> {
self.writer.append_key_value_metadata(kv_metadata)
}
+ /// Returns a reference to the underlying writer.
+ pub fn inner(&self) -> &W {
+ self.writer.inner()
+ }
+
+ /// Returns a mutable reference to the underlying writer.
+ ///
+ /// It is inadvisable to directly write to the underlying writer.
Review Comment:
Maybe we can add some additional commentary about *why* it is inadvisable
and maybe when it is ok (clearly there is some usecase otherwise we wouldn't
have it) for example
```suggestion
/// WARNING: directly writing to the underlying writer
/// prior to closing this ArrowWriter can result in corrupted
/// parquet files.
///
/// This API can be used to flush the underlying writer
```
I think the actual usecase in this PR is more complicated (the async writer
writes directly into the buffer) so I probably don't fully understand what is
going on / the use case. Perhaps all the more reason to improve the
documentation 🤔
##########
parquet/src/arrow/async_writer/mod.rs:
##########
@@ -71,22 +69,19 @@ use tokio::io::{AsyncWrite, AsyncWriteExt};
/// buffer's threshold is exceeded.
pub struct AsyncArrowWriter<W> {
/// Underlying sync writer
- sync_writer: ArrowWriter<SharedBuffer>,
+ sync_writer: ArrowWriter<Vec<u8>>,
/// Async writer provided by caller
async_writer: W,
- /// The inner buffer shared by the `sync_writer` and the `async_writer`
- shared_buffer: SharedBuffer,
-
/// Trigger forced flushing once buffer size reaches this value
buffer_size: usize,
}
impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> {
/// Try to create a new Async Arrow Writer.
///
- /// `buffer_size` determines the number of bytes to buffer before flushing
+ /// `buffer_size` determines the minimum number of bytes to buffer before
flushing
Review Comment:
the comment below says "self::write" will flush the intermediate buffer it
it is at least half full, but this comment seems to imply that `buffer_size`
determines when to flush. This seems inconsistent to me, but maybe I don't
undersand
The same comments applies to `try_new_with_options` below
##########
parquet/src/file/writer.rs:
##########
@@ -387,6 +411,18 @@ impl<W: Write + Send> SerializedFileWriter<W> {
&self.props
}
+ /// Returns a reference to the underlying writer.
+ pub fn inner(&self) -> &W {
+ self.buf.inner()
+ }
+
+ /// Returns a mutable reference to the underlying writer.
+ ///
+ /// It is inadvisable to directly write to the underlying writer.
Review Comment:
As above, I think it would be good to explain *why* it is inadvisable and
document when one would use this API
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]