alamb commented on code in PR #4269:
URL: https://github.com/apache/arrow-rs/pull/4269#discussion_r1204102839
##########
parquet/src/file/writer.rs:
##########
@@ -475,28 +464,107 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
Ok(())
};
+ (self.buf, Box::new(on_close))
+ }
- let column = self.descr.column(self.column_index);
- self.column_index += 1;
-
- Ok(Some(factory(
- column,
- &self.props,
- page_writer,
- Box::new(on_close),
- )?))
+ /// Returns the next column writer, if available, using the factory function;
+ /// otherwise returns `None`.
+ pub(crate) fn next_column_with_factory<'b, F, C>(
+ &'b mut self,
+ factory: F,
+ ) -> Result<Option<C>>
+ where
+ F: FnOnce(
+ ColumnDescPtr,
+ WriterPropertiesPtr,
+ Box<dyn PageWriter + 'b>,
+ OnCloseColumnChunk<'b>,
+ ) -> Result<C>,
+ {
+ self.assert_previous_writer_closed()?;
+ Ok(match self.next_column_desc() {
+ Some(column) => {
+ let props = self.props.clone();
+ let (buf, on_close) = self.get_on_close();
+ let page_writer = Box::new(SerializedPageWriter::new(buf));
+ Some(factory(column, props, page_writer, Box::new(on_close))?)
+ }
+ None => None,
+ })
}
/// Returns the next column writer, if available; otherwise returns `None`.
/// In case of any IO error or Thrift error, or if row group writer has already been
/// closed returns `Err`.
pub fn next_column(&mut self) -> Result<Option<SerializedColumnWriter<'_>>> {
self.next_column_with_factory(|descr, props, page_writer, on_close| {
-         let column_writer = get_column_writer(descr, props.clone(), page_writer);
+ let column_writer = get_column_writer(descr, props, page_writer);
Ok(SerializedColumnWriter::new(column_writer, Some(on_close)))
})
}
+ /// Append a column chunk from another source without decoding it
+ ///
+ /// This can be used for efficiently concatenating or projecting parquet data,
+ /// or encoding parquet data to temporary in-memory buffers
+ pub fn splice_column<R: ChunkReader>(
Review Comment:
I have some comments on this API:
# `pub`?
Are we happy enough with this API to mark it `pub`? Maybe we should leave it
crate-private until there is an example showing how to use it (see the
`Example` section below).
# Naming
I don't understand the use of the name `splice`, given that this API appears
to *append* a column -- the only difference is that the column comes from some
other source.
Given this, I suggest an alternate name like `append_column` or
`append_column_from_reader`.
# Example
I also think it would be super helpful to write an example program in
[`parquet/examples`](https://github.com/apache/arrow-rs/tree/master/parquet/examples)
that shows how to append data to an existing file (e.g.
https://github.com/apache/arrow-rs/discussions/4150) and link to it from this
doc comment. Perhaps you plan to do that as a follow-on PR.
# Documentation
I recommend the following additions:
1. Add the caveat from the PR description that the data has to match, or else
invalid parquet will be written
2. Add a note that the next column from `reader` is appended to the next
column from `writer` (the state is stored in the reader)
3. Explain that `close` is the result of closing the previous column in this
writer
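To make the suggestion concrete, the doc comment could gain something along these lines (the wording and the `reader`/`close` parameter names are only my guesses at the final API, not a proposal for exact text):

```rust
/// Append a column chunk from `reader` to this writer without decoding it
///
/// # Caveats
///
/// The appended chunk must match the schema of the target column, otherwise
/// invalid parquet will be written.
///
/// The *next* column chunk from `reader` is appended to the *next* column of
/// this writer (the iteration state is stored in the reader), and `close` is
/// the result of closing the previous column in this writer.
```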
# Reduced Foot Guns 🦶 🔫
While I agree that providing users unlimited protection is probably not
reasonable, I do think we should provide basic error checking to help users
avoid silly errors.
For example, perhaps we can at least make sure `metadata.column_descr_ptr()`
matches the target column (to make sure the column name and type match)?
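As a sketch of what I mean, here is the shape of the check, with simplified stand-in types (`ColumnSchema` and `check_chunk_matches` are hypothetical; the real code would compare the incoming chunk's `ColumnDescriptor` from `metadata.column_descr_ptr()` against the writer's next column descriptor):

```rust
// Simplified stand-in for the parts of parquet's ColumnDescriptor that the
// check would compare (column path and physical type).
#[derive(Debug, PartialEq)]
struct ColumnSchema {
    path: String,          // dotted column path, e.g. "a.b.c"
    physical_type: String, // e.g. "INT64"
}

/// Reject an appended chunk whose name or type does not match the target
/// column, instead of silently writing invalid parquet.
fn check_chunk_matches(incoming: &ColumnSchema, target: &ColumnSchema) -> Result<(), String> {
    if incoming != target {
        return Err(format!(
            "appended column chunk {:?} does not match target column {:?}",
            incoming, target
        ));
    }
    Ok(())
}
```

This wouldn't catch every mistake (e.g. mismatched encodings), but it covers the cheapest and most common errors.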
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]