alamb commented on code in PR #9357:
URL: https://github.com/apache/arrow-rs/pull/9357#discussion_r2790331330
##########
parquet/src/file/properties.rs:
##########
@@ -575,7 +595,34 @@ impl WriterPropertiesBuilder {
/// If the value is set to 0.
pub fn set_max_row_group_size(mut self, value: usize) -> Self {
Review Comment:
> Wait, wait, not so fast, this is a breaking change, as clippy will fail for
> users. I was asking; it might be in a different PR, but I am open to
> discussion. If you keep it, please update the PR description under "changes
> to users".

Yeah, in general I agree with @yonipeleg33 -- I don't think a clippy failure
is a breaking change per se -- the Rust compiler will be happy to compile it.
If downstream projects want to take a stricter "clippy must pass" stance, I
don't think that is technically an API breakage.
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -331,18 +341,58 @@ impl<W: Write + Send> ArrowWriter<W> {
),
};
- // If would exceed max_row_group_size, split batch
- if in_progress.buffered_rows + batch.num_rows() > self.max_row_group_size {
- let to_write = self.max_row_group_size - in_progress.buffered_rows;
- let a = batch.slice(0, to_write);
- let b = batch.slice(to_write, batch.num_rows() - to_write);
- self.write(&a)?;
- return self.write(&b);
+ if let Some(max_rows) = self.max_row_group_row_count {
+ if in_progress.buffered_rows + batch.num_rows() > max_rows {
+ let to_write = max_rows - in_progress.buffered_rows;
+ let a = batch.slice(0, to_write);
+ let b = batch.slice(to_write, batch.num_rows() - to_write);
+ self.write(&a)?;
+ return self.write(&b);
Review Comment:
Since this recurses, it could potentially blow out the stack with
pathological inputs (e.g. a RecordBatch with 1M rows and a
`max_row_group_row_count` of 1). I don't think it is necessary to fix now, I
just wanted to point it out.
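For illustration, a minimal sketch of the iterative alternative hinted at
above (the function and parameter names are hypothetical, not the PR's code):
slicing the batch in a loop keeps stack usage constant no matter how many
chunks the batch is split into. It assumes `buffered < max_rows`, i.e. a full
row group would already have been flushed before this point:

```rust
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;

/// Hypothetical sketch: split `batch` so that no chunk pushes the
/// in-progress row group past `max_rows`, looping instead of recursing.
/// Assumes `buffered < max_rows` (a full group would already be flushed).
fn write_in_chunks<F>(
    batch: &RecordBatch,
    buffered: usize,
    max_rows: usize,
    mut write_chunk: F,
) -> Result<(), ArrowError>
where
    F: FnMut(&RecordBatch) -> Result<(), ArrowError>,
{
    // Rows that still fit in the current row group; later chunks start a
    // fresh group, so their capacity resets to `max_rows`.
    let mut capacity = max_rows - buffered;
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = capacity.min(batch.num_rows() - offset);
        write_chunk(&batch.slice(offset, len))?;
        offset += len;
        capacity = max_rows;
    }
    Ok(())
}
```

With this shape, a 1M-row batch and `max_rows = 1` performs 1M loop
iterations rather than 1M nested calls, so the pathological input above
cannot overflow the stack.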