alamb commented on code in PR #10149:
URL: https://github.com/apache/arrow-rs/pull/10149#discussion_r3427742675
##########
parquet/src/file/writer.rs:
##########
@@ -236,14 +236,25 @@ impl<W: Write + Send> SerializedFileWriter<W> {
self.assert_previous_writer_closed()?;
let ordinal = self.row_group_index;
- let ordinal: i16 = ordinal.try_into().map_err(|_| {
+ // Thrift cannot encode lists with more than 2B elements
Review Comment:
I am not sure what the 2B elements refers to
##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -1438,6 +1433,8 @@ impl<'a> WriteThrift for FileMeta<'a> {
#[allow(unused_assignments)]
fn write_thrift<W: Write>(&self, writer: &mut
ThriftCompactOutputProtocol<W>) -> Result<()> {
writer.set_write_path_in_schema(self.write_path_in_schema);
+ // only write ordinal if all values will fit in an i16
Review Comment:
Would it be possible to write ordinal for the first `2^16-1` row groups, and
then just not write it for larger row groups. I think you could avoid changing
the thrift writer / adding `set_write_row_group_ordinal`
##########
parquet/src/parquet_thrift.rs:
##########
@@ -778,6 +780,16 @@ impl<W: Write> ThriftCompactOutputProtocol<W> {
self.write_path_in_schema
}
+ /// Control the writing of the `ordinal` element of the `RowGroup` struct.
Review Comment:
Should we add some context here about why one would disable writing the
optional `ordinal` field? Namely because it can't be set for row groups greater
than i32::MAX?
##########
parquet/src/file/writer.rs:
##########
@@ -2239,14 +2258,50 @@ mod tests {
assert_eq!(i, 0x8000);
assert_eq!(
e.to_string(),
- "Parquet error: Parquet does not support more than
32767 row groups per file (currently: 32768)"
+ "Parquet error: Parquet with encryption does not
support more than 32767 row groups per file (currently: 32768)"
);
}
}
}
writer.close().unwrap();
}
+ #[test]
+ fn test_32k_rowgroups() {
+ let message_type = "
+ message test_schema {
+ REQUIRED BYTE_ARRAY a (UTF8);
+ }
+ ";
+ let schema = Arc::new(parse_message_type(message_type).unwrap());
+ let file: File = tempfile::tempfile().unwrap();
+ let props = Arc::new(
+ WriterProperties::builder()
+ .set_statistics_enabled(EnabledStatistics::None)
+ .set_max_row_group_row_count(Some(1))
+ .build(),
+ );
+ let mut writer = SerializedFileWriter::new(&file, schema,
props).unwrap();
+
+ // Create 32k + 1 empty rowgroups. No row group ordinals should be
written (but we can't
+ // test for that).
Review Comment:
and arguably it is an implementation detail -- just testing writing 32k row
groups is enough
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]