alamb commented on code in PR #10149:
URL: https://github.com/apache/arrow-rs/pull/10149#discussion_r3427742675


##########
parquet/src/file/writer.rs:
##########
@@ -236,14 +236,25 @@ impl<W: Write + Send> SerializedFileWriter<W> {
         self.assert_previous_writer_closed()?;
         let ordinal = self.row_group_index;
 
-        let ordinal: i16 = ordinal.try_into().map_err(|_| {
+        // Thrift cannot encode lists with more than 2B elements

Review Comment:
   I am not sure what the 2B elements refers to
   
   



##########
parquet/src/file/metadata/thrift/mod.rs:
##########
@@ -1438,6 +1433,8 @@ impl<'a> WriteThrift for FileMeta<'a> {
     #[allow(unused_assignments)]
     fn write_thrift<W: Write>(&self, writer: &mut 
ThriftCompactOutputProtocol<W>) -> Result<()> {
         writer.set_write_path_in_schema(self.write_path_in_schema);
+        // only write ordinal if all values will fit in an i16

Review Comment:
   Would it be possible to write ordinal for the first `2^16-1` row groups, and 
then just not write it for larger row groups. I think you could avoid changing 
the thrift writer / adding `set_write_row_group_ordinal` 



##########
parquet/src/parquet_thrift.rs:
##########
@@ -778,6 +780,16 @@ impl<W: Write> ThriftCompactOutputProtocol<W> {
         self.write_path_in_schema
     }
 
+    /// Control the writing of the `ordinal` element of the `RowGroup` struct.

Review Comment:
   Should we add some context here about why one would disable writing the 
optional `ordinal` field? Namely because it can't be set for row groups greater 
than i32::MAX?



##########
parquet/src/file/writer.rs:
##########
@@ -2239,14 +2258,50 @@ mod tests {
                     assert_eq!(i, 0x8000);
                     assert_eq!(
                         e.to_string(),
-                        "Parquet error: Parquet does not support more than 
32767 row groups per file (currently: 32768)"
+                        "Parquet error: Parquet with encryption does not 
support more than 32767 row groups per file (currently: 32768)"
                     );
                 }
             }
         }
         writer.close().unwrap();
     }
 
+    #[test]
+    fn test_32k_rowgroups() {
+        let message_type = "
+            message test_schema {
+                REQUIRED BYTE_ARRAY a (UTF8);
+            }
+        ";
+        let schema = Arc::new(parse_message_type(message_type).unwrap());
+        let file: File = tempfile::tempfile().unwrap();
+        let props = Arc::new(
+            WriterProperties::builder()
+                .set_statistics_enabled(EnabledStatistics::None)
+                .set_max_row_group_row_count(Some(1))
+                .build(),
+        );
+        let mut writer = SerializedFileWriter::new(&file, schema, 
props).unwrap();
+
+        // Create 32k + 1 empty rowgroups. No row group ordinals should be 
written (but we can't
+        // test for that).

Review Comment:
   and arguably it is an implementation detail -- just testing writing 32k row 
groups is enough



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to