Re: [PR] Convert RunEndEncoded to Parquet [arrow-rs]

via GitHub Fri, 29 Aug 2025 05:25:58 -0700


albertlockett commented on code in PR #8069:
URL: https://github.com/apache/arrow-rs/pull/8069#discussion_r2310030616



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -4293,4 +4304,50 @@ mod tests {
         assert_eq!(get_dict_page_size(col0_meta), 1024 * 1024);
         assert_eq!(get_dict_page_size(col1_meta), 1024 * 1024 * 4);
     }
+
+    #[test]
+    fn arrow_writer_run_end_encoded() {
+        // Create a run array of strings
+        let mut builder = StringRunBuilder::<Int16Type>::new();
+        builder.extend(
+            vec![Some("alpha"); 1000]
+                .into_iter()
+                .chain(vec![Some("beta"); 1000]),
+        );
+        let run_array: RunArray<Int16Type> = builder.finish();
+        println!("run_array type: {:?}", run_array.data_type());
+        let schema = Arc::new(Schema::new(vec![Field::new(
+            "ree",
+            run_array.data_type().clone(),
+            run_array.is_nullable(),
+        )]));
+
+        // Write to parquet
+        let mut parquet_bytes: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut parquet_bytes, schema.clone(), None).unwrap();
+        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(run_array)]).unwrap();
+        writer.write(&batch).unwrap();
+        writer.close().unwrap();
+
+        // Schema of output is plain, not dictionary or REE encoded!!

Review Comment:
   > It makes sense to add the reader implementation in this PR as well, right?
   
   Sure, if you have time! 
   
   If you don't have time to add it in this PR, we could add it as a 
follow-up (I'd be happy to help with this as well, if needed).
   
   The temporary workaround to get the Parquet column to decode to the correct 
Arrow type would be to decode the column to the native Arrow type, and then 
convert it to a `RunArray` by casting. This is less efficient because we 
materialize a full-length array before encoding it to REE.
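   
   To make the cost concrete, here is a minimal sketch in plain std Rust (it 
does not use the arrow-rs API; `run_end_encode` is a hypothetical helper, and 
the `i16` run ends are only meant to mirror `Int16Type` from the test above). 
It assumes the column has already been decoded into a full-length slice, which 
is exactly the intermediate allocation the workaround pays for:

```rust
/// Run-end encode a fully materialized column.
///
/// Returns `(run_ends, values)`, where `run_ends[i]` is the exclusive end
/// index of run `i` and `values[i]` is the value repeated in that run.
/// Returns `None` if the input length does not fit in an `i16` run end.
///
/// NOTE: illustrative only; this is not how arrow-rs implements the cast.
fn run_end_encode<T: PartialEq + Clone>(
    input: &[Option<T>],
) -> Option<(Vec<i16>, Vec<Option<T>>)> {
    let mut run_ends: Vec<i16> = Vec::new();
    let mut values: Vec<Option<T>> = Vec::new();

    for (i, item) in input.iter().enumerate() {
        // Run ends are 1-based exclusive offsets into the logical array.
        let end = i16::try_from(i + 1).ok()?;

        // Does this element continue the run we are currently building?
        let extends_run = values.last().map_or(false, |last| last == item);

        if extends_run {
            // Same value as the previous element: push the run end forward.
            *run_ends.last_mut()? = end;
        } else {
            // New value (or first element): start a new run.
            values.push(item.clone());
            run_ends.push(end);
        }
    }

    Some((run_ends, values))
}
```

   For the data in the test (1000 x "alpha" followed by 1000 x "beta"), this 
walks all 2000 materialized elements to produce just two runs. A dedicated 
reader could in principle build the runs while decoding and skip the 
full-length intermediate.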
   
   For non-byte types (numeric types), this would already be happening. We 
follow this branch when creating the reader:
   
https://github.com/apache/arrow-rs/blob/1dacecba8e11cac307eea5d1a0f10c22d7f4a8b7/parquet/src/arrow/array_reader/builder.rs#L354-L358
   And then in the `PrimitiveArrayReader` we do the cast here:
   
https://github.com/apache/arrow-rs/blob/1dacecba8e11cac307eea5d1a0f10c22d7f4a8b7/parquet/src/arrow/array_reader/primitive_array.rs#L484
   
   For byte types (String, Binary, etc), we'd need to do something like this:
   
https://github.com/albertlockett/arrow-rs/commit/f3d072f5684ff2749e9a961eb8e9662a629cbc2b



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
