albertlockett commented on code in PR #8069:
URL: https://github.com/apache/arrow-rs/pull/8069#discussion_r2305220259


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -4293,4 +4304,50 @@ mod tests {
         assert_eq!(get_dict_page_size(col0_meta), 1024 * 1024);
         assert_eq!(get_dict_page_size(col1_meta), 1024 * 1024 * 4);
     }
+
+    #[test]
+    fn arrow_writer_run_end_encoded() {
+        // Create a run array of strings
+        let mut builder = StringRunBuilder::<Int16Type>::new();
+        builder.extend(
+            vec![Some("alpha"); 1000]
+                .into_iter()
+                .chain(vec![Some("beta"); 1000]),
+        );
+        let run_array: RunArray<Int16Type> = builder.finish();
+        println!("run_array type: {:?}", run_array.data_type());
+        let schema = Arc::new(Schema::new(vec![Field::new(
+            "ree",
+            run_array.data_type().clone(),
+            run_array.is_nullable(),
+        )]));
+
+        // Write to parquet
+        let mut parquet_bytes: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut parquet_bytes, 
schema.clone(), None).unwrap();
+        let batch = RecordBatch::try_new(schema.clone(), 
vec![Arc::new(run_array)]).unwrap();
+        writer.write(&batch).unwrap();
+        writer.close().unwrap();
+
+        // Schema of output is plain, not dictionary or REE encoded!!

Review Comment:
   Also, if this has been added ^^^ I suspect it might cause an error. If i'm 
not mistaken, once the arrow type is known, when we construct the reader we'll 
end up in this builder code:
   
https://github.com/apache/arrow-rs/blob/1dacecba8e11cac307eea5d1a0f10c22d7f4a8b7/parquet/src/arrow/array_reader/builder.rs#L381-L394
   
   and because there's no match for the arrow datatype 
`DataType::RunEndEncoded`, we'll fall into this branch which may error
   ```
   _ => make_byte_array_reader(page_iterator, column_desc, arrow_type)?,
   ```
   
   The quick fix would be to add something like this, which will still cause 
the array to read back as plan encoded array.
   ```rs
                   Some(DataType::RunEndEncoded(_, val)) => {
                       make_byte_array_reader(page_iterator, column_desc, 
Some(val.data_type().clone()))?
                   }
   ```
   
   The more involved (and probably more correct) thing to do would be to add a 
reader implementation for REE. Similarly to what's been done for dictionary 
arrays
   
https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/array_reader/byte_array_dictionary.rs
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to