nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470102183



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -674,8 +712,16 @@ mod tests {
         )
         .unwrap();
 
+        let props = WriterProperties::builder()
+            .set_key_value_metadata(Some(vec![KeyValue {
+                key: "test_key".to_string(),
+                value: Some("test_value".to_string()),
+            }]))
+            .build();
+
         let file = get_temp_file("test_arrow_writer_complex.parquet", &[]);

Review comment:
       TODO: this is stateful, and might fail as this file night not exist. 
Create a small file for the test

##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -53,13 +56,48 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     pub fn try_new(
         writer: W,
         arrow_schema: SchemaRef,
-        props: Option<Rc<WriterProperties>>,
+        props: Option<WriterProperties>,
     ) -> Result<Self> {
         let schema = crate::arrow::arrow_to_parquet_schema(&arrow_schema)?;
+        // add serialized arrow schema
+        let mut serialized_schema = 
arrow::ipc::writer::schema_to_bytes(&arrow_schema);
+        let schema_len = serialized_schema.len();
+        let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+        len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);

Review comment:
       Our IPC implementation is still based on the legacy format, so we have 
to manually prepend the continuation marker and message length.

##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -55,40 +55,75 @@ pub fn parquet_to_arrow_schema_by_columns<T>(
 where
     T: IntoIterator<Item = usize>,
 {
-    let mut base_nodes = Vec::new();
-    let mut base_nodes_set = HashSet::new();
-    let mut leaves = HashSet::new();
-
-    for c in column_indices {
-        let column = parquet_schema.column(c).self_type() as *const Type;
-        let root = parquet_schema.get_column_root(c);
-        let root_raw_ptr = root as *const Type;
-
-        leaves.insert(column);
-        if !base_nodes_set.contains(&root_raw_ptr) {
-            base_nodes.push(root);
-            base_nodes_set.insert(root_raw_ptr);
+    let mut metadata = 
parse_key_value_metadata(key_value_metadata).unwrap_or_default();

Review comment:
       @sunchao the main change here is that if the `ARROW:schema` exists, we 
try to decode it and pass it back as the schema to read.
   Unlike C++, if there's a problem reading the schema, we do not return an 
error, but continue to reconstruct the schema with some information loss. The 
error is logged to console.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to