wgtmac commented on issue #37389:
URL: https://github.com/apache/arrow/issues/37389#issuecomment-1708621290

   @negrusti Sorry for the late reply. I tried using Apache Spark to rewrite the file, and it seems to have worked.
   
   ```scala
   package a.b.c
   
   import org.apache.spark.sql.SparkSession
   
   object ParquetTest {
   
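     def main(args: Array[String]): Unit = {
       // A local, single-threaded session is enough for a one-off rewrite.
       val spark = SparkSession.builder.master("local[1]").appName("Parquet Test").getOrCreate()
       // Read the original file and write it back as a single part file under /tmp/rewrite.
       spark.read.parquet("/tmp/test.parquet").repartition(1).write.parquet("/tmp/rewrite")
     }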
   
   }
   ```
   
   The original schema, as printed by parquet-cli from parquet-mr:
   ```
   Created by: parquet-mr version 1.12.2-amzn-athena-1 (build 6c6353027ef5d7782e8657ea59b290452cdfdaee)
   Properties:
     writer.time.zone: UTC
   Schema:
   message hive_schema {
     optional binary id (STRING);
     optional int96 updatetime;
     optional int32 version;
     optional int32 level;
     optional binary subtype (STRING);
     optional group connectors (LIST) {
       repeated group bag {
         optional binary array_element (STRING);
       }
     }
     optional binary road (STRING);
     optional group sources (LIST) {
       repeated group bag {
         optional group array_element (MAP) {
           repeated group key_value (MAP_KEY_VALUE) {
             optional binary key (STRING);
             optional binary value (STRING);
           }
         }
       }
     }
     optional group bbox {
       optional double minx;
       optional double maxx;
       optional double miny;
       optional double maxy;
     }
     optional binary geometry;
   }
   ```
   
   The schema in the rewritten file is:
   ```
   Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
   Properties:
                      org.apache.spark.version: 3.3.2
     org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"updatetime","type":"timestamp","nullable":true,"metadata":{}},{"name":"version","type":"integer","nullable":true,"metadata":{}},{"name":"level","type":"integer","nullable":true,"metadata":{}},{"name":"subtype","type":"string","nullable":true,"metadata":{}},{"name":"connectors","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"road","type":"string","nullable":true,"metadata":{}},{"name":"sources","type":{"type":"array","elementType":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"containsNull":true},"nullable":true,"metadata":{}},{"name":"bbox","type":{"type":"struct","fields":[{"name":"minx","type":"double","nullable":true,"metadata":{}},{"name":"maxx","type":"double","nullable":true,"metadata":{}},{"name":"miny","type":"double","nullable":true,"metadata":{}},{"name":"maxy","type":"double","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"geometry","type":"binary","nullable":true,"metadata":{}}]}
   Schema:
   message spark_schema {
     optional binary id (STRING);
     optional int96 updatetime;
     optional int32 version;
     optional int32 level;
     optional binary subtype (STRING);
     optional group connectors (LIST) {
       repeated group list {
         optional binary element (STRING);
       }
     }
     optional binary road (STRING);
     optional group sources (LIST) {
       repeated group list {
         optional group element (MAP) {
           repeated group key_value {
             required binary key (STRING);
             optional binary value (STRING);
           }
         }
       }
     }
     optional group bbox {
       optional double minx;
       optional double maxx;
       optional double miny;
       optional double maxy;
     }
     optional binary geometry;
   }
   ```
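
   To double-check the rewrite, something like the following could compare the two outputs from Spark's point of view. This is only a sketch: the object `ParquetRewriteCheck` and its `compare` helper are hypothetical names, and the paths are assumed to be the same ones used in the rewrite above. Note that it compares Spark-level schemas, so Parquet-level group names (`bag`/`array_element` vs `list`/`element`) are not visible to it.

   ```scala
   package a.b.c

   import org.apache.spark.sql.SparkSession

   // Hypothetical helper, not part of the original rewrite job.
   object ParquetRewriteCheck {

     // Returns (schemas equal?, row count of original, row count of rewritten).
     def compare(spark: SparkSession, originalPath: String, rewrittenPath: String): (Boolean, Long, Long) = {
       val original = spark.read.parquet(originalPath)
       val rewritten = spark.read.parquet(rewrittenPath)
       // StructType equality covers field names, types and nullability as Spark sees them.
       (original.schema == rewritten.schema, original.count(), rewritten.count())
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder.master("local[1]").appName("Parquet Rewrite Check").getOrCreate()
       // Assumed paths: the input and output of the rewrite job.
       val (sameSchema, originalRows, rewrittenRows) = compare(spark, "/tmp/test.parquet", "/tmp/rewrite")
       println(s"Schemas equal: $sameSchema")
       println(s"Row counts: $originalRows (original) vs $rewrittenRows (rewritten)")
       spark.stop()
     }

   }
   ```

   If the schemas compare equal and the row counts agree, the rewrite at least did not drop columns or rows; it says nothing about whether other readers accept the new file.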


