WilliamWhispell edited a comment on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-830462191
So, another side effect. When you remove the breaking entry from the example above (hudi_key 4, the record with the null inside the array), you can reproduce as follows:
docker run -d -p 8888:8888 -v ./data:/home/jovyan --name spark jupyter/pyspark-notebook
docker exec -it spark /bin/bash
pyspark \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()
spark_df = spark.createDataFrame([
(1, '2020/10/29', ['NY123', 'LA456']),
(2, '2020/10/29', []),
(3, '2020/10/29', None),
(5, '2020/10/29', ['ABC123']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
spark_df.write.parquet("/home/jovyan/spark-out-native.parquet")
spark_df.write.format("org.apache.hudi") \
    .mode("append") \
    .option("hoodie.table.name", "test") \
    .option("hoodie.datasource.write.precombine.field", "hudi_key") \
    .option("hoodie.datasource.write.recordkey.field", "hudi_key") \
    .save("/home/jovyan/spark-out-hudi.parquet")
quit()
exit
docker cp spark:/home/jovyan/spark-out-native.parquet ./
docker cp spark:/home/jovyan/spark-out-hudi.parquet ./
Then examine the schemas: you'll see the Spark native write has:
optional group field_id=3 postcodes (List) {
repeated group field_id=4 list {
optional byte_array field_id=5 element (String);
}
}
however, the Hudi write has:
optional group field_id=8 postcodes (List) {
repeated byte_array field_id=9 array (String);
}
It appears that Hudi writes lists using the legacy 2-level Parquet list structure (a repeated field directly under the group, with no inner list/element levels) rather than the modern 3-level encoding that Spark's native writer produces.
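For what it's worth: Hudi writes Parquet through parquet-avro, and parquet-avro has a parquet.avro.write-old-list-structure property (default true) that controls exactly this 2-level vs 3-level choice. Whether Hudi forwards Hadoop conf to its writer is an assumption I haven't verified, but if it does, something like this launch flag should force the modern encoding:

```shell
pyspark \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.hadoop.parquet.avro.write-old-list-structure=false'
```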