WilliamWhispell edited a comment on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-830462191


   So there is another side effect once you remove the breaking entry from the example 
above (hudi_key 4, the row with the null in the array):
   
   docker run -d -p 8888:8888 -v ./data:/home/jovyan --name spark jupyter/pyspark-notebook
   
   docker exec -it spark /bin/bash
   
   pyspark \
     --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
        from pyspark.sql import SparkSession
   
        spark = SparkSession.builder.master('local').getOrCreate()
   
        spark_df = spark.createDataFrame([
                (1, '2020/10/29', ['NY123', 'LA456']),
                (2, '2020/10/29', []),
                (3, '2020/10/29', None),
                (5, '2020/10/29', ['ABC123']), 
        ], ['hudi_key', 'hudi_partition', 'postcodes'])
   
   
        spark_df.write.parquet("/home/jovyan/spark-out-native.parquet")
        
        
        spark_df.write.format("org.apache.hudi") \
            .mode("append") \
            .option("hoodie.table.name", "test") \
            .option("hoodie.datasource.write.precombine.field", "hudi_key") \
            .option("hoodie.datasource.write.recordkey.field", "hudi_key") \
            .save("/home/jovyan/spark-out-hudi.parquet")
        
        quit()
        
   exit
   
   docker cp spark:/home/jovyan/spark-out-native.parquet ./
   docker cp spark:/home/jovyan/spark-out-hudi.parquet ./
   
   
   Then examine the schemas; you'll see the Spark native write has:
   
   optional group field_id=3 postcodes (List) {
       repeated group field_id=4 list {
         optional byte_array field_id=5 element (String);
       }
     }
     
   however, the hudi write has:
     optional group field_id=8 postcodes (List) {
       repeated byte_array field_id=9 array (String);
     }
   
   
   It appears that Hudi writes lists using the legacy two-level Parquet list encoding (a repeated field directly inside the list group) rather than the standard three-level encoding.
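   If the legacy encoding is coming from the parquet-avro write path, one thing that may be worth experimenting with (an untested suggestion, not a verified fix) is turning off the old list structure via Spark's Hadoop conf when launching pyspark:

```shell
# Untested: parquet.avro.write-old-list-structure is the parquet-avro
# setting that controls the legacy two-level list encoding (default true).
pyspark \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.hadoop.parquet.avro.write-old-list-structure=false'
```

   Note that the two-level structure cannot represent null elements inside a list, which would also be consistent with the breaking entry mentioned above.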


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

