WilliamWhispell edited a comment on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-830462191
So, another side effect. When you remove the breaking entry from the example above (hudi_key 4, the record with the null inside the array), you can reproduce as follows:
docker run -d -p 8888:8888 -v ./data:/home/jovyan --name spark jupyter/pyspark-notebook
docker exec -it spark /bin/bash
pyspark \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()
spark_df = spark.createDataFrame([
(1, '2020/10/29', ['NY123', 'LA456']),
(2, '2020/10/29', []),
(3, '2020/10/29', None),
(5, '2020/10/29', ['ABC123']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
spark_df.write.parquet("/home/jovyan/spark-out-native.parquet")
spark_df.write.format("org.apache.hudi") \
    .mode("append") \
    .option("hoodie.table.name", "test") \
    .option("hoodie.datasource.write.precombine.field", "hudi_key") \
    .option("hoodie.datasource.write.recordkey.field", "hudi_key") \
    .save("/home/jovyan/spark-out-hudi.parquet")
quit()
exit
docker cp spark:/home/jovyan/spark-out-native.parquet ./
docker cp spark:/home/jovyan/spark-out-hudi.parquet ./
Then examine the schemas: you'll see the Spark native write has:
optional group field_id=3 postcodes (List) {
repeated group field_id=4 list {
optional byte_array field_id=5 element (String);
}
}
however, the Hudi write has:
optional group field_id=8 postcodes (List) {
repeated byte_array field_id=9 array (String);
}
It appears that Hudi writes lists using the legacy 2-level Parquet list structure (a repeated field directly under the group, with no inner list/element levels) rather than the modern 3-level encoding that Spark's native writer produces.
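For what it's worth: Hudi writes Parquet through parquet-avro, and parquet-avro has a parquet.avro.write-old-list-structure property (default true) that controls exactly this 2-level vs 3-level choice. Whether Hudi forwards Hadoop conf to its writer is an assumption I haven't verified, but if it does, something like this launch flag should force the modern encoding:

```shell
pyspark \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.hadoop.parquet.avro.write-old-list-structure=false'
```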