pengzhiwei2018 edited a comment on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-739497762
> @pengzhiwei2018 would you please describe in more details about the issue?

Hi @leesf, sorry for the late response. I found that when reading a Hudi table
registered in the Hive metastore with Spark SQL (with
SparkSession#enableHiveSupport enabled), the query result is incorrect and
contains many duplicate records. After debugging, I found that the cause is
that some table properties and serde properties are missing from the Hive
table metadata. Spark SQL therefore treats the table as a plain Hive table
rather than a Hudi datasource table, and ends up reading it as a normal
parquet table, which leads to the duplicate records in the query result.
I fixed this issue by adding the missing table properties and serde
properties to the Hive table. The missing table properties are:
```
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
```
and the missing serde property is: `'path'='/path/to/hudi'`
With this fix, Spark SQL can now read the Hudi table in the Hive metastore
correctly.
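
For illustration, adding the missing metadata by hand would look roughly like the Hive DDL below. The table name, schema JSON, and partition column are hypothetical placeholders; the actual fix computes these values from the table's schema during Hive sync.

```sql
-- Hypothetical sketch: mark an existing Hive table as a Spark datasource
-- table backed by Hudi, so Spark SQL stops reading it as plain parquet.
-- 'hudi_table', the schema JSON, and 'dt' are placeholders, not real values.
ALTER TABLE hudi_table SET TBLPROPERTIES (
  'spark.sql.sources.provider' = 'hudi',
  'spark.sql.sources.schema.numParts' = '1',
  'spark.sql.sources.schema.part.0' = '{"type":"struct","fields":[]}',
  'spark.sql.sources.schema.numPartCols' = '1',
  'spark.sql.sources.schema.partCol.0' = 'dt'
);

-- The serde property pointing Spark at the Hudi base path:
ALTER TABLE hudi_table SET SERDEPROPERTIES ('path' = '/path/to/hudi');
```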
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]