pengzhiwei2018 commented on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-739497762


   > @pengzhiwei2018 would you please describe in more details about the issue?
   
   Hi @leesf, sorry for the late response. I found that when reading a Hudi 
table with Spark SQL while `SparkSession#enableHiveSupport` is true, the query 
result is incorrect and contains many duplicate records. After debugging, I 
found the cause: some table properties and serde properties are missing from 
the Hive table metadata. Spark SQL therefore treats the table as a plain Hive 
table rather than a Hudi datasource table, and ends up reading it as a normal 
parquet table, which produces the duplicate records in the query result.
    I fixed this issue by adding the missing table properties and serde 
properties to the Hive table. The missing table properties are:
   ```
   spark.sql.sources.provider = 'hudi'
   spark.sql.sources.schema.numParts = 'xx'
   spark.sql.sources.schema.part.{num} = 'xx'
   spark.sql.sources.schema.numPartCols = 'xx'
   spark.sql.sources.schema.partCol.{num} = 'xx'
   ```
   and the missing serde property is: `'path' = '/path/to/hudi'`
   With this fix, Spark SQL can now read the Hudi table registered in the Hive 
metastore correctly.
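   For context, here is a minimal Java sketch (hypothetical class and method 
names, not Hudi's or Spark's actual code) of why these properties matter: a 
Spark-style catalog detects a datasource table by the presence of 
`spark.sql.sources.provider`, and reassembles the table's schema JSON from the 
`numParts`/`part.{num}` properties, which exist because the Hive metastore 
limits the length of a single property value.

   ```java
   import java.util.Map;
   import java.util.Optional;

   // Hypothetical sketch, not the actual Spark/Hudi implementation.
   public class SchemaProps {
       // A table is treated as a datasource table only if the provider
       // property is present; otherwise it falls back to a plain Hive table.
       static boolean isDatasourceTable(Map<String, String> props) {
           return props.containsKey("spark.sql.sources.provider");
       }

       // Reassemble the schema JSON that was split across numParts pieces.
       static Optional<String> schemaJson(Map<String, String> props) {
           String numParts = props.get("spark.sql.sources.schema.numParts");
           if (numParts == null) {
               return Optional.empty();
           }
           StringBuilder sb = new StringBuilder();
           for (int i = 0; i < Integer.parseInt(numParts); i++) {
               sb.append(props.get("spark.sql.sources.schema.part." + i));
           }
           return Optional.of(sb.toString());
       }
   }
   ```

   Without the provider and schema properties, `isDatasourceTable` is false 
and the schema reassembly never happens, which matches the fallback-to-parquet 
behavior described above.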
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
