pengzhiwei2018 edited a comment on pull request #2283: URL: https://github.com/apache/hudi/pull/2283#issuecomment-739497762
> @pengzhiwei2018 would you please describe in more details about the issue?

Hi @leesf, sorry for the late response. I found that when reading a Hudi table that has been updated two or more times with Spark SQL, the query result is incorrect and contains many duplicate records. After debugging, I found the cause: `HiveSyncTool` omits some table properties and a serde property when syncing the Hive table. Spark SQL therefore treats the table as a plain Hive table rather than a Hudi datasource table, and ends up reading it as a normal parquet table, which leads to the duplicate records in the query result. I fixed this issue by adding the missing table properties and serde properties to the Hive table.

The missing table properties are:

```
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
```

and the missing serde property is:

```
'path' = '/path/to/hudi'
```

With this fix, Spark SQL can now read the Hudi table registered in the Hive metastore correctly.
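For illustration only, here is a sketch of the equivalent Hive DDL that registers this metadata on an already-synced table. The table name, partition column, schema JSON, and base path below are placeholders, not values from the PR; the actual fix applies them programmatically inside `HiveSyncTool`:

```sql
-- Hypothetical example: mark an existing Hive table as a Hudi datasource table
-- so Spark SQL routes reads through the Hudi datasource instead of plain parquet.
ALTER TABLE my_hudi_table SET TBLPROPERTIES (
  'spark.sql.sources.provider' = 'hudi',
  'spark.sql.sources.schema.numParts' = '1',
  'spark.sql.sources.schema.part.0' = '{"type":"struct","fields":[...]}',  -- schema JSON, table-specific
  'spark.sql.sources.schema.numPartCols' = '1',
  'spark.sql.sources.schema.partCol.0' = 'partition_col'                   -- placeholder partition column
);

-- The serde property that points Spark SQL at the table's base path:
ALTER TABLE my_hudi_table SET SERDEPROPERTIES (
  'path' = '/path/to/hudi'
);
```

With `spark.sql.sources.provider` set, Spark SQL's catalog resolves the table through the named datasource, which is what lets Hudi's read path deduplicate file versions instead of scanning every parquet file.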