pengzhiwei2018 edited a comment on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-739497762


   > @pengzhiwei2018 would you please describe in more details about the issue?
   
   Hi @leesf , sorry for the late response. I found that when reading a Hudi 
table that has been updated more than once using Spark SQL, the query result 
is incorrect and contains many duplicate records.
   After debugging, I found the cause: `HiveSyncTool` misses some table 
properties and serde properties when syncing the table to Hive. Spark SQL 
therefore treats the table as a plain Hive table rather than a Hudi datasource 
table, and ends up reading it as a normal parquet table, which leads to the 
duplicate records in the query result.
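   For context on the read path: Spark SQL decides whether a table in the Hive 
metastore is a datasource table by looking at the `spark.sql.sources.provider` 
table property, so when that property is missing the table falls back to the 
plain Hive/parquet path. A minimal diagnostic sketch (the table name `hudi_tbl` 
is only a placeholder) to verify what was synced:

```scala
// A minimal diagnostic sketch, assuming a table named `hudi_tbl` (placeholder).
// If `spark.sql.sources.provider` is absent from the synced Hive table, Spark
// resolves it as a plain Hive table and reads the underlying parquet directly.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

val spark = SparkSession.builder()
  .appName("check-hudi-table-provider")
  .enableHiveSupport()
  .getOrCreate()

// List the table properties that HiveSyncTool wrote to the Hive metastore.
spark.sql("SHOW TBLPROPERTIES hudi_tbl").show(truncate = false)

// Check which provider Spark resolved for the table; before the fix this is
// empty (or treated as a Hive table) instead of 'hudi'.
val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("hudi_tbl"))
println(s"provider = ${meta.provider}")
```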
    I fixed this issue by adding the missing table properties and serde 
property to the Hive table. The missing table properties are:
   `spark.sql.sources.provider = 'hudi'
   spark.sql.sources.schema.numParts = 'xx'
   spark.sql.sources.schema.part.{num} = 'xx'
   spark.sql.sources.schema.numPartCols = 'xx'
   spark.sql.sources.schema.partCol.{num} = 'xx'`
   and the missing serde property is: `'path'='/path/to/hudi'`
   With this fix, Spark SQL can now read the Hudi table registered in the Hive 
metastore correctly. A workaround sketch for already-synced tables is below.
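   For tables that were synced before this fix, a possible workaround sketch is 
to set the missing properties manually through the Hive metastore client (which 
is what `HiveSyncTool` talks to). The database name, table name, schema JSON, 
and base path below are placeholders only; the PR itself makes `HiveSyncTool` 
write these properties during sync:

```scala
// A manual workaround sketch for tables that were synced before this fix.
// The database name, table name, schema JSON, and base path are placeholders.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val client = new HiveMetaStoreClient(new HiveConf())
val table = client.getTable("default", "hudi_tbl")

// Mark the table as a Hudi datasource table so Spark SQL no longer falls back
// to the plain Hive/parquet read path.
table.getParameters.put("spark.sql.sources.provider", "hudi")

// Attach the Spark schema JSON (placeholder single-column schema here) and the
// partition column information.
table.getParameters.put("spark.sql.sources.schema.numParts", "1")
table.getParameters.put("spark.sql.sources.schema.part.0",
  """{"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}}]}""")
table.getParameters.put("spark.sql.sources.schema.numPartCols", "1")
table.getParameters.put("spark.sql.sources.schema.partCol.0", "dt")

// Point the datasource relation at the Hudi base path via the serde property.
table.getSd.getSerdeInfo.getParameters.put("path", "/path/to/hudi")

client.alter_table("default", "hudi_tbl", table)
client.close()
```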
   

