pengzhiwei2018 edited a comment on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-739497762
> @pengzhiwei2018 would you please describe in more details about the issue?

Hi @leesf, sorry for the late response. I found that when reading a Hudi table
registered in the Hive metastore with Spark SQL (with
SparkSession#enableHiveSupport enabled), the query result is incorrect and
contains many duplicate records. After debugging, I found that the cause is
that some table properties and serde properties are missing from the Hive
table metadata. Spark SQL therefore treats the table as a plain Hive table
rather than a Hudi datasource table, and ends up reading it as a normal
parquet table, which leads to the duplicate records in the query result.
I fixed this issue by adding the missing table properties and serde
properties to the Hive table. The missing table properties are:
```
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
```
and the missing serde property is: `'path'='/path/to/hudi'`
With this fix, Spark SQL can now read the Hudi table in the Hive metastore
correctly.
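
For illustration, adding the missing metadata by hand would look roughly like the Hive DDL below. The table name, schema JSON, and partition column are hypothetical placeholders; the actual fix computes these values from the table's schema during Hive sync.

```sql
-- Hypothetical sketch: mark an existing Hive table as a Spark datasource
-- table backed by Hudi, so Spark SQL stops reading it as plain parquet.
-- 'hudi_table', the schema JSON, and 'dt' are placeholders, not real values.
ALTER TABLE hudi_table SET TBLPROPERTIES (
  'spark.sql.sources.provider' = 'hudi',
  'spark.sql.sources.schema.numParts' = '1',
  'spark.sql.sources.schema.part.0' = '{"type":"struct","fields":[]}',
  'spark.sql.sources.schema.numPartCols' = '1',
  'spark.sql.sources.schema.partCol.0' = 'dt'
);

-- The serde property pointing Spark at the Hudi base path:
ALTER TABLE hudi_table SET SERDEPROPERTIES ('path' = '/path/to/hudi');
```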
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]