[
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344656#comment-17344656
]
jing commented on HUDI-733:
---------------------------
I have verified that this problem no longer occurs in the new version.
> presto query data error
> -----------------------
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi
> Issue Type: Bug
> Components: Presto Integration
> Affects Versions: 0.5.1
> Reporter: jing
> Assignee: Bhavani Sudha
> Priority: Major
> Labels: sev:critical, user-support-issues
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png,
> presto_query_data.png
>
>
> We found a column-ordering issue in Hudi when importing data through the API
> (spark.read.json("filename") into a DataFrame, then writing to Hudi). The
> original record is rowkey:1 dt:2 time:3, but Presto returns unexpected
> values (rowkey:2 dt:1 time:2), while Hive returns the correct result.
> After analysis: when dt is used as the partition column, its value is also
> written into the parquet file (dt = xxx), and the partition column's value
> should be the one in the Hudi path. However, Presto appears to map query
> values to the columns in the parquet file one-to-one by position; it does
> not resolve them by column name.
> Possible workarounds and suggestions:
> # Can the InputFormat class be made to skip reading the partition column
> dt's value from parquet?
> # Can Hive data be synchronized without dt as the partition column?
> Consider adding a column such as repl_dt as the partition column and
> keeping dt as an ordinary field.
> # Do not write the dt column to the parquet file at all.
> # Write dt to the parquet file, but as the last column.
>
> [~bhasudha]
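The value shift described in the report can be sketched in plain Python. This is a hypothetical illustration, not Presto's actual code: the column names come from the report, and the positional-mapping logic is an assumption about the behavior being described.

```python
# Physical column order inside the parquet file when "dt" (the partition
# column) is written first by the ingestion path -- an assumed layout.
file_columns = ["dt", "rowkey", "time"]
file_row = [2, 1, 3]  # dt=2, rowkey=1, time=3

# Hive resolves columns by name, so every value lands under its own column.
by_name = dict(zip(file_columns, file_row))

# Per the report, Presto pairs the table's columns with the file's columns
# one-to-one by position, ignoring names, so the values shift.
table_columns = ["rowkey", "dt", "time"]
positional = dict(zip(table_columns, file_row))

print(by_name)     # {'dt': 2, 'rowkey': 1, 'time': 3}
print(positional)  # {'rowkey': 2, 'dt': 1, 'time': 3}
```

Writing dt as the last column (suggestion 4 above) would make the positional and by-name mappings agree for the non-partition columns, which is why it is proposed as a workaround.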
--
This message was sent by Atlassian Jira
(v8.3.4#803005)