[ https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jing updated HUDI-733:
----------------------
Description:

We found a column-ordering issue in Hudi when importing data through the API (spark.read.json("filename") into a DataFrame, then writing to Hudi). The original data is rowkey:1, dt:2, time:3, but querying the data through Presto returns unexpected values (rowkey:2, dt:1, time:2), while Hive returns the correct result.

After analysis: when dt is used as the partition column, it is also written into the parquet file. The value of the partition column should be the dt=xxx value taken from the Hudi path. However, Presto maps the queried columns to the parquet columns positionally, one-to-one; it does not match them by column name.

Possible fixes and suggestions:
# Can the InputFormat class skip reading the partition column dt from the parquet file?
# Can the Hive table be synchronized without dt as a partition column? For example, add a column such as repl_dt as the partition column and keep dt as an ordinary field.
# Do not write the dt column to the parquet file.
# Write dt to the parquet file, but as the last column.

[~bhasudha]


> presto query data error
> -----------------------
>
>                 Key: HUDI-733
>                 URL: https://issues.apache.org/jira/browse/HUDI-733
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Presto Integration
>    Affects Versions: 0.5.1
>            Reporter: jing
>            Priority: Major
>         Attachments: hive_table.png, parquet_context.png, parquet_schema.png, presto_query_data.png


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
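The positional mapping described in the report can be illustrated with plain Python (this is a hypothetical sketch, not Hudi or Presto code; the exact value shuffle seen in the report may differ, since it depends on the physical column order Hudi wrote):

```python
# Physical column order in the parquet file as written (hypothetical order):
parquet_columns = ["rowkey", "dt", "time"]
parquet_row = [1, 2, 3]  # one record: rowkey:1, dt:2, time:3

# Hive table schema with the partition column dt moved to the end:
table_columns = ["rowkey", "time", "dt"]

# Name-based lookup (how Hive resolves columns here) gives correct values:
by_name = {c: parquet_row[parquet_columns.index(c)] for c in table_columns}

# Positional, one-to-one mapping (the behavior the report describes for
# Presto) assigns dt's value to time and time's value to dt:
by_position = dict(zip(table_columns, parquet_row))

print(by_name)      # values match their column names
print(by_position)  # time and dt values are swapped
```

This is why Hive answers correctly while Presto does not: the table schema order and the parquet physical order disagree, and only name-based resolution tolerates that.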
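Suggestion 2 (a separate repl_dt partition column, with dt kept as an ordinary field) could look roughly like the following Spark datasource configuration. This is a hedged sketch, not a tested fix: the table name, paths, and the repl_dt column are made up for illustration, and it assumes the standard Hudi 0.5.x writer/hive-sync option keys:

```python
# Sketch of suggestion 2: partition by a copy of dt so that dt itself stays
# an ordinary (non-partition) field. Assumes a running SparkSession `spark`
# with the Hudi bundle on the classpath.
df = spark.read.json("filename")
df = df.withColumn("repl_dt", df["dt"])  # duplicate dt into repl_dt

(df.write.format("org.apache.hudi")
   .option("hoodie.table.name", "test_table")                          # hypothetical name
   .option("hoodie.datasource.write.recordkey.field", "rowkey")
   .option("hoodie.datasource.write.partitionpath.field", "repl_dt")   # partition on the copy
   .option("hoodie.datasource.hive_sync.enable", "true")
   .option("hoodie.datasource.hive_sync.partition_fields", "repl_dt")  # dt stays a normal column
   .mode("append")
   .save("/path/to/hudi/table"))                                       # hypothetical path
```

With this layout, dt in the parquet file lines up by position with dt in the Hive/Presto schema, since the partition column repl_dt is the only column resolved from the path.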