[jira] [Assigned] (HUDI-2780) Mor reads the log file and skips the complete block as a bad block, resulting in data loss
[ https://issues.apache.org/jira/browse/HUDI-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing reassigned HUDI-2780: -- Assignee: jing
> Mor reads the log file and skips the complete block as a bad block, resulting
> in data loss
> --
>
> Key: HUDI-2780
> URL: https://issues.apache.org/jira/browse/HUDI-2780
> Project: Apache Hudi
> Issue Type: Bug
>Reporter: jing
>Assignee: jing
>Priority: Major
> Attachments: image-2021-11-17-15-45-33-031.png,
> image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png
>
>
> Debugging the data in the middle of the bad block shows that the lost records
> lie within the bad block's offset range. Because the reader hits EOF and skips
> the block, the compaction merge never writes those records to Parquet, even
> though the deltacommit for that instant succeeded. There are two consecutive
> HUDI magic markers in the middle of the bad block; reading the blocksize at
> the next position actually interprets the bytes of #HUDI# as 1227030528,
> which exceeds the file size and so raises the EOF exception.
> !image-2021-11-17-15-45-33-031.png!
> When detecting the position of the next block in order to skip the bad block,
> the scan should start from the position before the blocksize was read, not
> from the position after it.
> !image-2021-11-17-15-46-04-313.png!
> !image-2021-11-17-15-46-14-694.png!
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2780) Mor reads the log file and skips the complete block as a bad block, resulting in data loss
jing created HUDI-2780: -- Summary: Mor reads the log file and skips the complete block as a bad block, resulting in data loss Key: HUDI-2780 URL: https://issues.apache.org/jira/browse/HUDI-2780 Project: Apache Hudi Issue Type: Bug Reporter: jing Attachments: image-2021-11-17-15-45-33-031.png, image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png Debugging the data in the middle of the bad block shows that the lost records lie within the bad block's offset range. Because the reader hits EOF and skips the block, the compaction merge never writes those records to Parquet, even though the deltacommit for that instant succeeded. There are two consecutive HUDI magic markers in the middle of the bad block; reading the blocksize at the next position actually interprets the bytes of #HUDI# as 1227030528, which exceeds the file size and so raises the EOF exception. !image-2021-11-17-15-45-33-031.png! When detecting the position of the next block in order to skip the bad block, the scan should start from the position before the blocksize was read, not from the position after it. !image-2021-11-17-15-46-04-313.png! !image-2021-11-17-15-46-14-694.png! -- This message was sent by Atlassian Jira (v8.20.1#820001)
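The rewind fix described above can be sketched as follows. This is illustrative Java, not the actual HoodieLogFileReader code; the #HUDI# magic string is from the report, but the 8-byte blocksize field and flat byte-buffer layout are simplifications for the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Why the bad-block scan must rewind to the position *before* the blocksize
// field: with two consecutive magic markers, a scan that starts after the
// blocksize bytes can jump past the second marker entirely.
public class BadBlockScanSketch {
    static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8);

    // Find the first offset >= from where MAGIC occurs, or -1 if none.
    static int nextMagic(byte[] buf, int from) {
        for (int i = Math.max(from, 0); i + MAGIC.length <= buf.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(buf, i, i + MAGIC.length), MAGIC)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A corrupt region with two consecutive magic markers, then the next block.
        byte[] buf = "garbage#HUDI##HUDI#blockdata".getBytes(StandardCharsets.UTF_8);
        int firstMagic = nextMagic(buf, 0);                  // 7: start of first magic
        int afterBlocksize = firstMagic + MAGIC.length + 8;  // reader consumed magic + 8-byte blocksize

        // Buggy scan: starting after the blocksize field misses the second magic.
        int buggy = nextMagic(buf, afterBlocksize);
        // Fixed scan: rewinding to just before the blocksize field finds it.
        int fixed = nextMagic(buf, firstMagic + MAGIC.length);

        System.out.println("buggy=" + buggy + " fixed=" + fixed); // buggy=-1 fixed=13
    }
}
```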
[jira] [Commented] (HUDI-733) presto query data error
[ https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344657#comment-17344657 ] jing commented on HUDI-733: --- thanks sivabalan narayanan , Bhavani Sudha > presto query data error > --- > > Key: HUDI-733 > URL: https://issues.apache.org/jira/browse/HUDI-733 > Project: Apache Hudi > Issue Type: Bug > Components: Presto Integration >Affects Versions: 0.5.1 >Reporter: jing >Assignee: Bhavani Sudha >Priority: Major > Labels: sev:critical, user-support-issues > Attachments: hive_table.png, parquet_context.png, parquet_schema.png, > presto_query_data.png > > > We found a data sequence issue in Hudi when we use API to import data(use > spark.read.json("filename") read to dataframe then write to hudi). The > original d is rowkey:1 dt:2 time:3. > But the value is unexpected when query the data by Presto(rowkey:2 dt:1 > time:2), but correctly in Hive. > After analysis, if I use dt to partition the column data, it is also written > in the parquet file. dt = xxx, and the value of the partition column should > be the value in the path of the hudi. However, I found that the value of the > presto query must be one-to-one with the columns in the parquet. He will not > detect the column names. > Transformation methods and suggestions: > # Can the inputformat class be ignored to read the column value of the > partition column dt in parquet? > # Can hive data be synchronized without dt as a partition column? Consider > adding a column such as repl_dt as a partition column and dt as an ordinary > field. > # The dt column is not written to the parquet file. > 4, dt is written to the parquet file, but as the last column. > > [~bhasudha] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-733) presto query data error
[ https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344656#comment-17344656 ] jing commented on HUDI-733: --- I have verified that there is no problem with the new version. > presto query data error > --- > > Key: HUDI-733 > URL: https://issues.apache.org/jira/browse/HUDI-733 > Project: Apache Hudi > Issue Type: Bug > Components: Presto Integration >Affects Versions: 0.5.1 >Reporter: jing >Assignee: Bhavani Sudha >Priority: Major > Labels: sev:critical, user-support-issues > Attachments: hive_table.png, parquet_context.png, parquet_schema.png, > presto_query_data.png > > > We found a data sequence issue in Hudi when we use API to import data(use > spark.read.json("filename") read to dataframe then write to hudi). The > original d is rowkey:1 dt:2 time:3. > But the value is unexpected when query the data by Presto(rowkey:2 dt:1 > time:2), but correctly in Hive. > After analysis, if I use dt to partition the column data, it is also written > in the parquet file. dt = xxx, and the value of the partition column should > be the value in the path of the hudi. However, I found that the value of the > presto query must be one-to-one with the columns in the parquet. He will not > detect the column names. > Transformation methods and suggestions: > # Can the inputformat class be ignored to read the column value of the > partition column dt in parquet? > # Can hive data be synchronized without dt as a partition column? Consider > adding a column such as repl_dt as a partition column and dt as an ordinary > field. > # The dt column is not written to the parquet file. > 4, dt is written to the parquet file, but as the last column. > > [~bhasudha] -- This message was sent by Atlassian Jira (v8.3.4#803005)
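The one-to-one positional matching described in the report can be illustrated with a small sketch. This is hypothetical Java, not Presto's reader code; the column names come from the report, but the helper methods are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A reader that maps values to table columns by position returns wrong values
// when the file's column order differs from the table schema; a by-name
// reader stays correct.
public class ColumnResolutionSketch {
    static Map<String, String> byPosition(List<String> tableCols, List<String> fileVals) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < tableCols.size(); i++) {
            row.put(tableCols.get(i), fileVals.get(i)); // position i, ignores names
        }
        return row;
    }

    static Map<String, String> byName(List<String> tableCols, List<String> fileCols,
                                      List<String> fileVals) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String c : tableCols) {
            row.put(c, fileVals.get(fileCols.indexOf(c))); // look up by column name
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> tableCols = Arrays.asList("rowkey", "dt", "time"); // Hive table order
        List<String> fileCols  = Arrays.asList("dt", "rowkey", "time"); // order in the parquet file
        List<String> fileVals  = Arrays.asList("2", "1", "3");
        System.out.println(byPosition(tableCols, fileVals));           // rowkey=2, dt=1: wrong
        System.out.println(byName(tableCols, fileCols, fileVals));     // rowkey=1, dt=2: correct
    }
}
```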
[jira] [Updated] (HUDI-1792) flink-client query error when processing files larger than 128mb
[ https://issues.apache.org/jira/browse/HUDI-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-1792: --- Summary: flink-client query error when processing files larger than 128mb (was: Fix flink-client query error when processing files larger than 128mb)
> flink-client query error when processing files larger than 128mb
> -
>
> Key: HUDI-1792
> URL: https://issues.apache.org/jira/browse/HUDI-1792
> Project: Apache Hudi
> Issue Type: Bug
> Components: Flink Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
>
> Use the flink client to query the cow table and report an error. The error
> message is as follows:
> {code:java}
> Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
>  at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
>  at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
>  at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
>  at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
>  at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
>  at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
>  at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
>  at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
>  at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
>  at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
>  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
>  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
>  at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
>  at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
>  ... 4 more
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1792) Fix flink-client query error when processing files larger than 128mb
[ https://issues.apache.org/jira/browse/HUDI-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-1792: --- Description: Use the flink client to query the cow table and report an error. The error message is as follows:
{code:java}
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
 at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
 at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
 at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
 at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
 at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
 at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
 at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
 at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
 at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
 at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
 ... 4 more
{code}
was: Use the flink client to query the cow table and report an error. The error message is as follows:
{code:java}
// code placeholder
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
 at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
 at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
 at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
 at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
 at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
 at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
 at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
 at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
 at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
 at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
 ... 4 more
{code}

> Fix flink-client query error when processing files larger than 128mb
> -
>
> Key: HUDI-1792
> URL: https://issues.apache.org/jira/browse/HUDI-1792
> Project: Apache Hudi
> Issue Type: Bug
> Components: Flink Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
>
> Use the flink client to query the cow table and report an error. The error
> message is as follows:
> {code:java}
> Caused by: org.apache.flink.runtime.JobException: Creating the input splits
> caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast
[jira] [Created] (HUDI-1792) Fix flink-client query error when processing files larger than 128mb
jing created HUDI-1792: -- Summary: Fix flink-client query error when processing files larger than 128mb Key: HUDI-1792 URL: https://issues.apache.org/jira/browse/HUDI-1792 Project: Apache Hudi Issue Type: Bug Components: Flink Integration Reporter: jing Assignee: jing Use the flink client to query the cow table and report an error. The error message is as follows:
{code:java}
// code placeholder
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
 at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
 at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
 at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
 at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
 at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
 at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
 at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
 at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
 at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
 at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
 at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
 ... 4 more
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
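The ClassCastException above is the generic failure mode of sorting objects that do not implement Comparable. A minimal stand-alone reproduction, with BlockLoc as a hypothetical stand-in for org.apache.hadoop.fs.HdfsBlockLocation (the explicit-Comparator variant shows one way such a sort can be made safe):

```java
import java.util.Arrays;
import java.util.Comparator;

// Arrays.sort(Object[]) relies on natural ordering (Comparable) and throws
// ClassCastException for types that do not implement it; supplying an
// explicit Comparator avoids the cast entirely.
public class SortCastSketch {
    static class BlockLoc {
        final long offset;
        BlockLoc(long offset) { this.offset = offset; }
    }

    public static void main(String[] args) {
        Object[] locs = { new BlockLoc(128L << 20), new BlockLoc(0L) };
        boolean threw = false;
        try {
            Arrays.sort(locs); // casts elements to Comparable -> ClassCastException
        } catch (ClassCastException e) {
            threw = true;
        }
        System.out.println("natural sort threw: " + threw); // true

        BlockLoc[] typed = { new BlockLoc(128L << 20), new BlockLoc(0L) };
        Arrays.sort(typed, Comparator.comparingLong(b -> b.offset)); // explicit comparator works
        System.out.println("first offset after sort: " + typed[0].offset); // 0
    }
}
```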
[jira] [Created] (HUDI-1784) Added print detailed stack log when hbase connection error
jing created HUDI-1784: -- Summary: Added print detailed stack log when hbase connection error Key: HUDI-1784 URL: https://issues.apache.org/jira/browse/HUDI-1784 Project: Apache Hudi Issue Type: Improvement Components: Index Reporter: jing Assignee: jing I tried upgrading HDFS to version 3.0 and found that HBase reported an error and could not connect, even though HBase itself was healthy; debugging showed the cause was a jar conflict. The exception did not print a detailed stack trace, so the root cause of the problem could not be located precisely. -- This message was sent by Atlassian Jira (v8.3.4#803005)
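The improvement being requested, logging the full stack trace instead of only the exception message, can be sketched as below. This is illustrative code, not Hudi's actual logging; the NoSuchMethodError stands in for the kind of jar-conflict failure described above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Logging only e.getMessage() hides where a connection failure came from;
// rendering the full stack trace pinpoints the class, method, and line.
public class StackLogSketch {
    static String messageOnly(Throwable t) {
        return "hbase connect failed: " + t.getMessage(); // no location information
    }

    static String withStack(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw)); // includes every stack frame
        return "hbase connect failed:\n" + sw;
    }

    public static void main(String[] args) {
        // A jar conflict typically surfaces as a linkage error like this one.
        Throwable t = new NoSuchMethodError("org.apache.hadoop.hbase.SomeClass.someMethod");
        System.out.println(messageOnly(t));
        System.out.println(withStack(t));
    }
}
```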
[jira] [Created] (HUDI-1347) Hbase index partition changes cause data duplication problems
jing created HUDI-1347: -- Summary: Hbase index partition changes cause data duplication problems Key: HUDI-1347 URL: https://issues.apache.org/jira/browse/HUDI-1347 Project: Apache Hudi Issue Type: Bug Components: Index Reporter: jing Assignee: jing
1. A record repeatedly changes partition. After the deduplication operation, the partition information in the key of the HoodieRecord object is inconsistent with the data. E.g.:
id,oid,name,dt,isdeleted,lastupdatedttm,rowkey
9,1,,2018,0,2020-02-17 00:50:25.01,00_test1-9-1
9,1,,2019,0,2020-02-17 00:50:25.02,00_test1-9-1
rowkey is the primary key and dt is the partition column. After deduplication, the key of the HoodieRecord object is (00_test1-9-1,2018); it should be (00_test1-9-1,2019).
2. An exception in the Hudi task caused the HBase index to be written successfully while the task failed. If the task is retried, the partition-change data is treated only as a new insert, and the data under the old partition is not deleted.
Solution:
1. Fix the incorrect partition information in the HoodieRecord key caused by the deduplication operation.
2. Add a rollback operation to the HBase index instead of doing nothing; the partition change needs to be rolled back to the index of the last successful commit.
3. Add test cases.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
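Point 1 of the proposed solution, keeping the partition of the record with the latest lastupdatedttm during deduplication, might look like the sketch below. Record is a simplified stand-in for HoodieRecord; this is illustrative, not the actual Hudi dedup code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Dedup by rowkey, keeping the record with the latest lastUpdated timestamp,
// so the key's partition value matches the surviving data.
public class DedupSketch {
    static class Record {
        final String rowkey, dt, lastUpdated;
        Record(String rowkey, String dt, String lastUpdated) {
            this.rowkey = rowkey;
            this.dt = dt;
            this.lastUpdated = lastUpdated;
        }
    }

    static Map<String, Record> dedup(List<Record> records) {
        Map<String, Record> latest = new HashMap<>();
        for (Record r : records) {
            // Keep whichever record has the later timestamp (string compare
            // works for these fixed-format timestamps).
            latest.merge(r.rowkey, r,
                (oldR, newR) -> oldR.lastUpdated.compareTo(newR.lastUpdated) >= 0 ? oldR : newR);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Record> in = Arrays.asList(
            new Record("00_test1-9-1", "2018", "2020-02-17 00:50:25.01"),
            new Record("00_test1-9-1", "2019", "2020-02-17 00:50:25.02"));
        Record r = dedup(in).get("00_test1-9-1");
        System.out.println(r.dt); // 2019: key partition now matches the latest data
    }
}
```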
[jira] [Assigned] (HUDI-1184) Support updatePartitionPath for HBaseIndex
[ https://issues.apache.org/jira/browse/HUDI-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing reassigned HUDI-1184: -- Assignee: jing (was: Ryan Pifer) > Support updatePartitionPath for HBaseIndex > -- > > Key: HUDI-1184 > URL: https://issues.apache.org/jira/browse/HUDI-1184 > Project: Apache Hudi > Issue Type: Bug > Components: Index >Affects Versions: 0.6.1 >Reporter: sivabalan narayanan >Assignee: jing >Priority: Major > > In implicit global indexes, we have a config named updatePartitionPath. When > an already existing record is upserted to a new partition (compared to where > it is in storage), if the config is set to true, record is inserted to new > partition and deleted in old partition. If the config is set to false, record > is upserted to old partition ignoring the new partition. > > Don't think we have this fix for HBase. We need similar support in HBase too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1184) Support updatePartitionPath for HBaseIndex
[ https://issues.apache.org/jira/browse/HUDI-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178703#comment-17178703 ] jing commented on HUDI-1184: I have tried to fix the problem in 0.5.2 and tested it, and I will modify it in the master branch later. Please assign this problem to me. > Support updatePartitionPath for HBaseIndex > -- > > Key: HUDI-1184 > URL: https://issues.apache.org/jira/browse/HUDI-1184 > Project: Apache Hudi > Issue Type: Bug > Components: Index >Affects Versions: 0.6.1 >Reporter: sivabalan narayanan >Assignee: Ryan Pifer >Priority: Major > > In implicit global indexes, we have a config named updatePartitionPath. When > an already existing record is upserted to a new partition (compared to where > it is in storage), if the config is set to true, record is inserted to new > partition and deleted in old partition. If the config is set to false, record > is upserted to old partition ignoring the new partition. > > Don't think we have this fix for HBase. We need similar support in HBase too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
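The updatePartitionPath semantics described in the issue can be sketched as follows, with a hypothetical global index modeled as a map from record key to partition. Method names are illustrative, not the HBaseIndex API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// updatePartitionPath=true: delete from old partition, insert into new one.
// updatePartitionPath=false: upsert into the old partition, ignoring the new.
public class UpdatePartitionPathSketch {
    // Returns the "partition:operation" writes produced for one incoming upsert.
    static List<String> tagUpsert(Map<String, String> index, String key,
                                  String newPartition, boolean updatePartitionPath) {
        String oldPartition = index.get(key);
        if (oldPartition == null || oldPartition.equals(newPartition)) {
            return Collections.singletonList(newPartition + ":upsert");
        }
        if (updatePartitionPath) {
            return Arrays.asList(oldPartition + ":delete", newPartition + ":insert");
        }
        return Collections.singletonList(oldPartition + ":upsert");
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("00_test1-9-1", "2018"); // record currently lives in partition 2018
        System.out.println(tagUpsert(index, "00_test1-9-1", "2019", true));  // [2018:delete, 2019:insert]
        System.out.println(tagUpsert(index, "00_test1-9-1", "2019", false)); // [2018:upsert]
    }
}
```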
[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end
[ https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-289: -- Issue Type: Test (was: Bug) > Implement a test suite to support long running test for Hudi writing and > querying end-end > - > > Key: HUDI-289 > URL: https://issues.apache.org/jira/browse/HUDI-289 > Project: Apache Hudi > Issue Type: Test > Components: Usability >Reporter: Vinoth Chandar >Assignee: Nishith Agarwal >Priority: Blocker > Labels: pull-request-available > Fix For: 0.6.0 > > > We would need an equivalent of an end-end test which runs some workload for > few hours at least, triggers various actions like commit, deltacommit, > rollback, compaction and ensures correctness of code before every release > P.S: Learn from all the CSS issues managing compaction.. > The feature branch is here: > [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1000) incremental query for COW non-partitioned table no data
jing created HUDI-1000: -- Summary: incremental query for COW non-partitioned table no data Key: HUDI-1000 URL: https://issues.apache.org/jira/browse/HUDI-1000 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Reporter: jing Assignee: jing Attachments: 设置前后对比.png
Incremental queries return data correctly on a partitioned table, but return no data on a non-partitioned table.
These are my commit times:
/tmp/test_commit/.hoodie/20200603154154.commit
/tmp/test_commit/.hoodie/20200603154224.commit
/tmp/test_commit/.hoodie/20200603154253.commit
/tmp/test_commit/.hoodie/20200603202911.commit
These are the Hive session settings:
set hoodie.test_commit.consume.mode=INCREMENTAL;
set hoodie.test_commit.consume.start.timestamp=20200603154154;
set hoodie.test_commit.consume.max.commits=1;
After applying the settings, see the attached before/after comparison.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
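The INCREMENTAL consume semantics implied by these settings, commits strictly after start.timestamp capped by max.commits, can be sketched as below. This is an illustrative model of the filtering, not Hudi's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Select the commits an incremental query should pull: those strictly after
// the start timestamp, in order, limited to max.commits.
public class IncrementalFilterSketch {
    static List<String> commitsToPull(List<String> commits, String startTs, int maxCommits) {
        return commits.stream()
            .filter(c -> c.compareTo(startTs) > 0) // strictly after consume.start.timestamp
            .sorted()                              // commit times sort lexicographically
            .limit(maxCommits)                     // consume.max.commits
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> commits = Arrays.asList(
            "20200603154154", "20200603154224", "20200603154253", "20200603202911");
        // With start.timestamp=20200603154154 and max.commits=1, only the
        // next commit is pulled.
        System.out.println(commitsToPull(commits, "20200603154154", 1)); // [20200603154224]
    }
}
```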
[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end
[ https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-289: -- Issue Type: Bug (was: Test) > Implement a test suite to support long running test for Hudi writing and > querying end-end > - > > Key: HUDI-289 > URL: https://issues.apache.org/jira/browse/HUDI-289 > Project: Apache Hudi > Issue Type: Bug > Components: Usability >Reporter: Vinoth Chandar >Assignee: Nishith Agarwal >Priority: Blocker > Labels: pull-request-available > Fix For: 0.6.0 > > > We would need an equivalent of an end-end test which runs some workload for > few hours at least, triggers various actions like commit, deltacommit, > rollback, compaction and ensures correctness of code before every release > P.S: Learn from all the CSS issues managing compaction.. > The feature branch is here: > [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1000) incremental query for COW non-partitioned table no data
[ https://issues.apache.org/jira/browse/HUDI-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127268#comment-17127268 ] jing commented on HUDI-1000: PR: [https://github.com/apache/hudi/pull/1708] See the attached screenshot for the query result after the fix. !修复后的查询结果.png! > incremental query for COW non-partitioned table no data > --- > > Key: HUDI-1000 > URL: https://issues.apache.org/jira/browse/HUDI-1000 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: jing >Assignee: jing >Priority: Major > Attachments: 修复后的查询结果.png, 设置前后对比.png > > > I use a partitioned table to query normally, but a non-partitioned table to > query abnormally > this my commit time > /tmp/test_commit/.hoodie/20200603154154.commit > /tmp/test_commit/.hoodie/20200603154224.commit > /tmp/test_commit/.hoodie/20200603154253.commit > /tmp/test_commit/.hoodie/20200603202911.commit > > this hive set config > set hoodie.test_commit.consume.mode=INCREMENTAL; > set hoodie.test_commit.consume.start.timestamp=20200603154154; > set hoodie.test_commit.consume.max.commits=1; > > After setting, compare before and after reference accessories -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1000) incremental query for COW non-partitioned table no data
[ https://issues.apache.org/jira/browse/HUDI-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-1000: --- Attachment: 修复后的查询结果.png > incremental query for COW non-partitioned table no data > --- > > Key: HUDI-1000 > URL: https://issues.apache.org/jira/browse/HUDI-1000 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: jing >Assignee: jing >Priority: Major > Attachments: 修复后的查询结果.png, 设置前后对比.png > > > I use a partitioned table to query normally, but a non-partitioned table to > query abnormally > this my commit time > /tmp/test_commit/.hoodie/20200603154154.commit > /tmp/test_commit/.hoodie/20200603154224.commit > /tmp/test_commit/.hoodie/20200603154253.commit > /tmp/test_commit/.hoodie/20200603202911.commit > > this hive set config > set hoodie.test_commit.consume.mode=INCREMENTAL; > set hoodie.test_commit.consume.start.timestamp=20200603154154; > set hoodie.test_commit.consume.max.commits=1; > > After setting, compare before and after reference accessories -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-733) presto query data error
[ https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jing updated HUDI-733: -- Description: We found a data-ordering issue in Hudi when we use the API to import data (spark.read.json("filename") into a dataframe, then write to Hudi). The original data is rowkey:1 dt:2 time:3, but the values are wrong when querying with Presto (rowkey:2 dt:1 time:2), while Hive returns them correctly. After analysis: when I use dt as the partition column, it is also written into the parquet file as dt=xxx, and the value of the partition column should be the value in the Hudi path. However, the values Presto returns are matched one-to-one by position with the columns in the parquet file; it does not match on column names. Suggested fixes: # Can the inputformat class ignore the partition column dt when reading column values from parquet? # Can Hive data be synchronized without dt as a partition column? Consider adding a column such as repl_dt as the partition column, with dt as an ordinary field. # Do not write the dt column to the parquet file. # Write dt to the parquet file, but as the last column. [~bhasudha] was: We found a data sequence issue in Hudi when we use API to import data(use spark.read.json("filename") read to dataframe then write to hudi). The original d is rowkey:1 dt:2 time:3. But the value is unexpected when query the data by Presto(rowkey:2 dt:1 time:2), but correctly in Hive. After analysis, if I use dt to partition the column data, it is also written in the parquet file. dt = xxx, and the value of the partition column should be the value in the path of the hudi. However, I found that the value of the presto query must be one-to-one with the columns in the parquet. He will not detect the column names. Transformation methods and suggestions: # Can the inputformat class be ignored to read the column value of the partition column dt in parquet? 
# Can hive data be synchronized without dt as a partition column? Consider adding a column such as repl_dt as a partition column and dt as an ordinary field. # The dt column is not written to the parquet file. 4, dt is written to the parquet file, but as the last column. @Sudha > presto query data error > --- > > Key: HUDI-733 > URL: https://issues.apache.org/jira/browse/HUDI-733 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Presto Integration >Affects Versions: 0.5.1 >Reporter: jing >Priority: Major > Attachments: hive_table.png, parquet_context.png, parquet_schema.png, > presto_query_data.png > > > We found a data sequence issue in Hudi when we use API to import data(use > spark.read.json("filename") read to dataframe then write to hudi). The > original d is rowkey:1 dt:2 time:3. > But the value is unexpected when query the data by Presto(rowkey:2 dt:1 > time:2), but correctly in Hive. > After analysis, if I use dt to partition the column data, it is also written > in the parquet file. dt = xxx, and the value of the partition column should > be the value in the path of the hudi. However, I found that the value of the > presto query must be one-to-one with the columns in the parquet. He will not > detect the column names. > Transformation methods and suggestions: > # Can the inputformat class be ignored to read the column value of the > partition column dt in parquet? > # Can hive data be synchronized without dt as a partition column? Consider > adding a column such as repl_dt as a partition column and dt as an ordinary > field. > # The dt column is not written to the parquet file. > 4, dt is written to the parquet file, but as the last column. > > [~bhasudha] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-733) presto query data error
[ https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066417#comment-17066417 ] jing commented on HUDI-733: --- [~bhasudha] help me > presto query data error > --- > > Key: HUDI-733 > URL: https://issues.apache.org/jira/browse/HUDI-733 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Presto Integration >Affects Versions: 0.5.1 >Reporter: jing >Priority: Major > Attachments: hive_table.png, parquet_context.png, parquet_schema.png, > presto_query_data.png > > > We found a data sequence issue in Hudi when we use API to import data(use > spark.read.json("filename") read to dataframe then write to hudi). The > original d is rowkey:1 dt:2 time:3. > But the value is unexpected when query the data by Presto(rowkey:2 dt:1 > time:2), but correctly in Hive. > After analysis, if I use dt to partition the column data, it is also written > in the parquet file. dt = xxx, and the value of the partition column should > be the value in the path of the hudi. However, I found that the value of the > presto query must be one-to-one with the columns in the parquet. He will not > detect the column names. > Transformation methods and suggestions: > # Can the inputformat class be ignored to read the column value of the > partition column dt in parquet? > # Can hive data be synchronized without dt as a partition column? Consider > adding a column such as repl_dt as a partition column and dt as an ordinary > field. > # The dt column is not written to the parquet file. > 4, dt is written to the parquet file, but as the last column. > > @Sudha -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-733) presto query data error
jing created HUDI-733: - Summary: presto query data error Key: HUDI-733 URL: https://issues.apache.org/jira/browse/HUDI-733 Project: Apache Hudi (incubating) Issue Type: Bug Components: Presto Integration Affects Versions: 0.5.1 Reporter: jing Attachments: hive_table.png, parquet_context.png, parquet_schema.png, presto_query_data.png We found a data-ordering issue in Hudi when we use the API to import data (spark.read.json("filename") into a dataframe, then write to Hudi). The original data is rowkey:1 dt:2 time:3, but the values are wrong when querying with Presto (rowkey:2 dt:1 time:2), while Hive returns them correctly. After analysis: when I use dt as the partition column, it is also written into the parquet file as dt=xxx, and the value of the partition column should be the value in the Hudi path. However, the values Presto returns are matched one-to-one by position with the columns in the parquet file; it does not match on column names. Suggested fixes: # Can the inputformat class ignore the partition column dt when reading column values from parquet? # Can Hive data be synchronized without dt as a partition column? Consider adding a column such as repl_dt as the partition column, with dt as an ordinary field. # Do not write the dt column to the parquet file. # Write dt to the parquet file, but as the last column. @Sudha -- This message was sent by Atlassian Jira (v8.3.4#803005)