[jira] [Assigned] (HUDI-2780) Mor reads the log file and skips the complete block as a bad block, resulting in data loss

2021-11-16 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing reassigned HUDI-2780:
--

Assignee: jing

> Mor reads the log file and skips the complete block as a bad block, resulting 
> in data loss
> --
>
> Key: HUDI-2780
> URL: https://issues.apache.org/jira/browse/HUDI-2780
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jing
>Assignee: jing
>Priority: Major
> Attachments: image-2021-11-17-15-45-33-031.png, 
> image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png
>
>
> Debugging the data in the middle of the bad block shows that the lost records 
> fall within the offset range of the bad block. Because the reader skips the block 
> on EOF, the compaction merge never writes those records to parquet, even though 
> the deltacommit for that instant succeeded. There are two consecutive HUDI magic 
> markers in the middle of the bad block. Reading the blocksize at the next position 
> actually reads the binary conversion of #HUDI#, giving 1227030528, so the EOF 
> exception is reported because that value exceeds the file size.
> !image-2021-11-17-15-45-33-031.png!
> When detecting the position of the next block in order to skip the bad block, the 
> scan should not start from the position after the blocksize has been read, but 
> from the position before the blocksize is read.
> !image-2021-11-17-15-46-04-313.png!
> !image-2021-11-17-15-46-14-694.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2780) Mor reads the log file and skips the complete block as a bad block, resulting in data loss

2021-11-16 Thread jing (Jira)
jing created HUDI-2780:
--

 Summary: Mor reads the log file and skips the complete block as a 
bad block, resulting in data loss
 Key: HUDI-2780
 URL: https://issues.apache.org/jira/browse/HUDI-2780
 Project: Apache Hudi
  Issue Type: Bug
Reporter: jing
 Attachments: image-2021-11-17-15-45-33-031.png, 
image-2021-11-17-15-46-04-313.png, image-2021-11-17-15-46-14-694.png

Debugging the data in the middle of the bad block shows that the lost records fall 
within the offset range of the bad block. Because the reader skips the block on EOF, 
the compaction merge never writes those records to parquet, even though the 
deltacommit for that instant succeeded. There are two consecutive HUDI magic markers 
in the middle of the bad block. Reading the blocksize at the next position actually 
reads the binary conversion of #HUDI#, giving 1227030528, so the EOF exception is 
reported because that value exceeds the file size.

!image-2021-11-17-15-45-33-031.png!
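
For reference, here is a quick check (plain Java, not Hudi's reader code) of where the 
reported 1227030528 can come from: it is exactly the big-endian interpretation of the 
bytes 'I', '#', 0x00, 0x00, i.e. the tail of a "#HUDI#" magic followed by two zero 
bytes read where a block size was expected. The byte layout is an assumption inferred 
from that number.
{code:java}
import java.nio.ByteBuffer;

public class BadBlockSizeCheck {
  public static void main(String[] args) {
    // Assumed layout: the last two bytes of a "#HUDI#" magic ('I', '#') followed by
    // two zero bytes, read as a big-endian int where the block size was expected.
    byte[] misread = new byte[] {'I', '#', 0x00, 0x00};
    int bogusBlockSize = ByteBuffer.wrap(misread).getInt();
    // Prints 1227030528, far larger than the log file itself, so the reader hits EOF
    // and treats the whole block as corrupt.
    System.out.println(bogusBlockSize);
  }
}
{code}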

When detecting the position of the next block in order to skip the bad block, the 
scan should not start from the position after the blocksize has been read, but from 
the position before the blocksize is read; a minimal illustration follows the 
screenshots below.

!image-2021-11-17-15-46-04-313.png!

!image-2021-11-17-15-46-14-694.png!
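
A minimal, self-contained sketch of the scan-start point described above (plain byte 
arrays, not Hudi's HoodieLogFileReader API; the 8-byte size field and the byte layout 
are assumptions for illustration). Scanning from the position before the block-size 
field finds the back-to-back magic, while scanning from the position after the size 
was consumed misses it:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RewindScanDemo {
  private static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8);

  // Returns the offset of the next "#HUDI#" magic at or after 'from', or -1 if none.
  static int nextMagic(byte[] data, int from) {
    for (int i = Math.max(from, 0); i + MAGIC.length <= data.length; i++) {
      if (Arrays.equals(Arrays.copyOfRange(data, i, i + MAGIC.length), MAGIC)) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Hypothetical corrupt region where two "#HUDI#" magics appear back to back.
    byte[] log = "....#HUDI##HUDI#....".getBytes(StandardCharsets.UTF_8);
    int firstMagic = 4;                                  // start of the first magic
    int posBeforeBlockSize = firstMagic + MAGIC.length;  // 10: where the size field begins
    int posAfterBlockSize = posBeforeBlockSize + 8;      // 18: after consuming an assumed 8-byte size

    System.out.println(nextMagic(log, posBeforeBlockSize)); // 10 -> the next block is found
    System.out.println(nextMagic(log, posAfterBlockSize));  // -1 -> the next block is skipped
  }
}
{code}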



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-733) presto query data error

2021-05-14 Thread jing (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344657#comment-17344657
 ] 

jing commented on HUDI-733:
---

Thanks, sivabalan narayanan and Bhavani Sudha.

> presto query data error
> ---
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.5.1
>Reporter: jing
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: sev:critical, user-support-issues
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
> presto_query_data.png
>
>
> We found a column-order issue in Hudi when we use the API to import data (use 
> spark.read.json("filename") to read into a dataframe, then write to hudi). The 
> original data is rowkey:1 dt:2 time:3.
> But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
> time:2), while Hive returns them correctly.
> After analysis: when dt is used as the partition column, it is also written into 
> the parquet file as dt=xxx, and the value of the partition column should be the 
> value in the hudi path. However, the Presto query maps values to the parquet 
> columns strictly by position; it does not match them by column name.
> Workarounds and suggestions:
>  # Can the inputformat class skip reading the value of the partition column dt 
> from parquet?
>  # Can the hive data be synchronized without dt as a partition column? Consider 
> adding a column such as repl_dt as the partition column and keeping dt as an 
> ordinary field.
>  # The dt column is not written to the parquet file.
>  # dt is written to the parquet file, but as the last column.
>  
> [~bhasudha]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-733) presto query data error

2021-05-14 Thread jing (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344656#comment-17344656
 ] 

jing commented on HUDI-733:
---

I have verified that there is no problem with the new version.

> presto query data error
> ---
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.5.1
>Reporter: jing
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: sev:critical, user-support-issues
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
> presto_query_data.png
>
>
> We found a column-order issue in Hudi when we use the API to import data (use 
> spark.read.json("filename") to read into a dataframe, then write to hudi). The 
> original data is rowkey:1 dt:2 time:3.
> But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
> time:2), while Hive returns them correctly.
> After analysis: when dt is used as the partition column, it is also written into 
> the parquet file as dt=xxx, and the value of the partition column should be the 
> value in the hudi path. However, the Presto query maps values to the parquet 
> columns strictly by position; it does not match them by column name.
> Workarounds and suggestions:
>  # Can the inputformat class skip reading the value of the partition column dt 
> from parquet?
>  # Can the hive data be synchronized without dt as a partition column? Consider 
> adding a column such as repl_dt as the partition column and keeping dt as an 
> ordinary field.
>  # The dt column is not written to the parquet file.
>  # dt is written to the parquet file, but as the last column.
>  
> [~bhasudha]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1792) flink-client query error when processing files larger than 128mb

2021-04-13 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-1792:
---
Summary: flink-client  query error when processing files larger than 128mb  
(was: Fix flink-client  query error when processing files larger than 128mb)

> flink-client  query error when processing files larger than 128mb
> -
>
> Key: HUDI-1792
> URL: https://issues.apache.org/jira/browse/HUDI-1792
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
>
> Using the Flink client to query the COW table reports an error. The error 
> message is as follows:
> {code:java}
> Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
> Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
>   at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
>   at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
>   at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
>   at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
>   at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
>   at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
>   at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
>   at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
>   at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
>   at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
>   at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
>   at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
>   at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
>   at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
>   ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1792) Fix flink-client query error when processing files larger than 128mb

2021-04-13 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-1792:
---
Description: 
Using the Flink client to query the COW table reports an error. The error message is 
as follows:
{code:java}
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
  at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
  at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
  at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
  at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
  at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
  at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
  at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
  at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
  at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
  at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
  at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
  ... 4 more
{code}

  was:
Using the Flink client to query the COW table reports an error. The error message is 
as follows:
{code:java}
// code placeholder
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
  at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
  at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
  at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
  at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
  at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
  at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
  at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
  at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
  at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
  at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
  at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
  ... 4 more
{code}


> Fix flink-client  query error when processing files larger than 128mb
> -
>
> Key: HUDI-1792
> URL: https://issues.apache.org/jira/browse/HUDI-1792
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
>
> Using the Flink client to query the COW table reports an error. The error 
> message is as follows:
> {code:java}
> Caused by: org.apache.flink.runtime.JobException: Creating the input splits 
> caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast 

[jira] [Created] (HUDI-1792) Fix flink-client query error when processing files larger than 128mb

2021-04-13 Thread jing (Jira)
jing created HUDI-1792:
--

 Summary: Fix flink-client  query error when processing files 
larger than 128mb
 Key: HUDI-1792
 URL: https://issues.apache.org/jira/browse/HUDI-1792
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Reporter: jing
Assignee: jing


Using the Flink client to query the COW table reports an error. The error message is 
as follows:
{code:java}
// code placeholder
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: org.apache.hadoop.fs.HdfsBlockLocation cannot be cast to java.lang.Comparable
  at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:260)
  at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:866)
  at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:257)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322)
  at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276)
  at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
  at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
  at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
  at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
  at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
  at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
  at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
  at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
  at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
  ... 4 more
{code}
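
A minimal sketch of the likely failure mode behind this trace, assuming the input 
splits sort the raw block locations with natural ordering (illustrative only, not 
Flink's or Hudi's actual code): a file larger than one HDFS block yields several 
BlockLocation entries, and Arrays.sort on them throws the ClassCastException above 
because BlockLocation does not implement Comparable. Sorting with an explicit 
comparator avoids the cast.
{code:java}
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.fs.BlockLocation;

public class BlockLocationSortSketch {
  // Fails at runtime with "HdfsBlockLocation cannot be cast to java.lang.Comparable"
  // once there is more than one block, i.e. for files larger than the block size.
  static void sortNaively(BlockLocation[] blocks) {
    Arrays.sort(blocks);
  }

  // Sorting with an explicit comparator (by block offset here) avoids the cast.
  static void sortByOffset(BlockLocation[] blocks) {
    Arrays.sort(blocks, Comparator.comparingLong(BlockLocation::getOffset));
  }
}
{code}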



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1784) Added print detailed stack log when hbase connection error

2021-04-09 Thread jing (Jira)
jing created HUDI-1784:
--

 Summary: Added print detailed stack log when hbase connection error
 Key: HUDI-1784
 URL: https://issues.apache.org/jira/browse/HUDI-1784
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Index
Reporter: jing
Assignee: jing


I tried to upgrade HDFS to version 3.0 and found that hbase reported an error and 
could not connect, even though hbase itself was healthy; debugging showed it was a 
jar conflict problem. The exception did not print the detailed stack trace, so the 
precise cause of the problem could not be located.
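
A minimal sketch of the logging change being proposed, with a hypothetical connect 
step (not Hudi's actual HBaseIndex code): passing the throwable to the logger prints 
the full stack trace, which is what pinpoints a jar-conflict error such as a 
NoSuchMethodError.
{code:java}
import org.apache.log4j.Logger;

public class HBaseConnectLoggingSketch {
  private static final Logger LOG = Logger.getLogger(HBaseConnectLoggingSketch.class);

  // 'connect' is a stand-in for whatever establishes the HBase connection.
  static void connectWithFullStack(Runnable connect) {
    try {
      connect.run();
    } catch (Throwable t) {
      // Logging only t.getMessage() hides the root cause; passing the throwable
      // itself prints the detailed stack trace.
      LOG.error("Failed to connect to HBase", t);
      throw new RuntimeException(t);
    }
  }
}
{code}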



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1347) Hbase index partition changes cause data duplication problems

2020-10-18 Thread jing (Jira)
jing created HUDI-1347:
--

 Summary: Hbase index partition changes cause data duplication 
problems
 Key: HUDI-1347
 URL: https://issues.apache.org/jira/browse/HUDI-1347
 Project: Apache Hudi
  Issue Type: Bug
  Components: Index
Reporter: jing
Assignee: jing


1. A record repeatedly changes its partition. After the deduplication operation, the 
partition information in the HoodieRecord key is inconsistent with the data.

E.g:

id,oid,name,dt,isdeleted,lastupdatedttm,rowkey
9,1,,2018,0,2020-02-17 00:50:25.01,00_test1-9-1
9,1,,2019,0,2020-02-17 00:50:25.02,00_test1-9-1

rowkey is the primary key and dt is the partition column. After deduplication, the 
key of the HoodieRecord object is (00_test1-9-1,2018), but it should be 
(00_test1-9-1,2019); a minimal sketch of the intended behavior follows.
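
A minimal sketch of that dedup behavior (hypothetical Rec type, not Hudi's 
HoodieRecord API): collapse duplicates by record key, keep the version with the 
latest lastupdatedttm, and carry that same version's partition path so the key and 
the data stay consistent.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
  // Hypothetical stand-in for HoodieRecord, for illustration only.
  static class Rec {
    final String recordKey;
    final String partitionPath;   // e.g. dt=2018 or dt=2019
    final String lastUpdatedTtm;  // precombine field
    Rec(String recordKey, String partitionPath, String lastUpdatedTtm) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
      this.lastUpdatedTtm = lastUpdatedTtm;
    }
  }

  // Keep the latest version per record key, carrying that version's partition path
  // along with it, so the key ends up as (00_test1-9-1,2019) rather than
  // (00_test1-9-1,2018) in the example above.
  static List<Rec> dedup(List<Rec> input) {
    Map<String, Rec> latest = new HashMap<>();
    for (Rec r : input) {
      latest.merge(r.recordKey, r,
          (a, b) -> a.lastUpdatedTtm.compareTo(b.lastUpdatedTtm) >= 0 ? a : b);
    }
    return new ArrayList<>(latest.values());
  }
}
{code}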

2. An exception in the hudi task can leave the hbase index written successfully 
while the task itself fails. When the task is retried, the partition-change data is 
treated as a plain new insert, and the data from before the partition change is not 
deleted.

Solution:

1. Fix the wrong partition information in the HoodieRecord key caused by the 
deduplication operation.

2. Add a rollback operation to the hbase index instead of doing nothing; on a 
partition change, the index needs to be rolled back to the last successful commit.

3. Add test cases.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1184) Support updatePartitionPath for HBaseIndex

2020-08-16 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing reassigned HUDI-1184:
--

Assignee: jing  (was: Ryan Pifer)

> Support updatePartitionPath for HBaseIndex
> --
>
> Key: HUDI-1184
> URL: https://issues.apache.org/jira/browse/HUDI-1184
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 0.6.1
>Reporter: sivabalan narayanan
>Assignee: jing
>Priority: Major
>
> In implicit global indexes, we have a config named updatePartitionPath. When 
> an already existing record is upserted to a new partition (compared to where 
> it is in storage), if the config is set to true, the record is inserted into the 
> new partition and deleted from the old partition. If the config is set to false, 
> the record is upserted into the old partition, ignoring the new partition.
>  
> Don't think we have this fix for HBase. We need similar support in HBase too.
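
For reference, a minimal sketch of the tagging decision described above, with 
made-up types (not Hudi's HBaseIndex API): when an incoming record targets a 
different partition than the stored one, updatePartitionPath=true emits a delete for 
the old partition plus an insert into the new one, while false routes the update 
back to the old partition.
{code:java}
import java.util.Arrays;
import java.util.List;

public class UpdatePartitionPathSketch {
  // Hypothetical record holder, for illustration only.
  static class Rec {
    final String key; final String partition; final Object payload;
    Rec(String key, String partition, Object payload) {
      this.key = key; this.partition = partition; this.payload = payload;
    }
  }

  static List<Rec> tag(Rec incoming, String oldPartition, boolean updatePartitionPath) {
    if (!incoming.partition.equals(oldPartition) && updatePartitionPath) {
      Rec delete = new Rec(incoming.key, oldPartition, null);  // tombstone in the old partition
      Rec insert = new Rec(incoming.key, incoming.partition, incoming.payload);
      return Arrays.asList(delete, insert);
    }
    // Otherwise upsert in place, ignoring the new partition.
    return Arrays.asList(new Rec(incoming.key, oldPartition, incoming.payload));
  }
}
{code}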



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1184) Support updatePartitionPath for HBaseIndex

2020-08-16 Thread jing (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178703#comment-17178703
 ] 

jing commented on HUDI-1184:


I have fixed the problem on 0.5.2 and tested it, and I will port the change to the 
master branch later. Please assign this issue to me.

> Support updatePartitionPath for HBaseIndex
> --
>
> Key: HUDI-1184
> URL: https://issues.apache.org/jira/browse/HUDI-1184
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 0.6.1
>Reporter: sivabalan narayanan
>Assignee: Ryan Pifer
>Priority: Major
>
> In implicit global indexes, we have a config named updatePartitionPath. When 
> an already existing record is upserted to a new partition (compared to where 
> it is in storage), if the config is set to true, the record is inserted into the 
> new partition and deleted from the old partition. If the config is set to false, 
> the record is upserted into the old partition, ignoring the new partition.
>  
> Don't think we have this fix for HBase. We need similar support in HBase too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-06 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-289:
--
Issue Type: Test  (was: Bug)

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-to-end test which runs some workload for a 
> few hours at least, triggers various actions like commit, deltacommit, rollback, 
> and compaction, and ensures correctness of the code before every release.
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1000) incremental query for COW non-partitioned table no data

2020-06-06 Thread jing (Jira)
jing created HUDI-1000:
--

 Summary: incremental query for COW non-partitioned table no data
 Key: HUDI-1000
 URL: https://issues.apache.org/jira/browse/HUDI-1000
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: jing
Assignee: jing
 Attachments: 设置前后对比.png

Incremental queries on a partitioned table work normally, but on a non-partitioned 
table they return no data.

These are my commit times:

/tmp/test_commit/.hoodie/20200603154154.commit
 /tmp/test_commit/.hoodie/20200603154224.commit
 /tmp/test_commit/.hoodie/20200603154253.commit
 /tmp/test_commit/.hoodie/20200603202911.commit

 

This is the hive config:

set hoodie.test_commit.consume.mode=INCREMENTAL;
set hoodie.test_commit.consume.start.timestamp=20200603154154;
set hoodie.test_commit.consume.max.commits=1;

 

After applying these settings, compare the before/after screenshots in the attachments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-06 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-289:
--
Issue Type: Bug  (was: Test)

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-to-end test which runs some workload for a 
> few hours at least, triggers various actions like commit, deltacommit, rollback, 
> and compaction, and ensures correctness of the code before every release.
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1000) incremental query for COW non-partitioned table no data

2020-06-06 Thread jing (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127268#comment-17127268
 ] 

jing commented on HUDI-1000:


PR: [https://github.com/apache/hudi/pull/1708]

 

Please refer to the attachment for an example of the query result after the fix.

!修复后的查询结果.png!

 

> incremental query for COW non-partitioned table no data
> ---
>
> Key: HUDI-1000
> URL: https://issues.apache.org/jira/browse/HUDI-1000
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
> Attachments: 修复后的查询结果.png, 设置前后对比.png
>
>
> Incremental queries on a partitioned table work normally, but on a 
> non-partitioned table they return no data.
> These are my commit times:
> /tmp/test_commit/.hoodie/20200603154154.commit
>  /tmp/test_commit/.hoodie/20200603154224.commit
>  /tmp/test_commit/.hoodie/20200603154253.commit
>  /tmp/test_commit/.hoodie/20200603202911.commit
>  
> This is the hive config:
> set hoodie.test_commit.consume.mode=INCREMENTAL;
> set hoodie.test_commit.consume.start.timestamp=20200603154154;
> set hoodie.test_commit.consume.max.commits=1;
>  
> After applying these settings, compare the before/after screenshots in the attachments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1000) incremental query for COW non-partitioned table no data

2020-06-06 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-1000:
---
Attachment: 修复后的查询结果.png

> incremental query for COW non-partitioned table no data
> ---
>
> Key: HUDI-1000
> URL: https://issues.apache.org/jira/browse/HUDI-1000
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: jing
>Assignee: jing
>Priority: Major
> Attachments: 修复后的查询结果.png, 设置前后对比.png
>
>
> Incremental queries on a partitioned table work normally, but on a 
> non-partitioned table they return no data.
> These are my commit times:
> /tmp/test_commit/.hoodie/20200603154154.commit
>  /tmp/test_commit/.hoodie/20200603154224.commit
>  /tmp/test_commit/.hoodie/20200603154253.commit
>  /tmp/test_commit/.hoodie/20200603202911.commit
>  
> This is the hive config:
> set hoodie.test_commit.consume.mode=INCREMENTAL;
> set hoodie.test_commit.consume.start.timestamp=20200603154154;
> set hoodie.test_commit.consume.max.commits=1;
>  
> After applying these settings, compare the before/after screenshots in the attachments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-733) presto query data error

2020-03-25 Thread jing (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jing updated HUDI-733:
--
Description: 
We found a column-order issue in Hudi when we use the API to import data (use 
spark.read.json("filename") to read into a dataframe, then write to hudi). The 
original data is rowkey:1 dt:2 time:3.

But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
time:2), while Hive returns them correctly.

After analysis: when dt is used as the partition column, it is also written into the 
parquet file as dt=xxx, and the value of the partition column should be the value in 
the hudi path. However, the Presto query maps values to the parquet columns strictly 
by position; it does not match them by column name.

Workarounds and suggestions:
 # Can the inputformat class skip reading the value of the partition column dt from 
parquet?
 # Can the hive data be synchronized without dt as a partition column? Consider 
adding a column such as repl_dt as the partition column and keeping dt as an 
ordinary field.
 # The dt column is not written to the parquet file.
 # dt is written to the parquet file, but as the last column.

 

[~bhasudha]

  was:
We found a column-order issue in Hudi when we use the API to import data (use 
spark.read.json("filename") to read into a dataframe, then write to hudi). The 
original data is rowkey:1 dt:2 time:3.

But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
time:2), while Hive returns them correctly.

After analysis: when dt is used as the partition column, it is also written into the 
parquet file as dt=xxx, and the value of the partition column should be the value in 
the hudi path. However, the Presto query maps values to the parquet columns strictly 
by position; it does not match them by column name.

Workarounds and suggestions:
 # Can the inputformat class skip reading the value of the partition column dt from 
parquet?
 # Can the hive data be synchronized without dt as a partition column? Consider 
adding a column such as repl_dt as the partition column and keeping dt as an 
ordinary field.
 # The dt column is not written to the parquet file.
 # dt is written to the parquet file, but as the last column.

 

@Sudha


> presto query data error
> ---
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.5.1
>Reporter: jing
>Priority: Major
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
> presto_query_data.png
>
>
> We found a column-order issue in Hudi when we use the API to import data (use 
> spark.read.json("filename") to read into a dataframe, then write to hudi). The 
> original data is rowkey:1 dt:2 time:3.
> But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
> time:2), while Hive returns them correctly.
> After analysis: when dt is used as the partition column, it is also written into 
> the parquet file as dt=xxx, and the value of the partition column should be the 
> value in the hudi path. However, the Presto query maps values to the parquet 
> columns strictly by position; it does not match them by column name.
> Workarounds and suggestions:
>  # Can the inputformat class skip reading the value of the partition column dt 
> from parquet?
>  # Can the hive data be synchronized without dt as a partition column? Consider 
> adding a column such as repl_dt as the partition column and keeping dt as an 
> ordinary field.
>  # The dt column is not written to the parquet file.
>  # dt is written to the parquet file, but as the last column.
>  
> [~bhasudha]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-733) presto query data error

2020-03-25 Thread jing (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066417#comment-17066417
 ] 

jing commented on HUDI-733:
---

[~bhasudha], could you please help with this?

> presto query data error
> ---
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.5.1
>Reporter: jing
>Priority: Major
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
> presto_query_data.png
>
>
> We found a column-order issue in Hudi when we use the API to import data (use 
> spark.read.json("filename") to read into a dataframe, then write to hudi). The 
> original data is rowkey:1 dt:2 time:3.
> But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
> time:2), while Hive returns them correctly.
> After analysis: when dt is used as the partition column, it is also written into 
> the parquet file as dt=xxx, and the value of the partition column should be the 
> value in the hudi path. However, the Presto query maps values to the parquet 
> columns strictly by position; it does not match them by column name.
> Workarounds and suggestions:
>  # Can the inputformat class skip reading the value of the partition column dt 
> from parquet?
>  # Can the hive data be synchronized without dt as a partition column? Consider 
> adding a column such as repl_dt as the partition column and keeping dt as an 
> ordinary field.
>  # The dt column is not written to the parquet file.
>  # dt is written to the parquet file, but as the last column.
>  
> @Sudha



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-733) presto query data error

2020-03-25 Thread jing (Jira)
jing created HUDI-733:
-

 Summary: presto query data error
 Key: HUDI-733
 URL: https://issues.apache.org/jira/browse/HUDI-733
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Presto Integration
Affects Versions: 0.5.1
Reporter: jing
 Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
presto_query_data.png

We found a column-order issue in Hudi when we use the API to import data (use 
spark.read.json("filename") to read into a dataframe, then write to hudi). The 
original data is rowkey:1 dt:2 time:3.

But the values are unexpected when querying the data through Presto (rowkey:2 dt:1 
time:2), while Hive returns them correctly.

After analysis: when dt is used as the partition column, it is also written into the 
parquet file as dt=xxx, and the value of the partition column should be the value in 
the hudi path. However, the Presto query maps values to the parquet columns strictly 
by position; it does not match them by column name.

Workarounds and suggestions:
 # Can the inputformat class skip reading the value of the partition column dt from 
parquet?
 # Can the hive data be synchronized without dt as a partition column? Consider 
adding a column such as repl_dt as the partition column and keeping dt as an 
ordinary field (see the sketch below).
 # The dt column is not written to the parquet file.
 # dt is written to the parquet file, but as the last column.
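
Regarding suggestion 2 above, a sketch of what that could look like on the write 
side (not a verified workaround; the option keys are standard Hudi datasource 
configs, the rest is illustrative): derive a separate repl_dt column for 
partitioning so dt stays an ordinary field inside the parquet files.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

public class ReplDtPartitionSketch {
  static void write(Dataset<Row> df, String basePath) {
    df.withColumn("repl_dt", col("dt"))            // copy dt into a dedicated partition column
      .write()
      .format("hudi")                              // "org.apache.hudi" on older releases
      .option("hoodie.datasource.write.recordkey.field", "rowkey")
      .option("hoodie.datasource.write.partitionpath.field", "repl_dt")
      .option("hoodie.datasource.hive_sync.partition_fields", "repl_dt")
      .mode(SaveMode.Append)
      .save(basePath);
  }
}
{code}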

 

@Sudha



--
This message was sent by Atlassian Jira
(v8.3.4#803005)