[jira] [Closed] (HUDI-1264) incremental read support with replace

2021-11-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1264.
---

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For the initial version, we could simply fail incremental reads if there is a 
> REPLACE instant. 
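
A minimal sketch of that guard (my reading of the description above, not the actual patch; exact timeline accessors vary slightly across Hudi versions), assuming a HoodieTableMetaClient `metaClient` and the begin/end instant times of the incremental query:

{code:java}
import org.apache.hudi.common.table.timeline.HoodieTimeline

// Hedged sketch: abort the incremental read if any completed instant in the
// requested range is a REPLACE instant (e.g. written by clustering).
val replaced = metaClient.getActiveTimeline
  .getCommitsTimeline.filterCompletedInstants()
  .findInstantsInRange(beginTime, endTime)
  .filter(instant => instant.getAction == HoodieTimeline.REPLACE_COMMIT_ACTION)

if (!replaced.empty()) {
  throw new UnsupportedOperationException(
    "Incremental read across a REPLACE instant is not supported yet")
}
{code}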



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1264) incremental read support with replace

2021-11-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1264:

Status: Closed  (was: Patch Available)

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For the initial version, we could simply fail incremental reads if there is a 
> REPLACE instant. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HUDI-1264) incremental read support with replace

2021-11-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1264.
-

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For the initial version, we could simply fail incremental reads if there is a 
> REPLACE instant. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Reopened] (HUDI-1264) incremental read support with replace

2021-11-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-1264:
-

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For the initial version, we could simply fail incremental reads if there is a 
> REPLACE instant. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-1307) spark datasource load path format is confused for snapshot and increment read mode

2021-09-26 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420424#comment-17420424
 ] 

liwei commented on HUDI-1307:
-

[~xushiyan] hello, recently I have been focusing on ingesting Kafka data into 
Hudi with clustering, with the online & offline analytics workloads scheduled 
on k8s, so this issue has not been updated. 

I think we can keep the glob path pattern around, but the incremental mode and 
snapshot mode can be unified. :D

> spark datasource load path format is confused for snapshot and increment read 
> mode
> --
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Critical
>  Labels: sev:high, user-support-issues
>
> When the Spark datasource reads a Hudi table:
> 1. snapshot mode
> {code:java}
>  val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
> {code}
> You should add "/*", otherwise the read will fail: 
> org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and 
> without "/*" it will not pick up the .hoodie and default directories.
> {code:java}
>  val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
> {code}
>  
> 2. incremental mode
> Both basePath and basePath + "/*" are OK, because 
> DataSourceUtils.getTablePath in org.apache.hudi.DefaultSource supports both 
> formats.
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath)
> {code}
>  
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath + "/*")
> {code}
>  
> Since the incremental mode and snapshot mode do not coincide, users get 
> confused. Loading with basePath + "/*" or "/**/*" is also confusing; I know 
> this is there to support partitioning.
> But I think this API would be clearer for users:
> {code:java}
>  val partition = "year = '2019'"
>  spark.read.format("hudi").load(path).where(partition)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-52) Implement Savepoints for Merge On Read table #88

2021-08-26 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405053#comment-17405053
 ] 

liwei edited comment on HUDI-52 at 8/26/21, 8:21 AM:
-

[~vinoth] hello, we are going to build table backup and recovery on top of 
savepoints and a WAL, but MOR mode does not support savepoints yet. Are there 
any blocking problems around this? :)


was (Author: 309637554):
[~vinoth] hello, we are going to build table backup and recovery on top of 
savepoints and a WAL, but we discovered that MOR mode does not support 
savepoints yet. Are there any blocking problems around this? :)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Major
>  Labels: help-requested, starter
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-52) Implement Savepoints for Merge On Read table #88

2021-08-26 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405053#comment-17405053
 ] 

liwei commented on HUDI-52:
---

[~vinoth] hello, we are going to build table backup and recovery on top of 
savepoints and a WAL, but we discovered that MOR mode does not support 
savepoints yet. Are there any blocking problems around this? :)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Major
>  Labels: help-requested, starter
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2355) after clustering with archive meet data incorrect

2021-08-24 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-2355:
---

Assignee: liwei

> after clustering with archive  meet data incorrect
> --
>
> Key: HUDI-2355
> URL: https://issues.apache.org/jira/browse/HUDI-2355
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> After [https://github.com/apache/hudi/pull/3310], replaced data files are 
> deleted during clean. But if the replacecommit file has been deleted, clean 
> can no longer read the replaced data file metadata. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2355) after clustering with archive meet data incorrect

2021-08-24 Thread liwei (Jira)
liwei created HUDI-2355:
---

 Summary: after clustering with archive  meet data incorrect
 Key: HUDI-2355
 URL: https://issues.apache.org/jira/browse/HUDI-2355
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei


After [https://github.com/apache/hudi/pull/3310], replaced data files are 
deleted during clean. But if the replacecommit file has been deleted, clean can 
no longer read the replaced data file metadata. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2354) archive delete replacecommit, but stop timeline server meet file not found

2021-08-24 Thread liwei (Jira)
liwei created HUDI-2354:
---

 Summary: archive delete replacecommit, but stop timeline server 
meet file not found
 Key: HUDI-2354
 URL: https://issues.apache.org/jira/browse/HUDI-2354
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei


1. In the Spark write client, the post-commit step archives replacecommit 
instants that meet the archival requirement:

21/08/23 14:57:12 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
file .hoodie/20210823114552.commit
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
file .hoodie/20210823114553.replacecommit.requested
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
file .hoodie/20210823114553.replacecommit.inflight
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
file .hoodie/20210823114553.replacecommit

 

2. If you start the timeline service, it is stopped after the Spark SQL write 
post-commit. HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106) 
needs to read the replace instant metadata, but the replace instant file has 
been deleted while the timeline was not updated:

 

org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hudi.exception.HoodieIOException: Could not read commit 
details from .hoodie/20210823114553.replacecommit
at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:555)
at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:219)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217)
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353)
at 
java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
at 
org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118)
at 
org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:207)
at 
org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:121)
at 
org.apache.hudi.client.AbstractHoodieClient.stopEmbeddedServerView(AbstractHoodieClient.java:94)
at 
org.apache.hudi.client.AbstractHoodieClient.close(AbstractHoodieClient.java:86)
at 
org.apache.hudi.client.AbstractHoodieWriteClient.close(AbstractHoodieWriteClient.java:1094)
at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:509)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:226)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at 

[jira] [Assigned] (HUDI-2354) archive delete replacecommit, but stop timeline server meet file not found

2021-08-24 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-2354:
---

Assignee: liwei

> archive delete replacecommit, but stop timeline server meet file not found
> --
>
> Key: HUDI-2354
> URL: https://issues.apache.org/jira/browse/HUDI-2354
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> 1. In the Spark write client, the post-commit step archives replacecommit 
> instants that meet the archival requirement:
> 21/08/23 14:57:12 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
> file .hoodie/20210823114552.commit
> 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
> file .hoodie/20210823114553.replacecommit.requested
> 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
> file .hoodie/20210823114553.replacecommit.inflight
> 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant 
> file .hoodie/20210823114553.replacecommit
>  
> 2. If you start the timeline service, it is stopped after the Spark SQL write 
> post-commit. 
> HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106) needs to 
> read the replace instant metadata, but the replace instant file has been 
> deleted while the timeline was not updated:
>  
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> Caused by: org.apache.hudi.exception.HoodieIOException: Could not read commit 
> details from .hoodie/20210823114553.replacecommit
> at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:555)
> at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:219)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217)
> at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
> at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
> at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353)
> at 
> java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
> at 
> org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118)
> at 
> org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:207)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:121)
> at 
> org.apache.hudi.client.AbstractHoodieClient.stopEmbeddedServerView(AbstractHoodieClient.java:94)
> at 
> org.apache.hudi.client.AbstractHoodieClient.close(AbstractHoodieClient.java:86)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.close(AbstractHoodieWriteClient.java:1094)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:509)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:226)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> 

[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug

2021-08-24 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-2301:

Status: In Progress  (was: Open)

> fix FileSliceMetrics utils bug
> --
>
> Key: HUDI-2301
> URL: https://issues.apache.org/jira/browse/HUDI-2301
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: WangZhongze
>Priority: Major
>  Labels: pull-request-available
>
> Fix a bug in the metrics calculation: in the original code, the totalReadIO 
> and totalWriteIO calculations only account for the size of one of the 
> FileSlices.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug

2021-08-24 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-2301:

Status: Closed  (was: Patch Available)

> fix FileSliceMetrics utils bug
> --
>
> Key: HUDI-2301
> URL: https://issues.apache.org/jira/browse/HUDI-2301
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: WangZhongze
>Priority: Major
>  Labels: pull-request-available
>
> Fix a bug in the metrics calculation: in the original code, the totalReadIO 
> and totalWriteIO calculations only account for the size of one of the 
> FileSlices.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug

2021-08-24 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-2301:

Status: Patch Available  (was: In Progress)

> fix FileSliceMetrics utils bug
> --
>
> Key: HUDI-2301
> URL: https://issues.apache.org/jira/browse/HUDI-2301
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: WangZhongze
>Priority: Major
>  Labels: pull-request-available
>
> Fix a bug in the metrics calculation: in the original code, the totalReadIO 
> and totalWriteIO calculations only account for the size of one of the 
> FileSlices.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2301) fix FileSliceMetrics utils bug

2021-08-12 Thread liwei (Jira)
liwei created HUDI-2301:
---

 Summary: fix FileSliceMetrics utils bug
 Key: HUDI-2301
 URL: https://issues.apache.org/jira/browse/HUDI-2301
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei
Assignee: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2300) add ClusteringPlanStrategy unit test

2021-08-12 Thread liwei (Jira)
liwei created HUDI-2300:
---

 Summary: add ClusteringPlanStrategy unit test
 Key: HUDI-2300
 URL: https://issues.apache.org/jira/browse/HUDI-2300
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei
Assignee: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1468) incremental read support with clustering

2021-07-09 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377966#comment-17377966
 ] 

liwei commented on HUDI-1468:
-

[~vinoth] hello, has [https://github.com/apache/hudi/pull/3139] landed? If so, 
can this issue be closed?

> incremental read support with clustering
> 
>
> Key: HUDI-1468
> URL: https://issues.apache.org/jira/browse/HUDI-1468
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Affects Versions: 0.9.0
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> As part of clustering, metadata such as hoodie_commit_time changes for 
> records that are clustered. This is specific to the 
> SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to 
> carry the commit_time from the original record to support incremental queries.
> Also, incremental queries don't work with the 'replacecommit' action used by 
> clustering (HUDI-1264). Change incremental queries to work for replacecommits 
> created by clustering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1042) [Umbrella] Support clustering on filegroups

2021-06-26 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1042:
---

Assignee: liwei  (was: leesf)

> [Umbrella] Support clustering on filegroups
> ---
>
> Key: HUDI-1042
> URL: https://issues.apache.org/jira/browse/HUDI-1042
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.9.0
>
>
> please see 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1043) Support clustering in CoW mode

2021-06-26 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1043:
---

Assignee: liwei  (was: leesf)

> Support clustering in CoW mode
> --
>
> Key: HUDI-1043
> URL: https://issues.apache.org/jira/browse/HUDI-1043
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>
> updates are not allowed during clustering



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-06-14 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363315#comment-17363315
 ] 

liwei commented on HUDI-1138:
-

[~guoyihua] thanks.

“We may consider blocking the requests for batching so that the timeline server 
sends the actual responses only after MARKERS are overwritten / updated.”

If we wait for the batched requests to be overwritten/updated successfully, a 
create-marker request from a Spark task may wait a long time: e.g. the 200ms 
batching interval plus one read-and-overwrite of the marker files.

Do you have a plan for how the marker file will be updated?
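
A minimal sketch of the batching flow under discussion (hypothetical names such as `MarkerBatcher` and `overwriteMarkersFile`, not the actual timeline-server code), showing where that latency comes from: each create-marker call completes only after the next flush rewrites the consolidated MARKERS file.

{code:java}
import java.util.concurrent.{CompletableFuture, ConcurrentLinkedQueue, Executors, TimeUnit}

// Hedged sketch: queue create-marker requests and complete their futures only
// after a scheduled flush has overwritten the single MARKERS file, so a caller
// waits up to flushIntervalMs plus one read+overwrite of the file.
class MarkerBatcher(flushIntervalMs: Long) {
  private val pending = new ConcurrentLinkedQueue[(String, CompletableFuture[Void])]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleWithFixedDelay(() => flush(), flushIntervalMs, flushIntervalMs,
    TimeUnit.MILLISECONDS)

  def createMarker(markerName: String): CompletableFuture[Void] = {
    val future = new CompletableFuture[Void]()
    pending.add((markerName, future))
    future // completes only after the next flush
  }

  private def flush(): Unit = {
    var batch = List.empty[(String, CompletableFuture[Void])]
    var next = pending.poll()
    while (next != null) { batch ::= next; next = pending.poll() }
    if (batch.nonEmpty) {
      overwriteMarkersFile(batch.map(_._1)) // one read-modify-overwrite of MARKERS
      batch.foreach(_._2.complete(null))
    }
  }

  // hypothetical: rewrite the consolidated MARKERS file with the new names added
  private def overwriteMarkersFile(names: List[String]): Unit = ()
}
{code}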

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism could help us deal 
> with failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file. I.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-06-13 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362535#comment-17362535
 ] 

liwei commented on HUDI-1138:
-

[~guoyihua] [~vinoth] :D hello, I have a question:
 * "When using S3, use overwrite operation for MARKERS file, and batch requests 
within an interval, say a few hundred milliseconds (configurable)."

If the timeline server crashes before it overwrites the MARKERS file with the 
latest batch of requests, won't the files from the latest batch be left 
without rollback?

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism could help us deal 
> with failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file. I.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1768) spark datasource support schema validate add column

2021-05-16 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1768.
-
Resolution: Fixed

> spark datasource support schema validate add column 
> 
>
> Key: HUDI-1768
> URL: https://issues.apache.org/jira/browse/HUDI-1768
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>
> The Spark datasource currently does not support setting an Avro column 
> default value.
> Instead it sets the column to nullable and uses SchemaConverters.toAvroType 
> to transform it into a union type containing null, such as:
> Registered avro schema : {
>  "type" : "record",
>  "name" : "hoodie_test_record",
>  "namespace" : "hoodie.hoodie_test",
>  "fields" : [ {
>  "name" : "_row_key",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "name",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "timestamp",
>  "type" : [ "int", "null" ]
>  }, {
>  "name" : "partition",
>  "type" : [ "int", "null" ]
>  } ]
> }
>  
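
For comparison, a hedged sketch (mine, not Hudi or Spark code) of what a column with a usable default looks like in Avro, built with Avro's SchemaBuilder: when the default is null, the "null" branch must come first in the union, unlike the ["string", "null"] unions shown above.

{code:java}
import org.apache.avro.SchemaBuilder

// Hedged illustration: "name" gets an explicit null default, with the union
// ordered ["null", "string"] as Avro requires for a null default value.
val schemaWithDefault = SchemaBuilder.record("hoodie_test_record")
  .namespace("hoodie.hoodie_test")
  .fields()
  .name("_row_key").`type`().stringType().noDefault()
  .name("name").`type`().unionOf().nullType().and().stringType().endUnion().nullDefault()
  .endRecord()
{code}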



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1295) RFC-15: Track bloom filters as a part of metadata table

2021-05-04 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338774#comment-17338774
 ] 

liwei commented on HUDI-1295:
-

[~vinoth] got it. Thanks 

> RFC-15: Track bloom filters as a part of metadata table
> ---
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.9.0
>
>
> The idea here is to maintain our bloom filters outside of parquet for 
> speedier access from the bloom index. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1295) RFC-15: Track bloom filters as a part of metadata table

2021-05-01 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337947#comment-17337947
 ] 

liwei commented on HUDI-1295:
-

[~vinoth] hello, do we also plan to land this in 
[RFC-27|https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance],
 or can it land using the metadata table now? :D

> RFC-15: Track bloom filters as a part of metadata table
> ---
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.9.0
>
>
> The idea here is to maintain our bloom filters outside of parquet for 
> speedier access from the bloom index. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-04-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329158#comment-17329158
 ] 

liwei commented on HUDI-1138:
-

[~vinoth] ok, I will move the discussion to RFC-27. :D

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism could help us deal 
> with failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file. I.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-04-21 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327052#comment-17327052
 ] 

liwei commented on HUDI-1138:
-

[~vinoth] thanks 

1. I have an idea: can we write the marker updates to the metadata table in 
the timeline server, so that the meta info is unified in the metadata table?

2. Rollback is not a frequent action now, so we should PoC the performance 
first.

3. I have also been researching RFC-27 recently. I think we could unify 
metadata such as partitions, marker files, statistics, indexes and others, 
just as Delta Lake stores these in the delta log and Snowflake uses a 
metaservice. A unified metadata table can address the poor metadata management 
of cloud storage and improve compute/storage query performance. I think 
RFC-27, RFC-15 and RFC-08 have some overlaps. I'd like to discuss this with 
you! Thanks 

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism could help us deal 
> with failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file. I.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-04-18 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324649#comment-17324649
 ] 

liwei commented on HUDI-1138:
-

[~uditme] [~vinoth] I also think listing will be a performance improvement 
point: on cloud storage such as S3 and Alibaba Cloud OSS, listing is expensive 
and slow. 

Can we use the mechanism from the P.S. below

"P.S: I was tempted to think the Spark listener mechanism could help us deal 
with failed tasks, but it has no guarantees: the writer job could die without 
deleting a partial file. I.e. it can improve things, but can't provide 
guarantees."

and delete the residual files during clean?

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism could help us deal 
> with failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file. I.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction

2021-04-18 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-897:
---
Status: In Progress  (was: Open)

> hudi support log append scenario with better write and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Affects Versions: 0.9.0
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi's current situation
> At present, hudi supports incrementally writing database CDC data into hudi 
> quite well, and bulk-loading files into hudi is also in progress. However, 
> there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletes, and are dominated by 
> small files); inserts can currently be written without deduplication, but 
> they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch costs some time to merge, which reduces write throughput.  
>  * This scenario is not suitable for merge-on-read. 
>  * The actual scenario only needs to write parquet files in batches when 
> writing, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; keep merging enabled by default, but let the user 
> disable auto-merge for more write throughput.  
> 2. Hudi should support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command [2].
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
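
A hedged sketch of the write-side setup described above (the option keys are real Hudi configs, but the values and their combination are only illustrative, and exact clustering knobs vary by version):

{code:java}
// Illustrative only: pure-append writes with small-file merging disabled,
// leaving small-file compaction to an asynchronous clustering job.
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.combine.before.insert", "false").   // no dedup before insert
  option("hoodie.parquet.small.file.limit", "0").    // 0 disables small-file merge on write
  option("hoodie.clustering.async.enabled", "true"). // merge small files asynchronously
  mode("append").
  save(basePath)
{code}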



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction

2021-04-18 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324647#comment-17324647
 ] 

liwei commented on HUDI-897:


okay

> hudi support log append scenario with better write and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Affects Versions: 0.9.0
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi's current situation
> At present, hudi supports incrementally writing database CDC data into hudi 
> quite well, and bulk-loading files into hudi is also in progress. However, 
> there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletes, and are dominated by 
> small files); inserts can currently be written without deduplication, but 
> they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch costs some time to merge, which reduces write throughput.  
>  * This scenario is not suitable for merge-on-read. 
>  * The actual scenario only needs to write parquet files in batches when 
> writing, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; keep merging enabled by default, but let the user 
> disable auto-merge for more write throughput.  
> 2. Hudi should support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command [2].
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction

2021-04-18 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-897.

Resolution: Fixed

> hudi support log append scenario with better write and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Affects Versions: 0.9.0
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of a data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi's current situation
> At present, hudi supports incrementally writing database CDC data into hudi 
> quite well, and bulk-loading files into hudi is also in progress. However, 
> there is no good native support for log scenarios (which require 
> high-throughput writes, have no updates or deletes, and are dominated by 
> small files); inserts can currently be written without deduplication, but 
> they will still be merged on the write side.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, 
> every small batch costs some time to merge, which reduces write throughput.  
>  * This scenario is not suitable for merge-on-read. 
>  * The actual scenario only needs to write parquet files in batches when 
> writing, and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
> 1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; keep merging enabled by default, but let the user 
> disable auto-merge for more write throughput.  
> 2. Hudi should support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command [2].
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1674) add partition level delete DOC or example

2021-04-07 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316820#comment-17316820
 ] 

liwei commented on HUDI-1674:
-

[~shivnarayan] The Spark datasource does not have a delete-partition API; it 
needs to go through the catalog.

https://stackoverflow.com/questions/52531327/drop-partitions-from-spark

After [https://github.com/apache/hudi/pull/2645] lands, we can support 
'alter table xx drop partition (...)'.
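
For illustration, the Spark SQL form this would enable (table name and partition spec are hypothetical):

{code:java}
// Hypothetical example of the command referenced above, issued through Spark SQL:
spark.sql("ALTER TABLE hudi_table DROP PARTITION (dt = '2021-03-08')")
{code}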

 

> add partition level delete DOC or example
> -
>
> Key: HUDI-1674
> URL: https://issues.apache.org/jira/browse/HUDI-1674
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Priority: Minor
>  Labels: docs, user-support-issues
> Attachments: image-2021-03-08-09-57-05-768.png
>
>
> !image-2021-03-08-09-57-05-768.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1768) spark datasource support schema validate add column

2021-04-06 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1768:

Issue Type: Improvement  (was: Bug)

> spark datasource support schema validate add column 
> 
>
> Key: HUDI-1768
> URL: https://issues.apache.org/jira/browse/HUDI-1768
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> The Spark datasource currently does not support setting an Avro column 
> default value.
> Instead it sets the column to nullable and uses SchemaConverters.toAvroType 
> to transform it into a union type containing null, such as:
> Registered avro schema : {
>  "type" : "record",
>  "name" : "hoodie_test_record",
>  "namespace" : "hoodie.hoodie_test",
>  "fields" : [ {
>  "name" : "_row_key",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "name",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "timestamp",
>  "type" : [ "int", "null" ]
>  }, {
>  "name" : "partition",
>  "type" : [ "int", "null" ]
>  } ]
> }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1768) spark datasource support schema validate add column

2021-04-06 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1768:
---

Assignee: liwei

> spark datasource support schema validate add column 
> 
>
> Key: HUDI-1768
> URL: https://issues.apache.org/jira/browse/HUDI-1768
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> The Spark datasource currently does not support setting an Avro column 
> default value.
> Instead it sets the column to nullable and uses SchemaConverters.toAvroType 
> to transform it into a union type containing null, such as:
> Registered avro schema : {
>  "type" : "record",
>  "name" : "hoodie_test_record",
>  "namespace" : "hoodie.hoodie_test",
>  "fields" : [ {
>  "name" : "_row_key",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "name",
>  "type" : [ "string", "null" ]
>  }, {
>  "name" : "timestamp",
>  "type" : [ "int", "null" ]
>  }, {
>  "name" : "partition",
>  "type" : [ "int", "null" ]
>  } ]
> }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1768) spark datasource support schema validate add column

2021-04-06 Thread liwei (Jira)
liwei created HUDI-1768:
---

 Summary: spark datasource support schema validate add column 
 Key: HUDI-1768
 URL: https://issues.apache.org/jira/browse/HUDI-1768
 Project: Apache Hudi
  Issue Type: Bug
Reporter: liwei


The Spark datasource currently does not support setting an Avro column default 
value.

Instead it sets the column to nullable and uses SchemaConverters.toAvroType to 
transform it into a union type containing null, such as:

Registered avro schema : {
 "type" : "record",
 "name" : "hoodie_test_record",
 "namespace" : "hoodie.hoodie_test",
 "fields" : [ {
 "name" : "_row_key",
 "type" : [ "string", "null" ]
 }, {
 "name" : "name",
 "type" : [ "string", "null" ]
 }, {
 "name" : "timestamp",
 "type" : [ "int", "null" ]
 }, {
 "name" : "partition",
 "type" : [ "int", "null" ]
 } ]
}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1590) Support async clustering w/ test suite job

2021-03-08 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297421#comment-17297421
 ] 

liwei commented on HUDI-1590:
-

[~legendtkl] try it. :D

> Support async clustering w/ test suite job
> --
>
> Key: HUDI-1590
> URL: https://issues.apache.org/jira/browse/HUDI-1590
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: Kelu Tao
>Priority: Major
> Fix For: 0.8.0
>
>
> As of now, we only have inline clustering support w/ hoodie test suite job. 
> we need to add support for async clustering. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1674) add partition level delete DOC or example

2021-03-07 Thread liwei (Jira)
liwei created HUDI-1674:
---

 Summary: add partition level delete DOC or example
 Key: HUDI-1674
 URL: https://issues.apache.org/jira/browse/HUDI-1674
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei
 Attachments: image-2021-03-08-09-57-05-768.png

!image-2021-03-08-09-57-05-768.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2021-03-01 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292942#comment-17292942
 ] 

liwei commented on HUDI-797:


[~pwason] Hello, I also hit this performance problem. When I ingest large logs 
it is very slow: 

HoodieCreateHandle.rewrite() writes just 2MB/s to the object store. Do we have 
any other way to solve this problem? :)

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all 
> fields in the incoming record and adds them to a new record, but it takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into notable improvements in compute efficiency. 
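
A minimal sketch of the per-record rewrite the description refers to (not the actual HoodieAvroUtils code): copy every field of the incoming record into a new record built against the HUDI schema, leaving the metadata fields to be populated later.

{code:java}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import scala.collection.JavaConverters._

// Hedged sketch: per record, look up each field of the target (HUDI) schema in
// the incoming record and copy it over; fields absent from the incoming schema
// (e.g. HUDI metadata columns) are left unset.
def rewriteRecord(incoming: GenericRecord, hudiSchema: Schema): GenericRecord = {
  val rewritten = new GenericData.Record(hudiSchema)
  for (field <- hudiSchema.getFields.asScala
       if incoming.getSchema.getField(field.name()) != null) {
    rewritten.put(field.name(), incoming.get(field.name()))
  }
  rewritten
}
{code}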



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1520) add configure for spark sql overwrite use replace

2021-01-11 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1520.
-
Resolution: Fixed

> add configure for spark sql overwrite use replace
> -
>
> Key: HUDI-1520
> URL: https://issues.apache.org/jira/browse/HUDI-1520
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1520) add configure for spark sql overwrite use replace

2021-01-10 Thread liwei (Jira)
liwei created HUDI-1520:
---

 Summary: add configure for spark sql overwrite use replace
 Key: HUDI-1520
 URL: https://issues.apache.org/jira/browse/HUDI-1520
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2021-01-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1399.
---

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2021-01-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1399.
-
Resolution: Fixed

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2021-01-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1399:

Status: Closed  (was: Patch Available)

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2021-01-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-1399:
-

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1516) refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java

2021-01-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1516:
---

Assignee: liwei

> refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java
> ---
>
> Key: HUDI-1516
> URL: https://issues.apache.org/jira/browse/HUDI-1516
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> I'm worried this is polluting the tests in HoodieDeltaStreamer; this test 
> case should be refactored after DeltaStreamer natively supports clustering: 
> https://github.com/apache/hudi/pull/2379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1516) refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java

2021-01-09 Thread liwei (Jira)
liwei created HUDI-1516:
---

 Summary: refactored testHoodieAsyncClusteringJob in 
TestHoodieDeltaStreamer.java
 Key: HUDI-1516
 URL: https://issues.apache.org/jira/browse/HUDI-1516
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: DeltaStreamer
Reporter: liwei


I'm worried this is polluting the tests in HoodieDeltaStreamer; this test case 
should be refactored after DeltaStreamer natively supports clustering:

https://github.com/apache/hudi/pull/2379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1482) async clustering for spark streaming

2021-01-04 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1482:

Status: Open  (was: New)

> async clustering for spark streaming
> 
>
> Key: HUDI-1482
> URL: https://issues.apache.org/jira/browse/HUDI-1482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1482) async clustering for spark streaming

2021-01-04 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1482:

Status: In Progress  (was: Open)

> async clustering for spark streaming
> 
>
> Key: HUDI-1482
> URL: https://issues.apache.org/jira/browse/HUDI-1482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1500) support incremental read clustering commit in deltastreamer

2020-12-30 Thread liwei (Jira)
liwei created HUDI-1500:
---

 Summary: support incremental read clustering  commit in 
deltastreamer
 Key: HUDI-1500
 URL: https://issues.apache.org/jira/browse/HUDI-1500
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: DeltaStreamer
Reporter: liwei


Currently DeltaSync.readFromSource() cannot read the last instant when it is a 
replace commit, such as one produced by clustering. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1481.
-
Resolution: Fixed

> support inline clustering unit tests for spark datasource and deltastreamer
> ---
>
> Key: HUDI-1481
> URL: https://issues.apache.org/jira/browse/HUDI-1481
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1354) Block updates and replace on file groups in clustering

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1354.
-
Resolution: Fixed

> Block updates and replace on file groups in clustering
> --
>
> Key: HUDI-1354
> URL: https://issues.apache.org/jira/browse/HUDI-1354
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1350.
-
Resolution: Fixed

> Support Partition level delete API in HUDI on top on Insert Overwrite
> -
>
> Key: HUDI-1350
> URL: https://issues.apache.org/jira/browse/HUDI-1350
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1354) Block updates and replace on file groups in clustering

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1354:

Status: Closed  (was: Patch Available)

> Block updates and replace on file groups in clustering
> --
>
> Key: HUDI-1354
> URL: https://issues.apache.org/jira/browse/HUDI-1354
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1354) Block updates and replace on file groups in clustering

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1354:

Status: Patch Available  (was: In Progress)

> Block updates and replace on file groups in clustering
> --
>
> Key: HUDI-1354
> URL: https://issues.apache.org/jira/browse/HUDI-1354
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1354) Block updates and replace on file groups in clustering

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-1354:
-

> Block updates and replace on file groups in clustering
> --
>
> Key: HUDI-1354
> URL: https://issues.apache.org/jira/browse/HUDI-1354
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1498) Always read clustering plan from requested file

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1498:

Status: Open  (was: New)

> Always read clustering plan from requested file
> ---
>
> Key: HUDI-1498
> URL: https://issues.apache.org/jira/browse/HUDI-1498
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> A clustering inflight instant doesn't have a 'ClusteringPlan'. Read the 
> content from the corresponding requested file to make updates work



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1350:

Status: Patch Available  (was: In Progress)

> Support Partition level delete API in HUDI on top on Insert Overwrite
> -
>
> Key: HUDI-1350
> URL: https://issues.apache.org/jira/browse/HUDI-1350
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1350:

Status: Closed  (was: Patch Available)

> Support Partition level delete API in HUDI on top on Insert Overwrite
> -
>
> Key: HUDI-1350
> URL: https://issues.apache.org/jira/browse/HUDI-1350
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite

2020-12-28 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-1350:
-

> Support Partition level delete API in HUDI on top on Insert Overwrite
> -
>
> Key: HUDI-1350
> URL: https://issues.apache.org/jira/browse/HUDI-1350
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1074.
---
Resolution: Fixed

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)
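A minimal sketch of this strategy with plain Spark APIs, purely for 
illustration (sortColumns, smallFilePaths and clusteredOutputPath are 
hypothetical names; Hudi's actual clustering executes through its own write 
path):

import org.apache.spark.sql.functions.col

// Sketch: i) sort records from N small files by the clustering column(s),
// ii) rewrite them as M larger files that respect the sort order (M < N).
val sortColumns = Seq("key")                      // hypothetical sort column(s)
val targetFileGroups = 4                          // M: number of larger output files
spark.read.parquet(smallFilePaths: _*)            // N small base files
  .repartitionByRange(targetFileGroups, sortColumns.map(col): _*)
  .sortWithinPartitions(sortColumns.map(col): _*) // each output file stays sort-ordered
  .write.parquet(clusteredOutputPath)             // conceptually, the new file groups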



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1074:

Status: Open  (was: New)

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253874#comment-17253874
 ] 

liwei edited comment on HUDI-1074 at 12/23/20, 3:54 AM:


[~satishkotha] Got it, I will do some performance testing soon and then decide 
whether we need to optimize it


was (Author: 309637554):
[~satishkotha] got it

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253874#comment-17253874
 ] 

liwei commented on HUDI-1074:
-

[~satishkotha] got it

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1042) [Umbrella] Support clustering on filegroups

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253814#comment-17253814
 ] 

liwei edited comment on HUDI-1042 at 12/23/20, 12:49 AM:
-

Okay, I still have some issues in progress. 


was (Author: 309637554):
okay

> [Umbrella] Support clustering on filegroups
> ---
>
> Key: HUDI-1042
> URL: https://issues.apache.org/jira/browse/HUDI-1042
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: leesf
>Assignee: leesf
>Priority: Major
> Fix For: 0.7.0
>
>
> please see 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253815#comment-17253815
 ] 

liwei commented on HUDI-1074:
-

Got it, I have not begun yet. Has [https://github.com/apache/hudi/pull/2263] 
already resolved the issue, or do we need to implement a more complete strategy? 

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1074) implement merge-sort based clustering strategy

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253815#comment-17253815
 ] 

liwei edited comment on HUDI-1074 at 12/23/20, 12:48 AM:
-

[~satishkotha] Got it, I have not begun yet. Has 
[https://github.com/apache/hudi/pull/2263] already resolved the issue, or do we 
need to implement a more complete strategy? 


was (Author: 309637554):
Got it, I have not begun yet. Has [https://github.com/apache/hudi/pull/2263] 
already resolved the issue, or do we need to implement a more complete strategy? 

> implement merge-sort based clustering strategy
> --
>
> Key: HUDI-1074
> URL: https://issues.apache.org/jira/browse/HUDI-1074
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> implement a merge-sort based clustering algorithm. Example: i) sort all small 
> files by specified column(s)  ii) merge N small files into M larger files by 
> respecting sort order (M < N)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1042) [Umbrella] Support clustering on filegroups

2020-12-22 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253814#comment-17253814
 ] 

liwei commented on HUDI-1042:
-

okay

> [Umbrella] Support clustering on filegroups
> ---
>
> Key: HUDI-1042
> URL: https://issues.apache.org/jira/browse/HUDI-1042
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: leesf
>Assignee: leesf
>Priority: Major
> Fix For: 0.7.0
>
>
> please see 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1487:

Description: 
 

 

TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
commit is added before the incremental read.

// pull the latest commit
 val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
 .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
 .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
 .load(basePath)

The new commit is:

// Upsert based on the written table with Hudi metadata columns
 val verificationRowKey = 
snapshotDF1.limit(1).select("_row_key").first.getString(0)

Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
fails with: "expected: <65> but was: <66>"

 

 

[https://travis-ci.com/github/apache/hudi/jobs/463879606]

 

org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
 at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
 ... 30 more
 [WARN ] 2020-12-22 12:32:40,788 
org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
instance used in previous test-run
 [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 
s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
 [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
Time elapsed: 15.275 s <<< FAILURE!
 org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
 at 
org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
 [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
 [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
 [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base 
File Only View.

  was:
 

 

TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
commit is added before the incremental read.

// pull the latest commit
val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
 .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
 .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
 .load(basePath)

The new commit is:

// Upsert based on the written table with Hudi metadata columns
val verificationRowKey = 
snapshotDF1.limit(1).select("_row_key").first.getString(0)

Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
fails with: "expected: <65> but was: <66>"

 

 

org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
 at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
 ... 30 more
[WARN ] 2020-12-22 12:32:40,788 
org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
instance used in previous test-run
[ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 
s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
[ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
Time elapsed: 15.275 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
 at 
org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
[INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
[WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
File 

[jira] [Updated] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1487:

Status: Open  (was: New)

> after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random
> --
>
> Key: HUDI-1487
> URL: https://issues.apache.org/jira/browse/HUDI-1487
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
>  
>  
> TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
> commit is added before the incremental read.
> // pull the latest commit
> val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
>  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
>  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
>  .load(basePath)
> The new commit is:
> // Upsert based on the written table with Hudi metadata columns
> val verificationRowKey = 
> snapshotDF1.limit(1).select("_row_key").first.getString(0)
> Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
> fails with: "expected: <65> but was: <66>"
>  
>  
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
>  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
>  ... 30 more
> [WARN ] 2020-12-22 12:32:40,788 
> org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
> instance used in previous test-run
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
> [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
> Time elapsed: 15.275 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
>  at 
> org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
> [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
> [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1487:

Status: In Progress  (was: Open)

> after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random
> --
>
> Key: HUDI-1487
> URL: https://issues.apache.org/jira/browse/HUDI-1487
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
>  
>  
> TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
> commit is added before the incremental read.
> // pull the latest commit
> val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
>  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
>  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
>  .load(basePath)
> The new commit is:
> // Upsert based on the written table with Hudi metadata columns
> val verificationRowKey = 
> snapshotDF1.limit(1).select("_row_key").first.getString(0)
> Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
> fails with: "expected: <65> but was: <66>"
>  
>  
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
>  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
>  ... 30 more
> [WARN ] 2020-12-22 12:32:40,788 
> org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
> instance used in previous test-run
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
> [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
> Time elapsed: 15.275 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
>  at 
> org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
> [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
> [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2020-12-22 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1487:

Description: 
 

 

TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
commit is added before the incremental read.

// pull the latest commit
val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
 .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
 .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
 .load(basePath)

The new commit is:

// Upsert based on the written table with Hudi metadata columns
val verificationRowKey = 
snapshotDF1.limit(1).select("_row_key").first.getString(0)

Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
fails with: "expected: <65> but was: <66>"

 

 

org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
 at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
 ... 30 more
[WARN ] 2020-12-22 12:32:40,788 
org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
instance used in previous test-run
[ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 
s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
[ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
Time elapsed: 15.275 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
 at 
org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
[INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
[WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base 
File Only View.
[WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base 
File Only View.

> after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random
> --
>
> Key: HUDI-1487
> URL: https://issues.apache.org/jira/browse/HUDI-1487
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
>  
>  
> TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
> commit is added before the incremental read.
> // pull the latest commit
> val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
>  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
>  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
>  .load(basePath)
> The new commit is:
> // Upsert based on the written table with Hudi metadata columns
> val verificationRowKey = 
> snapshotDF1.limit(1).select("_row_key").first.getString(0)
> Since verificationRowKey is then included in "uniqueKeyCnt", the assertion 
> fails with: "expected: <65> but was: <66>"
>  
>  
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
>  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
>  ... 30 more
> [WARN ] 2020-12-22 12:32:40,788 
> org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
> instance used in previous test-run
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
> [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
> Time elapsed: 15.275 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
>  at 
> org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
> [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
> [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
> [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading 

[jira] [Created] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2020-12-22 Thread liwei (Jira)
liwei created HUDI-1487:
---

 Summary: after HUDI-1376 merged unit test testCopyOnWriteStorage 
will failed random
 Key: HUDI-1487
 URL: https://issues.apache.org/jira/browse/HUDI-1487
 Project: Apache Hudi
  Issue Type: Bug
Reporter: liwei
Assignee: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1483) async clustering for deltastreamer

2020-12-20 Thread liwei (Jira)
liwei created HUDI-1483:
---

 Summary: async clustering for deltastreamer
 Key: HUDI-1483
 URL: https://issues.apache.org/jira/browse/HUDI-1483
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei
Assignee: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1482) async compaction for spark streaming

2020-12-20 Thread liwei (Jira)
liwei created HUDI-1482:
---

 Summary: async compaction for spark streaming
 Key: HUDI-1482
 URL: https://issues.apache.org/jira/browse/HUDI-1482
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: liwei
Assignee: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1482) async clustering for spark streaming

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1482:

Summary: async clustering for spark streaming  (was: async compaction for 
spark streaming)

> async clustering for spark streaming
> 
>
> Key: HUDI-1482
> URL: https://issues.apache.org/jira/browse/HUDI-1482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1399:

Status: In Progress  (was: Open)

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1399) support a independent clustering spark job to asynchronously clustering

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1399:

Summary: support a independent clustering spark job to asynchronously 
clustering   (was: support clustering operation can run asynchronously)

> support a independent clustering spark job to asynchronously clustering 
> 
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1481:

Status: In Progress  (was: Open)

> support inline clustering unit tests for spark datasource and deltastreamer
> ---
>
> Key: HUDI-1481
> URL: https://issues.apache.org/jira/browse/HUDI-1481
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1481:

Status: Open  (was: New)

> support inline clustering unit tests for spark datasource and deltastreamer
> ---
>
> Key: HUDI-1481
> URL: https://issues.apache.org/jira/browse/HUDI-1481
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1472) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1472.
---
Resolution: Fixed

> support inline clustering unit tests for spark datasource and deltastreamer
> ---
>
> Key: HUDI-1472
> URL: https://issues.apache.org/jira/browse/HUDI-1472
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-20 Thread liwei (Jira)
liwei created HUDI-1481:
---

 Summary: support inline clustering unit tests for spark datasource 
and deltastreamer
 Key: HUDI-1481
 URL: https://issues.apache.org/jira/browse/HUDI-1481
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: liwei
Assignee: liwei
 Fix For: 0.7.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1472) support inline clustering unit tests for spark datasource and deltastreamer

2020-12-20 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1472:

Summary: support inline clustering unit tests for spark datasource and 
deltastreamer  (was: support inline clustering for spark datasource)

> support inline clustering unit tests for spark datasource and deltastreamer
> ---
>
> Key: HUDI-1472
> URL: https://issues.apache.org/jira/browse/HUDI-1472
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1472) support inline clustering for spark datasource

2020-12-18 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1472:

Status: Open  (was: New)

> support inline clustering for spark datasource
> --
>
> Key: HUDI-1472
> URL: https://issues.apache.org/jira/browse/HUDI-1472
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1472) support inline clustering for spark datasource

2020-12-18 Thread liwei (Jira)
liwei created HUDI-1472:
---

 Summary: support inline clustering for spark datasource
 Key: HUDI-1472
 URL: https://issues.apache.org/jira/browse/HUDI-1472
 Project: Apache Hudi
  Issue Type: Task
  Components: Spark Integration
Reporter: liwei
Assignee: liwei
 Fix For: 0.7.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1399) support clustering operation can run asynchronously

2020-12-18 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1399:

Status: Open  (was: New)

> support clustering operation can run asynchronously
> ---
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1399) support clustering operation can run asynchronously

2020-12-16 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250765#comment-17250765
 ] 

liwei commented on HUDI-1399:
-

[~vinoth] is the code freeze Dec 31?

Just like asynchronous compaction, there are four options:
1. Option one: inline clustering in Spark. 
https://github.com/apache/hudi/pull/2263/files has the base implementation, 
but it does not yet support running in Spark [~satishkotha] 
2. Option two: an independent clustering Spark job that clusters 
asynchronously, just like HoodieCompactor
3. Option three: clustering support in the Hudi CLI 
4. Option four: clustering support in DeltaStreamer continuous mode

For functional coverage I think we can support option one and option two first 
(see the sketch below). Since https://github.com/apache/hudi/pull/2263/files 
has not been merged, I can land these two on the satishkotha:sk/clustering 
branch. I plan to do it this weekend and submit a PR next week. [~vinoth] what 
do you think? Does my plan conflict with yours? [~satishkotha] cc [~nagarwal] 
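As a concrete reference for option one, a minimal sketch of a Spark datasource 
write with inline clustering enabled; the config keys (hoodie.clustering.inline 
and hoodie.clustering.inline.max.commits) follow the clustering PRs and are an 
assumption here, as are df, basePath and the table name:

import org.apache.spark.sql.SaveMode

// Sketch (assumption: config keys as proposed in the clustering PRs).
// Inline clustering runs as part of the regular write, triggered after
// every N commits, instead of as a separate asynchronous job.
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_table")           // hypothetical table name
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.clustering.inline", "true")          // enable inline clustering
  .option("hoodie.clustering.inline.max.commits", "4") // schedule every 4 commits
  .mode(SaveMode.Append)
  .save(basePath)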

> support clustering operation can run asynchronously
> ---
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1399) support clustering operation can run asynchronously

2020-12-16 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250424#comment-17250424
 ] 

liwei commented on HUDI-1399:
-

[~vinoth] I plan to begin it around next week

> support clustering operation can run asynchronously
> ---
>
> Key: HUDI-1399
> URL: https://issues.apache.org/jira/browse/HUDI-1399
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1456) Concurrent Writing to Hudi tables

2020-12-13 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248711#comment-17248711
 ] 

liwei commented on HUDI-1456:
-

[~nishith29] Great, it is getting started. If there are some independent 
tasks, I'm happy to take them. :D

> Concurrent Writing to Hudi tables
> -
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.8.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for 
> Hudi tables. This work will be done in multiple phases. 
>  # Parallel writing to Hudi tables support -> This feature will allow users 
> to have multiple writers mutate the tables without the ability to perform 
> concurrent update to the same file. 
>  # Concurrency control at file/record level -> This feature will allow users 
> to have multiple writers mutate the tables with the ability to ensure 
> serializability at record level.
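For phase 1 (parallel writing), a hedged sketch of what enabling multi-writer 
support could look like on the Spark write path; all config keys below follow 
how this feature later shipped (optimistic concurrency control plus an external 
lock provider) and are assumptions relative to this ticket:

import org.apache.spark.sql.SaveMode

// Sketch: independent writers against the same table, coordinated via
// optimistic concurrency control. All keys are assumptions; df, basePath
// and the ZooKeeper endpoint are hypothetical.
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_table")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk-host") // hypothetical endpoint
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .mode(SaveMode.Append)
  .save(basePath)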



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1456) Concurrent Writing to Hudi tables

2020-12-13 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248711#comment-17248711
 ] 

liwei edited comment on HUDI-1456 at 12/14/20, 1:49 AM:


[~nishith29] Good news, it is getting started. If there are some independent 
tasks, I'm happy to take them. :D


was (Author: 309637554):
[~nishith29] Great, it is getting started. If there are some independent 
tasks, I'm happy to take them. :D

> Concurrent Writing to Hudi tables
> -
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.8.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for 
> Hudi tables. This work will be done in multiple phases. 
>  # Parallel writing to Hudi tables support -> This feature will allow users 
> to have multiple writers mutate the tables without the ability to perform 
> concurrent update to the same file. 
>  # Concurrency control at file/record level -> This feature will allow users 
> to have multiple writers mutate the tables with the ability to ensure 
> serializability at record level.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1456) Concurrent Writing to Hudi tables

2020-12-13 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1456:

Attachment: image-2020-12-14-09-48-46-946.png

> Concurrent Writing to Hudi tables
> -
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.8.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for 
> Hudi tables. This work will be done in multiple phases. 
>  # Parallel writing to Hudi tables support -> This feature will allow users 
> to have multiple writers mutate the tables without the ability to perform 
> concurrent update to the same file. 
>  # Concurrency control at file/record level -> This feature will allow users 
> to have multiple writers mutate the tables with the ability to ensure 
> serializability at record level.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1454) in unit test have error as Error reading clustering plan 006

2020-12-12 Thread liwei (Jira)
liwei created HUDI-1454:
---

 Summary: in unit test have error as  Error reading clustering plan 
006
 Key: HUDI-1454
 URL: https://issues.apache.org/jira/browse/HUDI-1454
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: liwei
Assignee: liwei


https://travis-ci.com/github/apache/hudi/jobs/458936905

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.245 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
[INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction
[WARN ] 2020-12-12 15:13:43,814 org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:13:50,370 org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:02,285 org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:08,596 org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:16,857 org.apache.hudi.common.util.ClusteringUtils  - No content found in requested file for instant [==>006__replacecommit__REQUESTED]
[WARN ] 2020-12-12 15:14:16,861 org.apache.hudi.common.util.ClusteringUtils  - No content found in requested file for instant [==>006__replacecommit__REQUESTED]
[ERROR] 2020-12-12 15:14:16,919 org.apache.hudi.timeline.service.FileSystemViewHandler  - Got runtime exception servicing request partition=2015%2F03%2F17=%2Ftmp%2Fjunit7781027189613842524%2Fdataset=005=ba1d2bb94a4b1d1e6e294e77086957b6c7c43b5a306e36cba6bbaa955a0ed8ce
org.apache.hudi.exception.HoodieIOException: Error reading clustering plan 006
 at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:85)
 at org.apache.hudi.common.util.ClusteringUtils.lambda$getAllPendingClusteringPlans$0(ClusteringUtils.java:67)
 at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
 at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
 at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
 at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
 at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
 at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
 at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
 at org.apache.hudi.common.util.ClusteringUtils.getAllFileGroupsInPendingClusteringPlans(ClusteringUtils.java:100)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:111)
 at org.apache.hudi.common.table.view.RocksDbBasedFileSystemView.init(RocksDbBasedFileSystemView.java:91)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.runSync(AbstractTableFileSystemView.java:1077)
 at org.apache.hudi.common.table.view.IncrementalTimelineSyncFileSystemView.runSync(IncrementalTimelineSyncFileSystemView.java:97)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.sync(AbstractTableFileSystemView.java:1059)
 at org.apache.hudi.timeline.service.FileSystemViewHandler.syncIfLocalViewBehind(FileSystemViewHandler.java:124)
 at org.apache.hudi.timeline.service.FileSystemViewHandler.access$100(FileSystemViewHandler.java:55)
 at org.apache.hudi.timeline.service.FileSystemViewHandler$ViewHandler.handle(FileSystemViewHandler.java:338)
 at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
 at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606)
 at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46)
 at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17)
 at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143)
 at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41)
 at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107)
 at io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1668)
 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
 at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:61) at 

[jira] [Assigned] (HUDI-1448) hudi dla sync skip rt create

2020-12-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1448:
---

Assignee: liwei

> hudi  dla sync skip rt create
> -
>
> Key: HUDI-1448
> URL: https://issues.apache.org/jira/browse/HUDI-1448
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1448) hudi dla sync skip rt create

2020-12-10 Thread liwei (Jira)
liwei created HUDI-1448:
---

 Summary: hudi  dla sync skip rt create
 Key: HUDI-1448
 URL: https://issues.apache.org/jira/browse/HUDI-1448
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning

2020-12-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1349.
---

> spark sql support overwrite use  replace action with dynamic partitioning
> -
>
> Key: HUDI-1349
> URL: https://issues.apache.org/jira/browse/HUDI-1349
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark SQL overwrite just does this:
> } else if (mode == SaveMode.Overwrite && tableExists) {
>  log.warn(s"hoodie table at $tablePath already exists. Deleting existing data 
> & overwriting with new data.")
>  fs.delete(tablePath, true)
>  tableExists = false
> }
> Overwrite needs to use the replace action instead (see the sketch below).
>  
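For reference, a minimal sketch of the replace-based overwrite on the 
datasource write path; the "insert_overwrite" operation value is the one 
introduced by this line of work (an assumption here), and df and tablePath are 
taken from context:

import org.apache.spark.sql.SaveMode

// Sketch: overwrite by committing a REPLACE action instead of deleting the
// table path; the old file groups are swapped out atomically on the timeline.
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_table") // hypothetical table name
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .mode(SaveMode.Overwrite)
  .save(tablePath)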



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning

2020-12-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-1349:
-

> spark sql support overwrite use  replace action with dynamic partitioning
> -
>
> Key: HUDI-1349
> URL: https://issues.apache.org/jira/browse/HUDI-1349
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark SQL overwrite just does this:
> } else if (mode == SaveMode.Overwrite && tableExists) {
>  log.warn(s"hoodie table at $tablePath already exists. Deleting existing data 
> & overwriting with new data.")
>  fs.delete(tablePath, true)
>  tableExists = false
> }
> Overwrite needs to use the replace action instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning

2020-12-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-1349.
-
Resolution: Fixed

> spark sql support overwrite use  replace action with dynamic partitioning
> -
>
> Key: HUDI-1349
> URL: https://issues.apache.org/jira/browse/HUDI-1349
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark SQL overwrite just does this:
> } else if (mode == SaveMode.Overwrite && tableExists) {
>  log.warn(s"hoodie table at $tablePath already exists. Deleting existing data 
> & overwriting with new data.")
>  fs.delete(tablePath, true)
>  tableExists = false
> }
> Overwrite needs to use the replace action instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning

2020-12-09 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1349:

Status: Closed  (was: Patch Available)

> spark sql support overwrite use  replace action with dynamic partitioning
> -
>
> Key: HUDI-1349
> URL: https://issues.apache.org/jira/browse/HUDI-1349
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark SQL overwrite just does this:
> } else if (mode == SaveMode.Overwrite && tableExists) {
>  log.warn(s"hoodie table at $tablePath already exists. Deleting existing data 
> & overwriting with new data.")
>  fs.delete(tablePath, true)
>  tableExists = false
> }
> Overwrite needs to use the replace action instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2020-12-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1437:
---

Assignee: liwei

> some description in spark ui  is not reality, Not good for performance 
> tracking
> ---
>
> Key: HUDI-1437
> URL: https://issues.apache.org/jira/browse/HUDI-1437
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Attachments: image-2020-12-07-23-50-57-212.png
>
>
> Some Spark actions in Hudi do not set a real job description, which is not 
> good for performance tracking.
>  
> !image-2020-12-07-23-50-57-212.png|width=693,height=375!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2020-12-07 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1437:

Description: 
Some Spark actions in Hudi do not set a real job description, which is not 
good for performance tracking.

 

!image-2020-12-07-23-50-57-212.png|width=693,height=375!

  was:
Some Spark actions in Hudi do not set a real job description, which is not 
good for performance tracking.

 

!image-2020-12-07-23-50-57-212.png!


> some description in spark ui  is not reality, Not good for performance 
> tracking
> ---
>
> Key: HUDI-1437
> URL: https://issues.apache.org/jira/browse/HUDI-1437
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Performance
>Reporter: liwei
>Priority: Major
> Attachments: image-2020-12-07-23-50-57-212.png
>
>
> Some Spark actions in Hudi do not set a real job description, which is not 
> good for performance tracking.
>  
> !image-2020-12-07-23-50-57-212.png|width=693,height=375!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2020-12-07 Thread liwei (Jira)
liwei created HUDI-1437:
---

 Summary: some description in spark ui  is not reality, Not good 
for performance tracking
 Key: HUDI-1437
 URL: https://issues.apache.org/jira/browse/HUDI-1437
 Project: Apache Hudi
  Issue Type: Bug
  Components: Performance
Reporter: liwei
 Attachments: image-2020-12-07-23-50-57-212.png

Some Spark actions in Hudi do not set a real job description, which is not 
good for performance tracking.

 

!image-2020-12-07-23-50-57-212.png!
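One likely shape of the fix, assuming Spark's standard SparkContext job-group 
APIs (setJobGroup / setJobDescription / clearJobGroup); the surrounding Hudi 
write action is illustrative only:

// Sketch: give Hudi-triggered Spark jobs a readable description in the UI.
val jsc = spark.sparkContext
jsc.setJobGroup("hudi-upsert", "Hudi: building workload profile for upsert")
try {
  writeStatusRDD.count() // hypothetical Hudi action that launches a Spark job
} finally {
  jsc.clearJobGroup()
}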



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1076) CLI tools to support clustering

2020-11-21 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1076:
---

Assignee: liwei

> CLI tools to support clustering
> ---
>
> Key: HUDI-1076
> URL: https://issues.apache.org/jira/browse/HUDI-1076
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>
> 1) schedule clustering
> 2) complete clustering
> 3) cancel clustering
> 4) rollback clustering



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

