[jira] [Closed] (HUDI-1264) incremental read support with replace
[ https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei closed HUDI-1264. --- > incremental read support with replace > - > > Key: HUDI-1264 > URL: https://issues.apache.org/jira/browse/HUDI-1264 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > > initial version, we could fail incremental reads if there is a REPLACE > instant. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-1264) incremental read support with replace
[ https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1264: Status: Closed (was: Patch Available) > incremental read support with replace > - > > Key: HUDI-1264 > URL: https://issues.apache.org/jira/browse/HUDI-1264 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > > initial version, we could fail incremental reads if there is a REPLACE > instant. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-1264) incremental read support with replace
[ https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1264. - > incremental read support with replace > - > > Key: HUDI-1264 > URL: https://issues.apache.org/jira/browse/HUDI-1264 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > > initial version, we could fail incremental reads if there is a REPLACE > instant. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (HUDI-1264) incremental read support with replace
[ https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reopened HUDI-1264: - > incremental read support with replace > - > > Key: HUDI-1264 > URL: https://issues.apache.org/jira/browse/HUDI-1264 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > > initial version, we could fail incremental reads if there is a REPLACE > instant. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-1307) spark datasource load path format is confused for snapshot and increment read mode
[ https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420424#comment-17420424 ] liwei commented on HUDI-1307: - [~xushiyan] hello, I have recently been focused on ingesting Kafka data with Hudi using clustering, with the online & offline analytics workloads scheduled on k8s, so this issue has not been updated. I think we can keep the glob path pattern around, but incremental mode and snapshot mode can be unified. :D
> spark datasource load path format is confused for snapshot and increment read mode
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: liwei
> Assignee: liwei
> Priority: Critical
> Labels: sev:high, user-support-issues
>
> When the Spark datasource reads a Hudi table:
> 1. Snapshot mode
> {code:java}
> val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
> {code}
> "/*" must be appended, otherwise the read fails: org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and without "/*" it cannot find the .hoodie and default directories:
> {code:java}
> val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
> {code}
> 2. Incremental mode
> Both basePath and basePath + "/*" work, because DataSourceUtils.getTablePath in org.apache.hudi.DefaultSource supports both formats.
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
>   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>   option(END_INSTANTTIME_OPT_KEY, endTime).
>   load(basePath)
> {code}
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
>   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>   option(END_INSTANTTIME_OPT_KEY, endTime).
>   load(basePath + "/*")
> {code}
>
> Because incremental mode and snapshot mode do not behave the same, users get confused. Loading with basePath + "/*" or "/*/*" is also confusing; I understand this exists to support partitions,
> but I think this API would be clearer for users:
> {code:java}
> partition = "year = '2019'"
> spark.read.format("hudi").load(path).where(partition)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
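The inconsistency described in the issue can be made concrete with a small sketch. This is purely illustrative — `LoadPathNormalizer` is a hypothetical helper, not part of Hudi — showing the kind of path normalization that would let both query modes accept a bare base path:

```java
// Hypothetical sketch, not Hudi's actual API: normalize a user-supplied load
// path so snapshot and incremental reads both accept the bare base path.
public class LoadPathNormalizer {
    public enum QueryType { SNAPSHOT, INCREMENTAL }

    // Snapshot reads need the glob suffix so fs.globStatus() can discover
    // partition directories; incremental reads resolve the table path
    // themselves, so the bare base path suffices.
    public static String normalize(String userPath, QueryType type) {
        boolean hasGlob = userPath.endsWith("/*");
        if (type == QueryType.SNAPSHOT && !hasGlob) {
            return userPath + "/*";
        }
        if (type == QueryType.INCREMENTAL && hasGlob) {
            return userPath.substring(0, userPath.length() - 2);
        }
        return userPath;
    }
}
```

With something like this in DefaultSource, `load(basePath)` and `load(basePath + "/*")` would behave identically in both modes, which is the unification the comment above suggests.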
[jira] [Comment Edited] (HUDI-52) Implement Savepoints for Merge On Read table #88
[ https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405053#comment-17405053 ] liwei edited comment on HUDI-52 at 8/26/21, 8:21 AM: - [~vinoth] hello, we are going to build table backup and recovery based on savepoints and a WAL, but MOR mode does not support savepoints yet. Are there any known blockers for this? :) was (Author: 309637554): [~vinoth] hello, We are going to build table backup and recovery based on savepoint and wal, but the discovery MOR mode does not support savepoint now. Are there any blocking problems about this? :)
> Implement Savepoints for Merge On Read table #88
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Storage Management, Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: liwei
> Priority: Major
> Labels: help-requested, starter
> Fix For: 0.10.0
>
> https://github.com/uber/hudi/issues/88
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-52) Implement Savepoints for Merge On Read table #88
[ https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405053#comment-17405053 ] liwei commented on HUDI-52: --- [~vinoth] hello, we are going to build table backup and recovery based on savepoints and a WAL, but we discovered that MOR mode does not support savepoints yet. Are there any known blockers for this? :)
> Implement Savepoints for Merge On Read table #88
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Storage Management, Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: liwei
> Priority: Major
> Labels: help-requested, starter
> Fix For: 0.10.0
>
> https://github.com/uber/hudi/issues/88
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2355) after clustering with archive meet data incorrect
[ https://issues.apache.org/jira/browse/HUDI-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-2355: --- Assignee: liwei
> after clustering with archive meet data incorrect
>
> Key: HUDI-2355
> URL: https://issues.apache.org/jira/browse/HUDI-2355
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: liwei
> Assignee: liwei
> Priority: Major
>
> After [https://github.com/apache/hudi/pull/3310], replaced data files are removed during the clean action. But if the replacecommit file has already been deleted (archived), clean can no longer read which data files were replaced.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2355) after clustering with archive meet data incorrect
liwei created HUDI-2355: --- Summary: after clustering with archive meet data incorrect Key: HUDI-2355 URL: https://issues.apache.org/jira/browse/HUDI-2355 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei
After [https://github.com/apache/hudi/pull/3310], replaced data files are removed during the clean action. But if the replacecommit file has already been deleted (archived), clean can no longer read which data files were replaced.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2354) archive delete replacecommit, but stop timeline server meet file not found
liwei created HUDI-2354: --- Summary: archive delete replacecommit, but stop timeline server meet file not found Key: HUDI-2354 URL: https://issues.apache.org/jira/browse/HUDI-2354 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei

1. In the Spark write client, the post-commit step archives replacecommit instants that meet the archival criteria:

21/08/23 14:57:12 INFO HoodieTimelineArchiveLog: Archived and deleted instant file .hoodie/20210823114552.commit
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant file .hoodie/20210823114553.replacecommit.requested
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant file .hoodie/20210823114553.replacecommit.inflight
21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant file .hoodie/20210823114553.replacecommit

2. If the embedded timeline service is started, it is stopped after the Spark SQL write's post-commit. HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106) needs to read the replace instant metadata, but the replace instant file has been deleted while the in-memory timeline was not refreshed:

org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hudi.exception.HoodieIOException: Could not read commit details from .hoodie/20210823114553.replacecommit
at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:555)
at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:219)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217)
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
at org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353)
at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
at org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118)
at org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:207)
at org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:121)
at org.apache.hudi.client.AbstractHoodieClient.stopEmbeddedServerView(AbstractHoodieClient.java:94)
at org.apache.hudi.client.AbstractHoodieClient.close(AbstractHoodieClient.java:86)
at org.apache.hudi.client.AbstractHoodieWriteClient.close(AbstractHoodieWriteClient.java:1094)
at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:509)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:226)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) at
[jira] [Assigned] (HUDI-2354) archive delete replacecommit, but stop timeline server meet file not found
[ https://issues.apache.org/jira/browse/HUDI-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-2354: --- Assignee: liwei > archive delete replacecommit, but stop timeline server meet file not found > -- > > Key: HUDI-2354 > URL: https://issues.apache.org/jira/browse/HUDI-2354 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Major > > 1、in spark writeclient postcommit will archive replacecommit which meet the > archive Requirement > 21/08/23 14:57:12 INFO HoodieTimelineArchiveLog: Archived and deleted instant > file .hoodie/20210823114552.commit > 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant > file .hoodie/20210823114553.replacecommit.requested > 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant > file .hoodie/20210823114553.replacecommit.inflight > 21/08/23 14:57:13 INFO HoodieTimelineArchiveLog: Archived and deleted instant > file .hoodie/20210823114553.replacecommit > > 2、if you start timelineservice, after sparksqlwrite post commit it will stop > . 
In HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106) need > to read replace instant metadata , but the replace instant file is delete , > but the timeline not update > > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193) > Caused by: org.apache.hudi.exception.HoodieIOException: Could not read commit > details from .hoodie/20210823114553.replacecommit > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:555) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:219) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217) > at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106) > at > org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248) > at > 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353) > at > java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707) > at > org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118) > at > org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:207) > at > org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:121) > at > org.apache.hudi.client.AbstractHoodieClient.stopEmbeddedServerView(AbstractHoodieClient.java:94) > at > org.apache.hudi.client.AbstractHoodieClient.close(AbstractHoodieClient.java:86) > at > org.apache.hudi.client.AbstractHoodieWriteClient.close(AbstractHoodieWriteClient.java:1094) > at > org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:509) > at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:226) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at >
[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug
[ https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-2301: Status: In Progress (was: Open)
> fix FileSliceMetrics utils bug
>
> Key: HUDI-2301
> URL: https://issues.apache.org/jira/browse/HUDI-2301
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: liwei
> Assignee: WangZhongze
> Priority: Major
> Labels: pull-request-available
>
> Fixes a metrics calculation error: in the original code, totalReadIO and totalWriteIO end up holding the size of only one of the file slices instead of the sum over all slices.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
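The bug described in HUDI-2301 can be illustrated with a minimal sketch (the class and field names here are illustrative stand-ins, not Hudi's actual FileSliceMetrics code): the buggy form assigns instead of accumulating, so only the last file slice's size survives.

```java
import java.util.List;

public class FileSliceMetricsSketch {
    // Minimal stand-in for a file slice: one base file plus its log files.
    static class FileSlice {
        final long baseFileBytes;
        final long logFilesBytes;
        FileSlice(long baseFileBytes, long logFilesBytes) {
            this.baseFileBytes = baseFileBytes;
            this.logFilesBytes = logFilesBytes;
        }
    }

    // Buggy shape: '=' overwrites the running total each iteration, so the
    // result reflects only the last slice visited.
    static long totalReadIoBuggy(List<FileSlice> slices) {
        long total = 0;
        for (FileSlice s : slices) {
            total = s.baseFileBytes + s.logFilesBytes;
        }
        return total;
    }

    // Fixed shape: accumulate the size of every slice.
    static long totalReadIoFixed(List<FileSlice> slices) {
        long total = 0;
        for (FileSlice s : slices) {
            total += s.baseFileBytes + s.logFilesBytes;
        }
        return total;
    }
}
```

For two slices of 120 and 60 bytes total, the buggy form reports 60 while the fixed form reports 180.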
[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug
[ https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-2301: Status: Closed (was: Patch Available) > fix FileSliceMetrics utils bug > -- > > Key: HUDI-2301 > URL: https://issues.apache.org/jira/browse/HUDI-2301 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: WangZhongze >Priority: Major > Labels: pull-request-available > > Fix bug of metrics calculation error > In the original code, the calculation of totalReadIO and totalWriteIO will > only obtain the size of one of the Fileslice -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2301) fix FileSliceMetrics utils bug
[ https://issues.apache.org/jira/browse/HUDI-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-2301: Status: Patch Available (was: In Progress) > fix FileSliceMetrics utils bug > -- > > Key: HUDI-2301 > URL: https://issues.apache.org/jira/browse/HUDI-2301 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: WangZhongze >Priority: Major > Labels: pull-request-available > > Fix bug of metrics calculation error > In the original code, the calculation of totalReadIO and totalWriteIO will > only obtain the size of one of the Fileslice -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2301) fix FileSliceMetrics utils bug
liwei created HUDI-2301: --- Summary: fix FileSliceMetrics utils bug Key: HUDI-2301 URL: https://issues.apache.org/jira/browse/HUDI-2301 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2300) add ClusteringPlanStrategy unit test
liwei created HUDI-2300: --- Summary: add ClusteringPlanStrategy unit test Key: HUDI-2300 URL: https://issues.apache.org/jira/browse/HUDI-2300 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1468) incremental read support with clustering
[ https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377966#comment-17377966 ] liwei commented on HUDI-1468: - [~vinoth] hello, has [https://github.com/apache/hudi/pull/3139/files] landed? If so, can this issue be closed?
> incremental read support with clustering
>
> Key: HUDI-1468
> URL: https://issues.apache.org/jira/browse/HUDI-1468
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Incremental Pull
> Affects Versions: 0.9.0
> Reporter: satish
> Assignee: liwei
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.9.0
>
> As part of clustering, metadata such as hoodie_commit_time changes for records that are clustered. This is specific to the SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to carry the commit_time from the original record to support incremental queries.
> Also, incremental queries don't work with the 'replacecommit' used by clustering (HUDI-1264). Change incremental queries to work for replacecommits created by clustering.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1042) [Umbrella] Support clustering on filegroups
[ https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1042: --- Assignee: liwei (was: leesf) > [Umbrella] Support clustering on filegroups > --- > > Key: HUDI-1042 > URL: https://issues.apache.org/jira/browse/HUDI-1042 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: leesf >Assignee: liwei >Priority: Major > Labels: hudi-umbrellas > Fix For: 0.9.0 > > > please see > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1043) Support clustering in CoW mode
[ https://issues.apache.org/jira/browse/HUDI-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1043: --- Assignee: liwei (was: leesf) > Support clustering in CoW mode > -- > > Key: HUDI-1043 > URL: https://issues.apache.org/jira/browse/HUDI-1043 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: liwei >Priority: Major > > updates are not allowed during clustering -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363315#comment-17363315 ] liwei commented on HUDI-1138: - [~guoyihua] thanks. "We may consider blocking the requests for batching so that the timeline server sends the actual responses only after MARKERS are overwritten / updated." If each request has to wait for its batch to be overwritten/updated successfully, the create-marker request from a Spark task may wait a long time: the batching interval (say 200ms) plus the time to read and overwrite the marker files. Do you have a plan for how the marker file will be updated?
> Re-implement marker files via timeline server
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.9.0
>
> Even if you can argue that RFC-15/consolidated metadata removes the need for deleting partial files written due to Spark task failures/stage retries, it will still leave extra files inside the table (and users will pay for them every month), so we need the marker mechanism to be able to delete these partial files.
> Here we explore whether we can improve the current marker file mechanism, which creates one marker file per data file written, by delegating the createMarker() call to the driver/timeline server and having it write marker metadata into a single file handle that is flushed for durability guarantees.
>
> P.S: I was tempted to think the Spark listener mechanism can help us deal with failed tasks, but it has no guarantees: the writer job could die without deleting a partial file. I.e. it can improve things, but can't provide guarantees.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362535#comment-17362535 ] liwei commented on HUDI-1138: - [~guoyihua] [~vinoth] :D hello, I have a question about "When using S3, use overwrite operation for MARKERS file, and batch requests within an interval, say a few hundred milliseconds (configurable)." If the timeline server crashes before it overwrites the MARKERS file with the latest batch of requests, won't the files from that latest batch be missed by rollback?
> Re-implement marker files via timeline server
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.9.0
>
> Even if you can argue that RFC-15/consolidated metadata removes the need for deleting partial files written due to Spark task failures/stage retries, it will still leave extra files inside the table (and users will pay for them every month), so we need the marker mechanism to be able to delete these partial files.
> Here we explore whether we can improve the current marker file mechanism, which creates one marker file per data file written, by delegating the createMarker() call to the driver/timeline server and having it write marker metadata into a single file handle that is flushed for durability guarantees.
>
> P.S: I was tempted to think the Spark listener mechanism can help us deal with failed tasks, but it has no guarantees: the writer job could die without deleting a partial file. I.e. it can improve things, but can't provide guarantees.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
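A minimal sketch of the batching scheme under discussion may help (illustrative only; `MarkerBatcher` is a hypothetical class, not the actual timeline-server implementation). It makes both concerns raised in these comments visible: a request is only durably recorded once the flush covering it completes, and anything still pending when the server dies is lost.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: marker-create requests are queued and periodically
// flushed by overwriting a single MARKERS "file" (modeled here as a set)
// with the full accumulated marker set, instead of creating one marker
// file per data file written.
public class MarkerBatcher {
    private final Set<String> markersFile = new LinkedHashSet<>(); // stands in for the MARKERS file
    private final List<String> pending = new ArrayList<>();

    // Called by the write path instead of creating a marker file directly.
    public synchronized void createMarker(String markerName) {
        pending.add(markerName);
    }

    // Called on a timer (say every few hundred milliseconds): overwrite the
    // MARKERS file with everything accumulated so far. A request is only
    // safely recorded once the flush covering it completes; pending entries
    // are lost if the server crashes first.
    public synchronized int flush() {
        markersFile.addAll(pending);
        int flushed = pending.size();
        pending.clear();
        return flushed;
    }

    public synchronized Set<String> markersOnStorage() {
        return new LinkedHashSet<>(markersFile);
    }
}
```

The flush interval bounds the extra latency seen by each write task (the concern in the first comment), while the pending queue is exactly the window in which a crash loses markers (the concern in the second).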
[jira] [Resolved] (HUDI-1768) spark datasource support schema validate add column
[ https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1768. - Resolution: Fixed
> spark datasource support schema validate add column
>
> Key: HUDI-1768
> URL: https://issues.apache.org/jira/browse/HUDI-1768
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Labels: pull-request-available
>
> The Spark datasource currently does not support setting an Avro column default value; instead it marks the column as nullable, and SchemaConverters.toAvroType transforms it into a union type containing null, such as:
> Registered avro schema:
> {
> "type" : "record",
> "name" : "hoodie_test_record",
> "namespace" : "hoodie.hoodie_test",
> "fields" : [ {
> "name" : "_row_key",
> "type" : [ "string", "null" ]
> }, {
> "name" : "name",
> "type" : [ "string", "null" ]
> }, {
> "name" : "timestamp",
> "type" : [ "int", "null" ]
> }, {
> "name" : "partition",
> "type" : [ "int", "null" ]
> } ] }
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1295) RFC-15: Track bloom filters as a part of metadata table
[ https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338774#comment-17338774 ] liwei commented on HUDI-1295: - [~vinoth] got it. Thanks
> RFC-15: Track bloom filters as a part of metadata table
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.9.0
>
> Idea here is to maintain our bloom filters outside of parquet for speedier access from the bloom index.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1295) RFC-15: Track bloom filters as a part of metadata table
[ https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337947#comment-17337947 ] liwei commented on HUDI-1295: - [~vinoth] hello, should we land this as part of [RFC-27|https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance], or can it land now using the metadata table? :D
> RFC-15: Track bloom filters as a part of metadata table
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.9.0
>
> Idea here is to maintain our bloom filters outside of parquet for speedier access from the bloom index.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329158#comment-17329158 ] liwei commented on HUDI-1138: - [~vinoth] ok, i will move the discussion to RFC-27.:D > Re-implement marker files via timeline server > - > > Key: HUDI-1138 > URL: https://issues.apache.org/jira/browse/HUDI-1138 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > Even as you can argue that RFC-15/consolidated metadata, removes the need for > deleting partial files written due to spark task failures/stage retries. It > will still leave extra files inside the table (and users will pay for it > every month) and we need the marker mechanism to be able to delete these > partial files. > Here we explore if we can improve the current marker file mechanism, that > creates one marker file per data file written, by > Delegating the createMarker() call to the driver/timeline server, and have it > create marker metadata into a single file handle, that is flushed for > durability guarantees > > P.S: I was tempted to think Spark listener mechanism can help us deal with > failed tasks, but it has no guarantees. the writer job could die without > deleting a partial file. i.e it can improve things, but cant provide > guarantees -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327052#comment-17327052 ] liwei commented on HUDI-1138: - [~vinoth] thanks.
1. I have an idea: can we write this file into the metadata table from the timeline server, so that the meta info is unified in the metadata table?
2. Rollback is not a frequent action, so we should PoC the performance first.
3. I have also been researching RFC-27 recently. I think we could unify metadata such as partitions, marker files, statistics, and indexes, just as Delta Lake stores these in the delta log and Snowflake uses a metadata service. A unified metadata table can address the poor metadata management of cloud storage as well as compute and storage query performance. I think RFC-27, RFC-15, and RFC-08 have some overlap. I'd like to discuss this with you! Thanks
> Re-implement marker files via timeline server
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Priority: Blocker
> Fix For: 0.9.0
>
> Even if you can argue that RFC-15/consolidated metadata removes the need for deleting partial files written due to Spark task failures/stage retries, it will still leave extra files inside the table (and users will pay for them every month), so we need the marker mechanism to be able to delete these partial files.
> Here we explore whether we can improve the current marker file mechanism, which creates one marker file per data file written, by delegating the createMarker() call to the driver/timeline server and having it write marker metadata into a single file handle that is flushed for durability guarantees.
>
> P.S: I was tempted to think the Spark listener mechanism can help us deal with failed tasks, but it has no guarantees: the writer job could die without deleting a partial file. I.e. it can improve things, but can't provide guarantees.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324649#comment-17324649 ] liwei commented on HUDI-1138: - [~uditme] [~vinoth] I also think listing will be a performance improvement point; in cloud storage such as S3 and Alibaba Cloud OSS, listing is expensive and slow. Can we use the approach from the P.S. above (the Spark listener mechanism, which can improve things but cannot provide guarantees, since the writer job could die without deleting a partial file) and delete the residual files during clean? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-897: --- Status: In Progress (was: Open) > hudi support log append scenario with better write and asynchronous compaction > -- > > Key: HUDI-897 > URL: https://issues.apache.org/jira/browse/HUDI-897 > Project: Apache Hudi > Issue Type: Improvement > Components: Compaction, Performance >Affects Versions: 0.9.0 >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.9.0 > > Attachments: image-2020-05-14-19-51-37-938.png, > image-2020-05-14-20-14-59-429.png > > > 1. Scenario > The business scenarios of a data lake mainly include analysis of databases, logs, and files. > !image-2020-05-14-20-14-59-429.png|width=444,height=286! > Databricks Delta Lake also aims at these three scenarios. [1] > > 2. Hudi's current situation > At present, Hudi supports the scenario of database CDC incrementally written into Hudi fairly well, and bulk-loading files into Hudi is also being worked on. However, there is no good native support for log scenarios (which require high-throughput writes, have no updates or deletes, and involve many small files); logs can be written as inserts without deduplication, but they will still be merged on the write side. > * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100MB, every small batch costs time to merge, which reduces write throughput. > * This scenario is not a good fit for merge-on-read. > * The actual scenario only needs to write Parquet in batches on the write path, and then provide compaction afterwards (similar to Delta Lake). > > 3. What we can do > > 1. On the write side, just write every batch to a Parquet file based on the snapshot mechanism; merging stays enabled by default, and users can disable auto-merge for more write throughput. > 2. Hudi supports asynchronously merging small Parquet files, like Databricks Delta Lake's OPTIMIZE command [2] > > [1] [https://databricks.com/product/delta-lake-on-databricks] > [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
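The asynchronous small-file merge described in HUDI-897 (similar to Delta Lake's OPTIMIZE) amounts to bin-packing small files up to a target file size. A minimal sketch with a simple first-fit-decreasing strategy; the file sizes and the 100MB target are illustrative, not Hudi's actual planner:

```python
def plan_merge(file_sizes_mb, target_mb=100):
    """Greedy first-fit-decreasing: group small files into groups of
    roughly target_mb, so N small files become M larger files (M < N).
    file_sizes_mb is a hypothetical list of small-file sizes in MB."""
    groups = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            # Place the file into the first group that still has room.
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No group has room: start a new output file.
            groups.append([size])
    return groups
```

Each resulting group would then be rewritten as one larger Parquet file by the asynchronous job, off the write path.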
[jira] [Commented] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324647#comment-17324647 ] liwei commented on HUDI-897: okay -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-897. Resolution: Fixed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1674) add partition level delete DOC or example
[ https://issues.apache.org/jira/browse/HUDI-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316820#comment-17316820 ] liwei commented on HUDI-1674: - [~shivnarayan] The Spark DataSource API does not have a delete-partition API; it needs to go through the catalog. https://stackoverflow.com/questions/52531327/drop-partitions-from-spark After [https://github.com/apache/hudi/pull/2645] has landed, we can support 'alter table xx drop partition ()'. > add partition level delete DOC or example > - > > Key: HUDI-1674 > URL: https://issues.apache.org/jira/browse/HUDI-1674 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Priority: Minor > Labels: docs, user-support-issues > Attachments: image-2021-03-08-09-57-05-768.png > > > !image-2021-03-08-09-57-05-768.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1768) spark datasource support schema validate add column
[ https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1768: Issue Type: Improvement (was: Bug) > spark datasource support schema validate add column > > > Key: HUDI-1768 > URL: https://issues.apache.org/jira/browse/HUDI-1768 > Project: Apache Hudi > Issue Type: Improvement >Reporter: liwei >Assignee: liwei >Priority: Major > > The Spark datasource currently does not support setting an Avro column default value; instead it marks the column as nullable, and SchemaConverters.toAvroType transforms it into a union type containing null, such as: > Registered avro schema : { > "type" : "record", > "name" : "hoodie_test_record", > "namespace" : "hoodie.hoodie_test", > "fields" : [ { > "name" : "_row_key", > "type" : [ "string", "null" ] > }, { > "name" : "name", > "type" : [ "string", "null" ] > }, { > "name" : "timestamp", > "type" : [ "int", "null" ] > }, { > "name" : "partition", > "type" : [ "int", "null" ] > } ] > } > -- This message was sent by Atlassian Jira (v8.3.4#803005)
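The behavior described in HUDI-1768 — nullable columns becoming ["type", "null"] unions rather than fields carrying a default value — can be illustrated with a toy converter. This is a simplification for illustration, not Spark's real SchemaConverters.toAvroType implementation:

```python
def to_avro_field(name, avro_type, nullable):
    """Toy stand-in for the conversion: a nullable column becomes an
    Avro union with "null", and no "default" entry is attached."""
    if nullable:
        return {"name": name, "type": [avro_type, "null"]}
    return {"name": name, "type": avro_type}

# Reconstructing the shape of the schema quoted in the issue.
schema = {
    "type": "record",
    "name": "hoodie_test_record",
    "namespace": "hoodie.hoodie_test",
    "fields": [
        to_avro_field("_row_key", "string", True),
        to_avro_field("name", "string", True),
        to_avro_field("timestamp", "int", True),
        to_avro_field("partition", "int", True),
    ],
}
```

Schema-validation logic that expects added columns to carry an explicit Avro default therefore sees only the union-with-null form.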
[jira] [Assigned] (HUDI-1768) spark datasource support schema validate add column
[ https://issues.apache.org/jira/browse/HUDI-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1768: --- Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1768) spark datasource support schema validate add column
liwei created HUDI-1768: --- Summary: spark datasource support schema validate add column Key: HUDI-1768 URL: https://issues.apache.org/jira/browse/HUDI-1768 Project: Apache Hudi Issue Type: Bug Reporter: liwei spark datasource now not support set avro column default value. but it set column to nullable and use SchemaConverters.toAvroType transform to union type which has null, such as : Registered avro schema : { "type" : "record", "name" : "hoodie_test_record", "namespace" : "hoodie.hoodie_test", "fields" : [ { "name" : "_row_key", "type" : [ "string", "null" ] }, { "name" : "name", "type" : [ "string", "null" ] }, { "name" : "timestamp", "type" : [ "int", "null" ] }, { "name" : "partition", "type" : [ "int", "null" ] } ] } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1590) Support async clustering w/ test suite job
[ https://issues.apache.org/jira/browse/HUDI-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297421#comment-17297421 ] liwei commented on HUDI-1590: - [~legendtkl] try it . :D > Support async clustering w/ test suite job > -- > > Key: HUDI-1590 > URL: https://issues.apache.org/jira/browse/HUDI-1590 > Project: Apache Hudi > Issue Type: Improvement > Components: Testing >Reporter: sivabalan narayanan >Assignee: Kelu Tao >Priority: Major > Fix For: 0.8.0 > > > As of now, we only have inline clustering support w/ hoodie test suite job. > we need to add support for async clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1674) add partition level delete DOC or example
liwei created HUDI-1674: --- Summary: add partition level delete DOC or example Key: HUDI-1674 URL: https://issues.apache.org/jira/browse/HUDI-1674 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei Attachments: image-2021-03-08-09-57-05-768.png !image-2021-03-08-09-57-05-768.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord
[ https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292942#comment-17292942 ] liwei commented on HUDI-797: [~pwason] Hello, I have also hit this performance problem: when I ingest large logs, HoodieCreateHandle.rewrite() is very slow, writing only about 2MB/s to the object store. Do we have any other method to solve this problem? :) > Improve performance of rewriting AVRO records in > HoodieAvroUtils::rewriteRecord > --- > > Key: HUDI-797 > URL: https://issues.apache.org/jira/browse/HUDI-797 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO > encoded records. These records have a [schema > |https://avro.apache.org/docs/current/spec.html]which is determined by the > dataset user and provided to HUDI during the writing process (as part of > HUDIWriteConfig). The records are finally saved in [parquet > |https://parquet.apache.org/]files which include the schema (in parquet > format) in the footer of individual files. > > HUDI design requires addition of some metadata fields to all incoming records > to aid in book-keeping and indexing. To achieve this, the incoming schema > needs to be modified by adding the HUDI metadata fields and is called the > HUDI schema for the dataset. Each incoming record is then re-written to > translate it from the incoming schema into the HUDI schema. Re-writing the > incoming records to a new schema is reasonably fast as it looks up all fields > in the incoming record and adds them to a new record, but this takes > place for each and every incoming record. > When ingesting large datasets (billions of records) or a large number of > datasets, even small improvements in the CPU-bound conversion can translate > into notable improvements in compute efficiency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
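One way to reduce the per-record CPU cost described in HUDI-797 is to precompute the field layout between the incoming schema and the Hudi schema once, instead of looking fields up by name for every record. An illustrative sketch, not the actual HoodieAvroUtils change; the meta field list is a subset for illustration:

```python
HUDI_META_FIELDS = ["_hoodie_commit_time", "_hoodie_record_key"]  # subset, for illustration

def make_rewriter(incoming_fields):
    """Build the target field layout once per schema, not per record.
    Meta fields get position None (filled from meta_values at write
    time); data fields are copied by precomputed index."""
    positions = ([(name, None) for name in HUDI_META_FIELDS]
                 + [(name, i) for i, name in enumerate(incoming_fields)])

    def rewrite(record, meta_values):
        # Per-record work is now a flat loop over precomputed positions,
        # with no by-name schema lookups.
        return [meta_values[name] if pos is None else record[pos]
                for name, pos in positions]

    return rewrite
```

The schema-dependent work moves from the per-record path into `make_rewriter`, which runs once per write handle.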
[jira] [Resolved] (HUDI-1520) add configure for spark sql overwrite use replace
[ https://issues.apache.org/jira/browse/HUDI-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1520. - Resolution: Fixed > add configure for spark sql overwrite use replace > - > > Key: HUDI-1520 > URL: https://issues.apache.org/jira/browse/HUDI-1520 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1520) add configure for spark sql overwrite use replace
liwei created HUDI-1520: --- Summary: add configure for spark sql overwrite use replace Key: HUDI-1520 URL: https://issues.apache.org/jira/browse/HUDI-1520 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1399) support a independent clustering spark job to asynchronously clustering
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei closed HUDI-1399. --- > support a independent clustering spark job to asynchronously clustering > > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1399) support a independent clustering spark job to asynchronously clustering
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1399. - Resolution: Fixed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1399) support a independent clustering spark job to asynchronously clustering
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1399: Status: Closed (was: Patch Available) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-1399) support a independent clustering spark job to asynchronously clustering
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reopened HUDI-1399: - -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1516) refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java
[ https://issues.apache.org/jira/browse/HUDI-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1516: --- Assignee: liwei > refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java > --- > > Key: HUDI-1516 > URL: https://issues.apache.org/jira/browse/HUDI-1516 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer >Reporter: liwei >Assignee: liwei >Priority: Major > > I'm worried this is polluting the tests in hoodie delta streamer, is this > test case to be refactored after deltastreamer natively supports clustering > https://github.com/apache/hudi/pull/2379 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1516) refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java
liwei created HUDI-1516: --- Summary: refactored testHoodieAsyncClusteringJob in TestHoodieDeltaStreamer.java Key: HUDI-1516 URL: https://issues.apache.org/jira/browse/HUDI-1516 Project: Apache Hudi Issue Type: Sub-task Components: DeltaStreamer Reporter: liwei I'm worried this is polluting the tests in hoodie delta streamer, is this test case to be refactored after deltastreamer natively supports clustering https://github.com/apache/hudi/pull/2379 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1482) async clustering for spark streaming
[ https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1482: Status: Open (was: New) > async clustering for spark streaming > > > Key: HUDI-1482 > URL: https://issues.apache.org/jira/browse/HUDI-1482 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1482) async clustering for spark streaming
[ https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1482: Status: In Progress (was: Open) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1500) support incremental read clustering commit in deltastreamer
liwei created HUDI-1500: --- Summary: support incremental read clustering commit in deltastreamer Key: HUDI-1500 URL: https://issues.apache.org/jira/browse/HUDI-1500 Project: Apache Hudi Issue Type: Sub-task Components: DeltaStreamer Reporter: liwei Currently, DeltaSync.readFromSource() cannot read the last instant when it is a replace commit, such as one produced by clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1481. - Resolution: Fixed > support inline clustering unit tests for spark datasource and deltastreamer > --- > > Key: HUDI-1481 > URL: https://issues.apache.org/jira/browse/HUDI-1481 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1354) Block updates and replace on file groups in clustering
[ https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1354. - Resolution: Fixed > Block updates and replace on file groups in clustering > -- > > Key: HUDI-1354 > URL: https://issues.apache.org/jira/browse/HUDI-1354 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Blocker > Labels: pull-request-available > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1350. - Resolution: Fixed > Support Partition level delete API in HUDI on top on Insert Overwrite > - > > Key: HUDI-1350 > URL: https://issues.apache.org/jira/browse/HUDI-1350 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: liwei >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1354) Block updates and replace on file groups in clustering
[ https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1354: Status: Closed (was: Patch Available) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1354) Block updates and replace on file groups in clustering
[ https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1354: Status: Patch Available (was: In Progress) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-1354) Block updates and replace on file groups in clustering
[ https://issues.apache.org/jira/browse/HUDI-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reopened HUDI-1354: - -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1498) Always read clustering plan from requested file
[ https://issues.apache.org/jira/browse/HUDI-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1498: Status: Open (was: New) > Always read clustering plan from requested file > --- > > Key: HUDI-1498 > URL: https://issues.apache.org/jira/browse/HUDI-1498 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Blocker > Labels: pull-request-available > Fix For: 0.7.0 > > > A clustering inflight instant does not have a 'ClusteringPlan'; read the content from the > corresponding requested file to make updates work -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1350: Status: Patch Available (was: In Progress) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1350: Status: Closed (was: Patch Available) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reopened HUDI-1350: - -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei closed HUDI-1074. --- Resolution: Fixed > implement merge-sort based clustering strategy > -- > > Key: HUDI-1074 > URL: https://issues.apache.org/jira/browse/HUDI-1074 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Major > > implement a merge-sort based clustering algorithm. Example: i) sort all small > files by specified column(s) ii) merge N small files into M larger files by > respecting sort order (M < N) -- This message was sent by Atlassian Jira (v8.3.4#803005)
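The merge-sort based strategy in HUDI-1074 — sort small files by specified column(s), then merge N small files into M larger files while respecting sort order — maps naturally onto a k-way merge. A minimal sketch over in-memory "files" (lists of sorted keys); a real implementation would stream sorted Parquet files rather than lists:

```python
import heapq

def merge_sorted_files(sorted_files, records_per_output):
    """K-way merge N sorted inputs (here, in-memory lists standing in
    for sorted small files), then chunk the result into M larger
    outputs that each preserve the sort order."""
    merged = list(heapq.merge(*sorted_files))
    return [merged[i:i + records_per_output]
            for i in range(0, len(merged), records_per_output)]
```

Because heapq.merge consumes its inputs lazily, a streaming variant could write each output file incrementally without materializing the full merged sequence.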
[jira] [Updated] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1074: Status: Open (was: New) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253874#comment-17253874 ] liwei edited comment on HUDI-1074 at 12/23/20, 3:54 AM: [~satishkotha] Got it. I will run some performance tests soon and then decide whether we need to optimize. was (Author: 309637554): [~satishkotha] got it -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253874#comment-17253874 ] liwei commented on HUDI-1074: - [~satishkotha] got it -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1042) [Umbrella] Support clustering on filegroups
[ https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253814#comment-17253814 ] liwei edited comment on HUDI-1042 at 12/23/20, 12:49 AM: - okay, I have some issues in progress. was (Author: 309637554): okay > [Umbrella] Support clustering on filegroups > --- > > Key: HUDI-1042 > URL: https://issues.apache.org/jira/browse/HUDI-1042 > Project: Apache Hudi > Issue Type: Bug >Reporter: leesf >Assignee: leesf >Priority: Major > Fix For: 0.7.0 > > > please see > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253815#comment-17253815 ] liwei commented on HUDI-1074: - got it, I have not begun yet. Has [https://github.com/apache/hudi/pull/2263] resolved this issue, or do we need to implement a more complete strategy? > implement merge-sort based clustering strategy > -- > > Key: HUDI-1074 > URL: https://issues.apache.org/jira/browse/HUDI-1074 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Major > > implement a merge-sort based clustering algorithm. Example: i) sort all small > files by specified column(s) ii) merge N small files into M larger files by > respecting sort order (M < N) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1074) implement merge-sort based clustering strategy
[ https://issues.apache.org/jira/browse/HUDI-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253815#comment-17253815 ] liwei edited comment on HUDI-1074 at 12/23/20, 12:48 AM: - [~satishkotha] got it, I have not begun yet. Has [https://github.com/apache/hudi/pull/2263] resolved this issue, or do we need to implement a more complete strategy? was (Author: 309637554): got it , i have not begin. if [https://github.com/apache/hudi/pull/2263] have resolved the issue? or need implement a more complete strategy? > implement merge-sort based clustering strategy > -- > > Key: HUDI-1074 > URL: https://issues.apache.org/jira/browse/HUDI-1074 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Major > > implement a merge-sort based clustering algorithm. Example: i) sort all small > files by specified column(s) ii) merge N small files into M larger files by > respecting sort order (M < N) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1042) [Umbrella] Support clustering on filegroups
[ https://issues.apache.org/jira/browse/HUDI-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253814#comment-17253814 ] liwei commented on HUDI-1042: - okay > [Umbrella] Support clustering on filegroups > --- > > Key: HUDI-1042 > URL: https://issues.apache.org/jira/browse/HUDI-1042 > Project: Apache Hudi > Issue Type: Bug >Reporter: leesf >Assignee: leesf >Priority: Major > Fix For: 0.7.0 > > > please see > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1487) after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly
[ https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1487: Description: TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert commit is added before the incremental read. // pull the latest commit val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) .load(basePath) the new commit is: // Upsert based on the written table with Hudi metadata columns val verificationRowKey = snapshotDF1.limit(1).select("_row_key").first.getString(0) since verificationRowKey is included in "uniqueKeyCnt", the test fails with: "expected: <65> but was: <66>" [https://travis-ci.com/github/apache/hudi/jobs/463879606] org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) ... 30 more [WARN ] 2020-12-22 12:32:40,788 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage Time elapsed: 15.275 s <<< FAILURE! org.opentest4j.AssertionFailedError: expected: <65> but was: <66> at org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base File Only View. 
[WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base File Only View. was: TestCOWDataSource.testCopyOnWriteStorage will failed random. Because before the incremental read, add a new upsert commit. // pull the latest commit val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) .load(basePath) the new commit is : // Upsert based on the written table with Hudi metadata columns val verificationRowKey = snapshotDF1.limit(1).select("_row_key").first.getString(0) as verificationRowKey will contains in "uniqueKeyCnt", so will failed as : "expected: <65> but was: <66>" org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) ... 30 more [WARN ] 2020-12-22 12:32:40,788 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 s <<< FAILURE! 
- in org.apache.hudi.functional.TestCOWDataSource [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage Time elapsed: 15.275 s <<< FAILURE! org.opentest4j.AssertionFailedError: expected: <65> but was: <66> at org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base File
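The failure mode described in this ticket can be illustrated in isolation: an incremental read returns every record committed after the begin instant, so an extra verification upsert committed in between bumps the observed count from the expected 65 to 66. A hypothetical Java sketch of incremental-read semantics follows; the class and method names are illustrative, not Hudi's API:

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalReadSketch {
    // A commit writes some number of records at a monotonically increasing instant time.
    static final class Commit {
        final long instant;
        final int recordCount;
        Commit(long instant, int recordCount) {
            this.instant = instant;
            this.recordCount = recordCount;
        }
    }

    // Incremental read: count records from all commits strictly after beginInstant.
    static int incrementalCount(List<Commit> timeline, long beginInstant) {
        int sum = 0;
        for (Commit c : timeline) {
            if (c.instant > beginInstant) sum += c.recordCount;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Commit> timeline = new ArrayList<>();
        timeline.add(new Commit(1, 100)); // initial bulk write
        timeline.add(new Commit(2, 65));  // the test expects exactly these 65 records
        System.out.println(incrementalCount(timeline, 1)); // 65, as asserted

        // An unrelated verification upsert sneaks in before the incremental read...
        timeline.add(new Commit(3, 1));
        System.out.println(incrementalCount(timeline, 1)); // now 66: the flaky assertion
    }
}
```

This is why the fix direction in the ticket is to either perform the verification upsert after the incremental read, or bound the read with an end instant.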
[jira] [Updated] (HUDI-1487) after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly
[ https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1487: Status: Open (was: New) > after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random > -- > > Key: HUDI-1487 > URL: https://issues.apache.org/jira/browse/HUDI-1487 > Project: Apache Hudi > Issue Type: Bug >Reporter: liwei >Assignee: liwei >Priority: Major > > > > TestCOWDataSource.testCopyOnWriteStorage will failed random. Because before > the incremental read, add a new upsert commit. > // pull the latest commit > val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, > DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) > .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) > .load(basePath) > the new commit is : > // Upsert based on the written table with Hudi metadata columns > val verificationRowKey = > snapshotDF1.limit(1).select("_row_key").first.getString(0) > as verificationRowKey will contains in "uniqueKeyCnt", so will failed as : > "expected: <65> but was: <66>" > > > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) > at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) > ... 30 more > [WARN ] 2020-12-22 12:32:40,788 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource > [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage > Time elapsed: 15.275 s <<< FAILURE! 
> org.opentest4j.AssertionFailedError: expected: <65> but was: <66> > at > org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) > [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap > [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base > File Only View. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1487) after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly
[ https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1487: Status: In Progress (was: Open) > after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random > -- > > Key: HUDI-1487 > URL: https://issues.apache.org/jira/browse/HUDI-1487 > Project: Apache Hudi > Issue Type: Bug >Reporter: liwei >Assignee: liwei >Priority: Major > > > > TestCOWDataSource.testCopyOnWriteStorage will failed random. Because before > the incremental read, add a new upsert commit. > // pull the latest commit > val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, > DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) > .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) > .load(basePath) > the new commit is : > // Upsert based on the written table with Hudi metadata columns > val verificationRowKey = > snapshotDF1.limit(1).select("_row_key").first.getString(0) > as verificationRowKey will contains in "uniqueKeyCnt", so will failed as : > "expected: <65> but was: <66>" > > > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) > at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) > ... 30 more > [WARN ] 2020-12-22 12:32:40,788 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource > [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage > Time elapsed: 15.275 s <<< FAILURE! 
> org.opentest4j.AssertionFailedError: expected: <65> but was: <66> > at > org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) > [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap > [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base > File Only View. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1487) after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly
[ https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1487: Description: TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert commit is added before the incremental read. // pull the latest commit val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) .load(basePath) the new commit is: // Upsert based on the written table with Hudi metadata columns val verificationRowKey = snapshotDF1.limit(1).select("_row_key").first.getString(0) since verificationRowKey is included in "uniqueKeyCnt", the test fails with: "expected: <65> but was: <66>" org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) ... 30 more [WARN ] 2020-12-22 12:32:40,788 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage Time elapsed: 15.275 s <<< FAILURE! org.opentest4j.AssertionFailedError: expected: <65> but was: <66> at org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base File Only View. 
[WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base File Only View. [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base File Only View. > after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random > -- > > Key: HUDI-1487 > URL: https://issues.apache.org/jira/browse/HUDI-1487 > Project: Apache Hudi > Issue Type: Bug >Reporter: liwei >Assignee: liwei >Priority: Major > > > > TestCOWDataSource.testCopyOnWriteStorage will failed random. Because before > the incremental read, add a new upsert commit. > // pull the latest commit > val hoodieIncViewDF2 = spark.read.format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, > DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) > .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2) > .load(basePath) > the new commit is : > // Upsert based on the written table with Hudi metadata columns > val verificationRowKey = > snapshotDF1.limit(1).select("_row_key").first.getString(0) > as verificationRowKey will contains in "uniqueKeyCnt", so will failed as : > "expected: <65> but was: <66>" > > > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173) > at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275) > ... 
30 more > [WARN ] 2020-12-22 12:32:40,788 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource > [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage > Time elapsed: 15.275 s <<< FAILURE! > org.opentest4j.AssertionFailedError: expected: <65> but was: <66> > at > org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160) > [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap > [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base > File Only View. > [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading
[jira] [Created] (HUDI-1487) after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly
liwei created HUDI-1487: --- Summary: after HUDI-1376 merged, unit test testCopyOnWriteStorage fails randomly Key: HUDI-1487 URL: https://issues.apache.org/jira/browse/HUDI-1487 Project: Apache Hudi Issue Type: Bug Reporter: liwei Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1483) async clustering for deltastreamer
liwei created HUDI-1483: --- Summary: async clustering for deltastreamer Key: HUDI-1483 URL: https://issues.apache.org/jira/browse/HUDI-1483 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1482) async compaction for spark streaming
liwei created HUDI-1482: --- Summary: async compaction for spark streaming Key: HUDI-1482 URL: https://issues.apache.org/jira/browse/HUDI-1482 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: liwei Assignee: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1482) async clustering for spark streaming
[ https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1482: Summary: async clustering for spark streaming (was: async compaction for spark streaming) > async clustering for spark streaming > > > Key: HUDI-1482 > URL: https://issues.apache.org/jira/browse/HUDI-1482 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1399) support an independent clustering spark job to run clustering asynchronously
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1399: Status: In Progress (was: Open) > support a independent clustering spark job to asynchronously clustering > > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1399) support an independent clustering spark job to run clustering asynchronously
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1399: Summary: support a independent clustering spark job to asynchronously clustering (was: support clustering operation can run asynchronously) > support a independent clustering spark job to asynchronously clustering > > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1481: Status: In Progress (was: Open) > support inline clustering unit tests for spark datasource and deltastreamer > --- > > Key: HUDI-1481 > URL: https://issues.apache.org/jira/browse/HUDI-1481 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1481: Status: Open (was: New) > support inline clustering unit tests for spark datasource and deltastreamer > --- > > Key: HUDI-1481 > URL: https://issues.apache.org/jira/browse/HUDI-1481 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1472) support inline clustering unit tests for spark datasource and deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei closed HUDI-1472. --- Resolution: Fixed > support inline clustering unit tests for spark datasource and deltastreamer > --- > > Key: HUDI-1472 > URL: https://issues.apache.org/jira/browse/HUDI-1472 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1481) support inline clustering unit tests for spark datasource and deltastreamer
liwei created HUDI-1481: --- Summary: support inline clustering unit tests for spark datasource and deltastreamer Key: HUDI-1481 URL: https://issues.apache.org/jira/browse/HUDI-1481 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: liwei Assignee: liwei Fix For: 0.7.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1472) support inline clustering unit tests for spark datasource and deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1472: Summary: support inline clustering unit tests for spark datasource and deltastreamer (was: support inline clustering for spark datasource) > support inline clustering unit tests for spark datasource and deltastreamer > --- > > Key: HUDI-1472 > URL: https://issues.apache.org/jira/browse/HUDI-1472 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1472) support inline clustering for spark datasource
[ https://issues.apache.org/jira/browse/HUDI-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1472: Status: Open (was: New) > support inline clustering for spark datasource > -- > > Key: HUDI-1472 > URL: https://issues.apache.org/jira/browse/HUDI-1472 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1472) support inline clustering for spark datasource
liwei created HUDI-1472: --- Summary: support inline clustering for spark datasource Key: HUDI-1472 URL: https://issues.apache.org/jira/browse/HUDI-1472 Project: Apache Hudi Issue Type: Task Components: Spark Integration Reporter: liwei Assignee: liwei Fix For: 0.7.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1399) support running the clustering operation asynchronously
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1399: Status: Open (was: New) > support clustering operation can run asynchronously > --- > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1399) support running the clustering operation asynchronously
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250765#comment-17250765 ] liwei commented on HUDI-1399: - [~vinoth] code freeze is Dec 31? Just like asynchronous compaction, there are four options 1. option one: inline clustering in Spark. https://github.com/apache/hudi/pull/2263/files has a base implementation, but does not yet support running in Spark [~satishkotha] 2. option two: support an independent clustering Spark job to run clustering asynchronously, just like HoodieCompactor 3. option three: Hudi CLI support for clustering 4. option four: DeltaStreamer continuous mode support for clustering. For functional coverage, I think we can first support options one and two. As https://github.com/apache/hudi/pull/2263/files has not been merged, I can land these two on the satishkotha:sk/clustering branch. I plan to do it this weekend and submit a PR next week. [~vinoth] what do you think? Does my plan conflict with yours? [~satishkotha] cc [~nagarwal] > support clustering operation can run asynchronously > --- > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1399) support running the clustering operation asynchronously
[ https://issues.apache.org/jira/browse/HUDI-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250424#comment-17250424 ] liwei commented on HUDI-1399: - [~vinoth] I plan to begin it next week > support clustering operation can run asynchronously > --- > > Key: HUDI-1399 > URL: https://issues.apache.org/jira/browse/HUDI-1399 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.7.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1456) Concurrent Writing to Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248711#comment-17248711 ] liwei commented on HUDI-1456: - [~nishith29] Great, this is getting started. If there are any independent tasks, I'm happy to take one. :D > Concurrent Writing to Hudi tables > - > > Key: HUDI-1456 > URL: https://issues.apache.org/jira/browse/HUDI-1456 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.8.0 > > Attachments: image-2020-12-14-09-48-46-946.png > > > This ticket tracks all the changes needed to support concurrency control for > Hudi tables. This work will be done in multiple phases. > # Parallel writing to Hudi tables support -> This feature will allow users > to have multiple writers mutate the tables without the ability to perform > concurrent update to the same file. > # Concurrency control at file/record level -> This feature will allow users > to have multiple writers mutate the tables with the ability to ensure > serializability at record level. -- This message was sent by Atlassian Jira (v8.3.4#803005)
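Phase 1 in the quoted ticket (multiple writers, but no concurrent updates to the same file) amounts to an optimistic check at commit time: two concurrent writers conflict exactly when the sets of file groups they touched intersect. A hedged Java sketch of that check; the names are illustrative, not Hudi's actual concurrency-control API:

```java
import java.util.HashSet;
import java.util.Set;

public class FileLevelConflictCheck {
    // Returns true when two concurrent writers touched at least one common file
    // group, in which case the later committer must abort (or retry).
    static boolean conflicts(Set<String> filesTouchedByA, Set<String> filesTouchedByB) {
        Set<String> overlap = new HashSet<>(filesTouchedByA);
        overlap.retainAll(filesTouchedByB); // set intersection
        return !overlap.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> writerA = Set.of("fg-001", "fg-002");
        Set<String> writerB = Set.of("fg-003");           // disjoint: both commits succeed
        Set<String> writerC = Set.of("fg-002", "fg-004"); // overlaps writerA on fg-002
        System.out.println(conflicts(writerA, writerB)); // false
        System.out.println(conflicts(writerA, writerC)); // true
    }
}
```

Phase 2 (record-level serializability) would tighten the same idea from file-group granularity down to individual keys.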
[jira] [Comment Edited] (HUDI-1456) Concurrent Writing to Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248711#comment-17248711 ] liwei edited comment on HUDI-1456 at 12/14/20, 1:49 AM: [~nishith29] Good news, this is getting started. If there are any independent tasks, I'm happy to take one. :D was (Author: 309637554): [~nishith29] Great , it will begin. If have some independent task, I’m happy to take it. :D > Concurrent Writing to Hudi tables > - > > Key: HUDI-1456 > URL: https://issues.apache.org/jira/browse/HUDI-1456 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.8.0 > > Attachments: image-2020-12-14-09-48-46-946.png > > > This ticket tracks all the changes needed to support concurrency control for > Hudi tables. This work will be done in multiple phases. > # Parallel writing to Hudi tables support -> This feature will allow users > to have multiple writers mutate the tables without the ability to perform > concurrent update to the same file. > # Concurrency control at file/record level -> This feature will allow users > to have multiple writers mutate the tables with the ability to ensure > serializability at record level. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1456) Concurrent Writing to Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1456: Attachment: image-2020-12-14-09-48-46-946.png > Concurrent Writing to Hudi tables > - > > Key: HUDI-1456 > URL: https://issues.apache.org/jira/browse/HUDI-1456 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.8.0 > > Attachments: image-2020-12-14-09-48-46-946.png > > > This ticket tracks all the changes needed to support concurrency control for > Hudi tables. This work will be done in multiple phases. > # Parallel writing to Hudi tables support -> This feature will allow users > to have multiple writers mutate the tables without the ability to perform > concurrent update to the same file. > # Concurrency control at file/record level -> This feature will allow users > to have multiple writers mutate the tables with the ability to ensure > serializability at record level. -- This message was sent by Atlassian Jira (v8.3.4#803005)
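The two phases listed in the HUDI-1456 description — parallel writers that must not update the same file, then record-level serializability — can be illustrated with a minimal optimistic-concurrency sketch. All names below are hypothetical stand-ins, not Hudi's actual writer or timeline API: each writer snapshots the timeline when it starts, and its commit is rejected if any writer that committed after that snapshot touched an overlapping file group.

```python
# Minimal optimistic concurrency control sketch (hypothetical, not Hudi's API):
# a writer snapshots the committed-instant count at start, and at commit time
# its touched file IDs are checked against commits made after that snapshot.

class SimpleTimeline:
    def __init__(self):
        self.commits = []  # list of (instant, frozenset of touched file IDs)

    def begin(self):
        return len(self.commits)  # snapshot: commits visible when the writer starts

    def try_commit(self, snapshot, file_ids):
        file_ids = frozenset(file_ids)
        # Phase-1 rule: conflict if any commit after our snapshot touched the same files.
        for _, committed in self.commits[snapshot:]:
            if committed & file_ids:
                return False  # abort: concurrent update to the same file group
        self.commits.append((len(self.commits), file_ids))
        return True

timeline = SimpleTimeline()
w1 = timeline.begin()
w2 = timeline.begin()
assert timeline.try_commit(w1, {"fg-1", "fg-2"})      # first writer commits
assert not timeline.try_commit(w2, {"fg-2", "fg-3"})  # overlaps fg-2 -> rejected
assert timeline.try_commit(w2, {"fg-3"})              # disjoint files commit fine
```

Phase 2 (record-level serializability) would refine the conflict check from file-group granularity down to individual record keys; the structure of the check stays the same.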
[jira] [Created] (HUDI-1454) in unit test have error as Error reading clustering plan 006
liwei created HUDI-1454: --- Summary: in unit test have error as Error reading clustering plan 006 Key: HUDI-1454 URL: https://issues.apache.org/jira/browse/HUDI-1454 Project: Apache Hudi Issue Type: Sub-task Reporter: liwei Assignee: liwei
https://travis-ci.com/github/apache/hudi/jobs/458936905
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.245 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
[INFO] Running org.apache.hudi.table.action.compact.TestAsyncCompaction
[WARN ] 2020-12-12 15:13:43,814 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:13:50,370 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:02,285 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:08,596 org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system instance used in previous test-run
[WARN ] 2020-12-12 15:14:16,857 org.apache.hudi.common.util.ClusteringUtils - No content found in requested file for instant [==>006__replacecommit__REQUESTED]
[WARN ] 2020-12-12 15:14:16,861 org.apache.hudi.common.util.ClusteringUtils - No content found in requested file for instant [==>006__replacecommit__REQUESTED]
[ERROR] 2020-12-12 15:14:16,919 org.apache.hudi.timeline.service.FileSystemViewHandler - Got runtime exception servicing request partition=2015%2F03%2F17=%2Ftmp%2Fjunit7781027189613842524%2Fdataset=005=ba1d2bb94a4b1d1e6e294e77086957b6c7c43b5a306e36cba6bbaa955a0ed8ce
org.apache.hudi.exception.HoodieIOException: Error reading clustering plan 006
 at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:85)
 at org.apache.hudi.common.util.ClusteringUtils.lambda$getAllPendingClusteringPlans$0(ClusteringUtils.java:67)
 at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
 at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
 at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
 at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
 at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
 at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
 at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
 at org.apache.hudi.common.util.ClusteringUtils.getAllFileGroupsInPendingClusteringPlans(ClusteringUtils.java:100)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:111)
 at org.apache.hudi.common.table.view.RocksDbBasedFileSystemView.init(RocksDbBasedFileSystemView.java:91)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.runSync(AbstractTableFileSystemView.java:1077)
 at org.apache.hudi.common.table.view.IncrementalTimelineSyncFileSystemView.runSync(IncrementalTimelineSyncFileSystemView.java:97)
 at org.apache.hudi.common.table.view.AbstractTableFileSystemView.sync(AbstractTableFileSystemView.java:1059)
 at org.apache.hudi.timeline.service.FileSystemViewHandler.syncIfLocalViewBehind(FileSystemViewHandler.java:124)
 at org.apache.hudi.timeline.service.FileSystemViewHandler.access$100(FileSystemViewHandler.java:55)
 at org.apache.hudi.timeline.service.FileSystemViewHandler$ViewHandler.handle(FileSystemViewHandler.java:338)
 at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
 at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606)
 at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46)
 at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17)
 at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143)
 at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41)
 at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107)
 at io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1668)
 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
 at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:61)
 at
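The trace above shows `getClusteringPlan` throwing when a REQUESTED replacecommit exists whose plan file has no content yet (note the preceding "No content found in requested file" warnings). One defensive option is to treat an empty plan file as "no plan written yet" rather than failing the whole file-system-view build. The following is a hedged Python sketch of that idea with made-up helpers; it is not Hudi's actual `ClusteringUtils` logic, which deserializes Avro in Java:

```python
# Sketch: treat an empty requested-replacecommit file as "plan not flushed yet"
# instead of raising, so view initialization does not fail mid-write.
# All names are hypothetical; Hudi's real code lives in ClusteringUtils (Java).

def read_clustering_plan(instant_content: bytes):
    """Return a parsed plan dict, or None when the file is empty / not yet written."""
    if not instant_content:
        return None  # requested instant exists but plan bytes are not there yet
    # real code would deserialize an Avro plan; we fake it with a simple split
    instant, files = instant_content.decode().split(":", 1)
    return {"instant": instant, "file_groups": files.split(",")}

def pending_clustering_file_groups(instant_contents):
    groups = set()
    for content in instant_contents:
        plan = read_clustering_plan(content)
        if plan is None:
            continue  # skip instants whose plan is not readable yet
        groups.update(plan["file_groups"])
    return groups

# An empty requested instant no longer aborts the whole view build:
groups = pending_clustering_file_groups([b"", b"006:fg-1,fg-2"])
assert groups == {"fg-1", "fg-2"}
```

Whether to skip or fail here is a design choice: skipping keeps readers available during a racy write, at the cost of briefly not seeing the pending clustering plan.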
[jira] [Assigned] (HUDI-1448) hudi dla sync skip rt create
[ https://issues.apache.org/jira/browse/HUDI-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1448: --- Assignee: liwei > hudi dla sync skip rt create > - > > Key: HUDI-1448 > URL: https://issues.apache.org/jira/browse/HUDI-1448 > Project: Apache Hudi > Issue Type: Sub-task > Components: Hive Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1448) hudi dla sync skip rt create
liwei created HUDI-1448: --- Summary: hudi dla sync skip rt create Key: HUDI-1448 URL: https://issues.apache.org/jira/browse/HUDI-1448 Project: Apache Hudi Issue Type: Sub-task Components: Hive Integration Reporter: liwei -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning
[ https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei closed HUDI-1349. --- > spark sql support overwrite use replace action with dynamic partitioning > - > > Key: HUDI-1349 > URL: https://issues.apache.org/jira/browse/HUDI-1349 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > > now spark sql overwrite just do like this. > } else if (mode == SaveMode.Overwrite && tableExists) { > log.warn(s"hoodie table at $tablePath already exists. Deleting existing data > & overwriting with new data.") > fs.delete(tablePath, true) > tableExists = false > } > overwrite need to use replace action > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning
[ https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reopened HUDI-1349: - > spark sql support overwrite use replace action with dynamic partitioning > - > > Key: HUDI-1349 > URL: https://issues.apache.org/jira/browse/HUDI-1349 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > > now spark sql overwrite just do like this. > } else if (mode == SaveMode.Overwrite && tableExists) { > log.warn(s"hoodie table at $tablePath already exists. Deleting existing data > & overwriting with new data.") > fs.delete(tablePath, true) > tableExists = false > } > overwrite need to use replace action > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning
[ https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei resolved HUDI-1349. - Resolution: Fixed > spark sql support overwrite use replace action with dynamic partitioning > - > > Key: HUDI-1349 > URL: https://issues.apache.org/jira/browse/HUDI-1349 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > > now spark sql overwrite just do like this. > } else if (mode == SaveMode.Overwrite && tableExists) { > log.warn(s"hoodie table at $tablePath already exists. Deleting existing data > & overwriting with new data.") > fs.delete(tablePath, true) > tableExists = false > } > overwrite need to use replace action > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning
[ https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1349: Status: Closed (was: Patch Available) > spark sql support overwrite use replace action with dynamic partitioning > - > > Key: HUDI-1349 > URL: https://issues.apache.org/jira/browse/HUDI-1349 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Major > Labels: pull-request-available > > now spark sql overwrite just do like this. > } else if (mode == SaveMode.Overwrite && tableExists) { > log.warn(s"hoodie table at $tablePath already exists. Deleting existing data > & overwriting with new data.") > fs.delete(tablePath, true) > tableExists = false > } > overwrite need to use replace action > -- This message was sent by Atlassian Jira (v8.3.4#803005)
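The snippet quoted in HUDI-1349 handles `SaveMode.Overwrite` by physically deleting the table path, which destroys history and is not atomic. The replace action instead records in the timeline which existing file groups are superseded, so snapshot readers skip them while the old files and older commits remain intact. A hedged Python sketch of that logical-replace idea follows; the class and method names are hypothetical, not Hudi's implementation:

```python
# Sketch of "overwrite via replace action": instead of fs.delete(tablePath),
# a replacecommit records which file groups are superseded; readers filter them.
# Hypothetical model, not Hudi's actual timeline code.

class Table:
    def __init__(self):
        self.file_groups = {}   # file_group_id -> rows (still present on storage)
        self.replaced = set()   # file groups logically removed by replacecommits

    def insert(self, fg, rows):
        self.file_groups[fg] = rows

    def insert_overwrite(self, fg, rows):
        # logical overwrite: mark every live file group replaced, write the new one
        self.replaced.update(self.file_groups.keys() - self.replaced)
        self.file_groups[fg] = rows

    def read(self):
        # snapshot read skips replaced file groups; old files still exist on disk
        return sorted(r for fg, rows in self.file_groups.items()
                      if fg not in self.replaced for r in rows)

t = Table()
t.insert("fg-1", [1, 2])
t.insert_overwrite("fg-2", [3])
assert t.read() == [3]            # only the overwriting data is visible
assert "fg-1" in t.file_groups    # but nothing was physically deleted
```

Because the old file groups survive, cleaning can remove them later under retention policies, and time-travel or incremental reads over the pre-overwrite commits stay possible — neither of which the `fs.delete` approach allows.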
[jira] [Assigned] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking
[ https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1437: --- Assignee: liwei > some description in spark ui is not reality, Not good for performance > tracking > --- > > Key: HUDI-1437 > URL: https://issues.apache.org/jira/browse/HUDI-1437 > Project: Apache Hudi > Issue Type: Bug > Components: Performance >Reporter: liwei >Assignee: liwei >Priority: Major > Attachments: image-2020-12-07-23-50-57-212.png > > > some spark action in hudi ,not set the real description, it is not good for > performance tracking > > !image-2020-12-07-23-50-57-212.png|width=693,height=375! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking
[ https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei updated HUDI-1437: Description: some spark action in hudi ,not set the real description, it is not good for performance tracking !image-2020-12-07-23-50-57-212.png|width=693,height=375! was: some spark action in hudi ,not set the real description, it is not good for performance tracking !image-2020-12-07-23-50-57-212.png! > some description in spark ui is not reality, Not good for performance > tracking > --- > > Key: HUDI-1437 > URL: https://issues.apache.org/jira/browse/HUDI-1437 > Project: Apache Hudi > Issue Type: Bug > Components: Performance >Reporter: liwei >Priority: Major > Attachments: image-2020-12-07-23-50-57-212.png > > > some spark action in hudi ,not set the real description, it is not good for > performance tracking > > !image-2020-12-07-23-50-57-212.png|width=693,height=375! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking
liwei created HUDI-1437: --- Summary: some description in spark ui is not reality, Not good for performance tracking Key: HUDI-1437 URL: https://issues.apache.org/jira/browse/HUDI-1437 Project: Apache Hudi Issue Type: Bug Components: Performance Reporter: liwei Attachments: image-2020-12-07-23-50-57-212.png some spark action in hudi ,not set the real description, it is not good for performance tracking !image-2020-12-07-23-50-57-212.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
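The fix HUDI-1437 asks for is to attach a meaningful description to each Spark action so the UI shows what Hudi is doing (index lookup, workload profiling, write) instead of a generic call site. Spark's real hook for this is `SparkContext.setJobDescription`. The sketch below wraps that pattern in a context manager; a dummy context object stands in for a real `SparkContext` so the example runs without a cluster, and everything except the `setJobDescription` method name is hypothetical:

```python
# Sketch: wrap each Hudi stage in a context manager that sets a meaningful
# job description, so the Spark UI would show e.g. "upsert: index lookup".
# DummySparkContext stands in for a real SparkContext; only setJobDescription
# mirrors the real API. Restoring the previous value keeps later jobs honest.

from contextlib import contextmanager

class DummySparkContext:
    def __init__(self):
        self.description = None
        self.log = []  # records (job name, description at launch time)

    def setJobDescription(self, desc):  # same name as the real SparkContext API
        self.description = desc

    def run_job(self, name):
        self.log.append((name, self.description))

@contextmanager
def job_described(sc, desc):
    previous = sc.description
    sc.setJobDescription(desc)
    try:
        yield
    finally:
        sc.setJobDescription(previous)  # restore so later jobs are not mislabeled

sc = DummySparkContext()
with job_described(sc, "upsert: index lookup"):
    sc.run_job("collect")
sc.run_job("count")
assert sc.log == [("collect", "upsert: index lookup"), ("count", None)]
```

The restore-on-exit step matters for the ticket's complaint: without it, a description set for one action leaks onto unrelated later jobs, which is just as misleading for performance tracking as no description at all.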
[jira] [Assigned] (HUDI-1076) CLI tools to support clustering
[ https://issues.apache.org/jira/browse/HUDI-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liwei reassigned HUDI-1076: --- Assignee: liwei > CLI tools to support clustering > --- > > Key: HUDI-1076 > URL: https://issues.apache.org/jira/browse/HUDI-1076 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: liwei >Priority: Major > > 1) schedule clustering > 2) complete clustering > 3) cancel clustering > 4) rollback clustering -- This message was sent by Atlassian Jira (v8.3.4#803005)
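The four operations HUDI-1076 lists (schedule, complete, cancel, rollback) describe a small lifecycle over clustering instants: schedule creates a REQUESTED replacecommit, complete transitions it to COMPLETED, cancel discards a still-pending plan, and rollback undoes a completed one. The sketch below models that state machine in plain Python; it is a hypothetical illustration of the lifecycle the CLI commands would drive, not the actual hudi-cli implementation:

```python
# Sketch of the clustering lifecycle behind the four CLI operations above:
# schedule -> REQUESTED, complete -> COMPLETED, cancel discards a pending plan,
# rollback undoes a completed one. Hypothetical model, not hudi-cli code.

class ClusteringOps:
    def __init__(self):
        self.instants = {}  # instant time -> state

    def schedule(self, instant):
        if instant in self.instants:
            raise ValueError(f"{instant} already scheduled")
        self.instants[instant] = "REQUESTED"

    def complete(self, instant):
        if self.instants.get(instant) != "REQUESTED":
            raise ValueError(f"{instant} is not pending")
        self.instants[instant] = "COMPLETED"

    def cancel(self, instant):
        # cancel only makes sense while the plan is still pending
        if self.instants.get(instant) == "REQUESTED":
            del self.instants[instant]

    def rollback(self, instant):
        # rollback undoes a completed clustering, removing its instant
        if self.instants.get(instant) == "COMPLETED":
            del self.instants[instant]

ops = ClusteringOps()
ops.schedule("006")
ops.complete("006")
ops.rollback("006")
assert "006" not in ops.instants
```

Keeping cancel and rollback as distinct commands mirrors the distinction in the ticket: one discards a plan that never ran, the other has to undo work that already produced replaced file groups.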