[GitHub] [hudi] ad1happy2go commented on issue #7363: [SUPPORT] how to get hudi table schema and get table list under the same database

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #7363:
URL: https://github.com/apache/hudi/issues/7363#issuecomment-1529375213

   @zengqinchris Closing this issue since the explanation and a workaround have 
been provided above. Please reopen if you run into any further issues on the same topic.





[GitHub] [hudi] ad1happy2go commented on issue #7249: [SUPPORT] How to run cleaner table service on DFS source of DeltaStreamer ?

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #7249:
URL: https://github.com/apache/hudi/issues/7249#issuecomment-1529367537

   @rtdt99 Closing this issue since the comment above clarifies it. Please reopen 
if you have any further concerns on the same.





[GitHub] [hudi] ad1happy2go commented on issue #7242: [SUPPORT] Partition field value lost in table column

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #7242:
URL: https://github.com/apache/hudi/issues/7242#issuecomment-1529365385

   @Priyanka128 @ROOBALJINDAL The correct way is to add the new date column 
before writing to Hudi. You can use the "SqlQueryBasedTransformer" for that.
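
   For illustration, a minimal sketch of how that could be wired up in Deltastreamer 
(hedged example: the source field `ts` and the derived column `created_date` are made 
up; only the transformer class and the `hoodie.deltastreamer.transformer.sql` property 
come from Hudi itself):

       --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer

       # in the properties file passed via --props
       hoodie.deltastreamer.transformer.sql=SELECT s.*, DATE_FORMAT(s.ts, 'yyyy-MM-dd') AS created_date FROM <SRC> s

   The query runs against the incoming batch (exposed as `<SRC>`), so the new 
`created_date` column exists before the Hudi write and can be used as the partition field.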





[GitHub] [hudi] hudi-bot commented on pull request #8611: [HUDI-6157] Fix potential data loss for flink streaming source from table with multi writer

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8611:
URL: https://github.com/apache/hudi/pull/8611#issuecomment-1529347685

   
   ## CI report:
   
   * b184b111c6928408d082ce73486f5bd3ae7c6683 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16767)
 
   * 8a10affd53d66b88abd116587e6dd5e0c43e542a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16772)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8611: [HUDI-6157] Fix potential data loss for flink streaming source from table with multi writer

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8611:
URL: https://github.com/apache/hudi/pull/8611#issuecomment-1529344411

   
   ## CI report:
   
   * b184b111c6928408d082ce73486f5bd3ae7c6683 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16767)
 
   * 8a10affd53d66b88abd116587e6dd5e0c43e542a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] ad1happy2go commented on issue #7141: [SUPPORT] Question on Bootstrapped hudi table

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #7141:
URL: https://github.com/apache/hudi/issues/7141#issuecomment-1529333245

   @rtdt99 Currently HoodieSnapshotExporter doesn't provide that functionality.





[GitHub] [hudi] hudi-bot commented on pull request #8613: [HUDI-6158] Strengthen Flink clustering commit and rollback strategy

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8613:
URL: https://github.com/apache/hudi/pull/8613#issuecomment-1529319205

   
   ## CI report:
   
   * ff24199cad215049cc4274aae3a1008bf7053c90 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16771)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8613: [HUDI-6158] Strengthen Flink clustering commit and rollback strategy

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8613:
URL: https://github.com/apache/hudi/pull/8613#issuecomment-1529315186

   
   ## CI report:
   
   * ff24199cad215049cc4274aae3a1008bf7053c90 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Commented] (HUDI-2751) To avoid the duplicates for streaming read MOR table

2023-04-30 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718151#comment-17718151
 ] 

Danny Chen commented on HUDI-2751:
--

> So, no records from the new base parquet file created from compaction will be 
> served with incremental read

That's true. To optimize it further, the parquet files produced by compaction and 
clustering could be skipped for incremental sources; we have already implemented 
that for the Flink streaming reader by adding two options:
{code:java}
read.streaming.skip_compaction
read.streaming.skip_clustering{code}
I think we can close this out.
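
For reference, a minimal Flink SQL sketch of a streaming read with both options enabled (the table schema and path below are made up for illustration; only the option keys are taken from Hudi):
{code:sql}
CREATE TABLE hudi_stream_source (
  uuid STRING,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/mor_table',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',
  'read.streaming.skip_clustering' = 'true'
);
{code}
With these set, the streaming reader only consumes delta commits and ignores the file slices produced by compaction and clustering, so data already consumed from earlier commits is not emitted again.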

> To avoid the duplicates for streaming read MOR table
> 
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: sivabalan narayanan
>Priority: Critical
>
> Imagine there are commits on the timeline:
> {noformat}
> --- delta-99 --- commit 100 (includes the 99 delta data set) --- delta-101 --- delta-102 ---
>                 first read ->|                second read ->
> -------- range 1 -----------|-------- range 2 ------------------------------------------|
> {noformat}
> Instants 99, 101 and 102 are successful non-compaction delta commits;
> instant 100 is a successful compaction instant.
> The first incremental read consumes up to instant 99 and the second read consumes
> from instant 100 to instant 102, so the second read would consume the commit files
> of instant 100, which have already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction
> instant is scheduled and then completes within *one* consume range.





[jira] [Closed] (HUDI-2751) To avoid the duplicates for streaming read MOR table

2023-04-30 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-2751.

Fix Version/s: 0.12.0, 0.11.0
   Resolution: Fixed

> To avoid the duplicates for streaming read MOR table
> 
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0, 0.11.0
>
>
> Imagine there are commits on the timeline:
> {noformat}
> --- delta-99 --- commit 100 (includes the 99 delta data set) --- delta-101 --- delta-102 ---
>                 first read ->|                second read ->
> -------- range 1 -----------|-------- range 2 ------------------------------------------|
> {noformat}
> Instants 99, 101 and 102 are successful non-compaction delta commits;
> instant 100 is a successful compaction instant.
> The first incremental read consumes up to instant 99 and the second read consumes
> from instant 100 to instant 102, so the second read would consume the commit files
> of instant 100, which have already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction
> instant is scheduled and then completes within *one* consume range.





[jira] [Updated] (HUDI-6158) Strengthen Flink clustering commit and rollback strategy

2023-04-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6158:
-
Labels: pull-request-available  (was: )

> Strengthen Flink clustering commit and rollback strategy
> 
>
> Key: HUDI-6158
> URL: https://issues.apache.org/jira/browse/HUDI-6158
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> `ClusteringCommitSink` could strengthen its commit and rollback strategy in two
> ways:
>  * Commit: Introduce a `clusteringPlanCache` that caches the clustering plan for
> each instant, i.e. the mapping of instant_time -> clusteringPlan.
>  * Rollback: Update `commitBuffer` to store the mapping of instant_time ->
> file_ids -> event. A map is used to collect the events because rolling back
> intermediate clustering tasks generates corrupt events.





[GitHub] [hudi] SteNicholas opened a new pull request, #8613: [HUDI-6158] Strengthen Flink clustering commit and rollback strategy

2023-04-30 Thread via GitHub


SteNicholas opened a new pull request, #8613:
URL: https://github.com/apache/hudi/pull/8613

   ### Change Logs
   
   `ClusteringCommitSink` could strengthen its commit and rollback strategy in two 
ways (see the sketch below):
   
   - Commit: Introduce a `clusteringPlanCache` that caches the clustering plan for 
each instant, i.e. the mapping of instant_time -> clusteringPlan.
   - Rollback: Update `commitBuffer` to store the mapping of instant_time -> 
file_ids -> event. A map is used to collect the events because rolling back 
intermediate clustering tasks generates corrupt events.
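
   As a rough illustration (not the actual `ClusteringCommitSink` code; the class 
name and generic parameters below are placeholders), the two structures look like this:

       import java.util.HashMap;
       import java.util.Map;

       // Sketch of the state described above; PlanT/EventT stand in for Hudi's
       // clustering plan and clustering commit event types.
       class ClusteringCommitState<PlanT, EventT> {
         // instant_time -> clustering plan, read once and cached instead of being
         // re-read for every commit event received
         private final Map<String, PlanT> clusteringPlanCache = new HashMap<>();
         // instant_time -> (file_id -> event); keying by file id lets a retried
         // task's event replace the corrupt one from a rolled-back attempt
         private final Map<String, Map<String, EventT>> commitBuffer = new HashMap<>();

         void bufferEvent(String instantTime, String fileId, EventT event) {
           commitBuffer.computeIfAbsent(instantTime, k -> new HashMap<>()).put(fileId, event);
         }
       }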
   
   ### Impact
   
   The clustering commit and rollback strategy is improved. When the number of 
file groups contained in the clustering plan is relatively large, it is very 
expensive to re-read the clustering plan for each event received, so the plan is 
now cached per instant. Meanwhile, rolling back intermediate clustering tasks 
could generate corrupt events, so the events are collected via the map keyed by 
file id.
   
   ### Risk level (write none, low medium or high below)
   
   none.
   
   ### Documentation Update
   
   none.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed





[jira] [Created] (HUDI-6158) Strengthen Flink clustering commit and rollback strategy

2023-04-30 Thread Nicholas Jiang (Jira)
Nicholas Jiang created HUDI-6158:


 Summary: Strengthen Flink clustering commit and rollback strategy
 Key: HUDI-6158
 URL: https://issues.apache.org/jira/browse/HUDI-6158
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: Nicholas Jiang
Assignee: Nicholas Jiang
 Fix For: 0.14.0


`ClusteringCommitSink` could strengthen its commit and rollback strategy in two 
ways:
 * Commit: Introduce a `clusteringPlanCache` that caches the clustering plan for 
each instant, i.e. the mapping of instant_time -> clusteringPlan.
 * Rollback: Update `commitBuffer` to store the mapping of instant_time -> 
file_ids -> event. A map is used to collect the events because rolling back 
intermediate clustering tasks generates corrupt events.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #7826: [HUDI-5675] fix lazy clean schedule rollback on completed instant

2023-04-30 Thread via GitHub


nsivabalan commented on code in PR #7826:
URL: https://github.com/apache/hudi/pull/7826#discussion_r1181278240


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java:
##
@@ -912,6 +914,62 @@ public void testLazyRollbackOfFailedCommit(boolean 
rollbackUsingMarkers) throws
 assertEquals(numLogFilesAfterRollback, numLogFilesAfterCompaction);
   }
 
+  @Test
+  public void testGetInstantToRollbackForLazyCleanPolicy() throws Exception {
+Properties properties = new Properties();
+properties.setProperty(HoodieTableConfig.BASE_FILE_FORMAT.key(), 
HoodieTableConfig.BASE_FILE_FORMAT.defaultValue().toString());
+HoodieTableMetaClient metaClient = 
getHoodieMetaClient(HoodieTableType.MERGE_ON_READ, properties);
+
+HoodieWriteConfig cfg = getWriteConfig(true, false);
+HoodieWriteConfig autoCommitFalseCfg = getWriteConfig(false, false);
+autoCommitFalseCfg.setValue(CLIENT_HEARTBEAT_INTERVAL_IN_MS.key(), "2000");
+autoCommitFalseCfg.setValue(CLIENT_HEARTBEAT_NUM_TOLERABLE_MISSES.key(), 
"2");
+HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator();
+
+SparkRDDWriteClient client = getHoodieWriteClient(cfg);
+// Commit 1
+insertRecords(client, dataGen, "001");
+
+// Trigger an inflight commit 2
+SparkRDDWriteClient autoCommitFalseClient = 
getHoodieWriteClient(autoCommitFalseCfg);
+String commitTime2 = "002";
+autoCommitFalseClient.startCommitWithTime(commitTime2);
+    List<HoodieRecord> records = dataGen.generateInserts(commitTime2, 20);
+    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 1);
+    JavaRDD<WriteStatus> statuses = autoCommitFalseClient.upsert(writeRecords, commitTime2);
+
+// Stop updating heartbeat
+
autoCommitFalseClient.getHeartbeatClient().getHeartbeat(commitTime2).getTimer().cancel();
+// Sleep to make the heartbeat expired. In production env, heartbeat is 
expired because of commit
+Thread.sleep(4000);

Review Comment:
   Can we reduce this to 2 secs? Because the heartbeat expiry is set to 2 secs, 
right? Just trying to reduce the total test run time.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -707,20 +709,34 @@ protected List 
getInstantsToRollback(HoodieTableMetaClient metaClient, H
 }
   }).collect(Collectors.toList());
 } else if (cleaningPolicy.isLazy()) {
-  return inflightInstantsStream.filter(instant -> {
-try {
-  return heartbeatClient.isHeartbeatExpired(instant.getTimestamp());
-} catch (IOException io) {
-  throw new HoodieException("Failed to check heartbeat for instant " + 
instant, io);
-}
-  }).map(HoodieInstant::getTimestamp).collect(Collectors.toList());
+  return getInstantsToRollbackForLazyCleanPolicy(metaClient, 
inflightInstantsStream);
 } else if (cleaningPolicy.isNever()) {
   return Collections.emptyList();
 } else {
   throw new IllegalArgumentException("Invalid Failed Writes Cleaning 
Policy " + config.getFailedWritesCleanPolicy());
 }
   }
 
+  @VisibleForTesting

Review Comment:
   Is it not possible to test it at the write client layer? We would like to avoid 
using "VisibleForTesting" as much as possible.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -707,20 +709,34 @@ protected List 
getInstantsToRollback(HoodieTableMetaClient metaClient, H
 }
   }).collect(Collectors.toList());
 } else if (cleaningPolicy.isLazy()) {
-  return inflightInstantsStream.filter(instant -> {
-try {
-  return heartbeatClient.isHeartbeatExpired(instant.getTimestamp());
-} catch (IOException io) {
-  throw new HoodieException("Failed to check heartbeat for instant " + 
instant, io);
-}
-  }).map(HoodieInstant::getTimestamp).collect(Collectors.toList());
+  return getInstantsToRollbackForLazyCleanPolicy(metaClient, 
inflightInstantsStream);
 } else if (cleaningPolicy.isNever()) {
   return Collections.emptyList();
 } else {
   throw new IllegalArgumentException("Invalid Failed Writes Cleaning 
Policy " + config.getFailedWritesCleanPolicy());
 }
   }
 
+  @VisibleForTesting
+  public List<String> getInstantsToRollbackForLazyCleanPolicy(HoodieTableMetaClient metaClient,
+      Stream<HoodieInstant> inflightInstantsStream) {
+    // Get expired instants; they must be stored into a list before double-checking
+    List<HoodieInstant> expiredInstants = inflightInstantsStream.filter(instant -> {
+      try {
+        // An instant transformed from inflight to completed has no heartbeat file and will be detected as an expired instant here
+        return heartbeatClient.isHeartbeatExpired(instant.getTimestamp());
+  } catch (IOException io) {
+throw new HoodieException("Failed to 

[GitHub] [hudi] nsivabalan commented on pull request #7826: [HUDI-5675] fix lazy clean schedule rollback on completed instant

2023-04-30 Thread via GitHub


nsivabalan commented on PR #7826:
URL: https://github.com/apache/hudi/pull/7826#issuecomment-1529103489

   sorry, dropped the ball. reviewing it again. 
   





[jira] [Commented] (HUDI-3694) Not use magic number of next block to determine current log block

2023-04-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718089#comment-17718089
 ] 

sivabalan narayanan commented on HUDI-3694:
---

This might potentially be an issue only for HDFS-like storage schemes where 
appends are feasible. 

> Not use magic number of next block to determine current log block
> -
>
> Key: HUDI-3694
> URL: https://issues.apache.org/jira/browse/HUDI-3694
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: ZiyueGuan
>Priority: Major
>
> HoodieLogFileReader uses the magic number of the next log block to determine if 
> the current log block is corrupted. However, when the next block has a corrupted 
> magic number, we abandon the current block, which leads to data loss.





[jira] [Commented] (HUDI-2751) To avoid the duplicates for streaming read MOR table

2023-04-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718088#comment-17718088
 ] 

sivabalan narayanan commented on HUDI-2751:
---

We do preserve commit metadata for compaction. So, no records from the new base 
parquet file created by compaction will be served with incremental read. In 
other words, the same record will not be served more than once in an incremental 
read, even with compaction. 
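
For context, a minimal sketch of issuing an incremental read from Spark (Java API; the base path and begin instant below are made up, and the Hudi Spark bundle is assumed to be on the classpath):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-incremental-read").getOrCreate();
    // Only commits completed after instant "099" are returned
    Dataset<Row> incremental = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "099")
        .load("/tmp/hudi/mor_table");
    incremental.show(false);
  }
}
{code}
Because the compaction commit carries over the original commit metadata, the base file it produces is not re-emitted to such an incremental query.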

 

> To avoid the duplicates for streaming read MOR table
> 
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: sivabalan narayanan
>Priority: Critical
>
> Imagine there are commits on the timeline:
> {noformat}
> --- delta-99 --- commit 100 (includes the 99 delta data set) --- delta-101 --- delta-102 ---
>                 first read ->|                second read ->
> -------- range 1 -----------|-------- range 2 ------------------------------------------|
> {noformat}
> Instants 99, 101 and 102 are successful non-compaction delta commits;
> instant 100 is a successful compaction instant.
> The first incremental read consumes up to instant 99 and the second read consumes
> from instant 100 to instant 102, so the second read would consume the commit files
> of instant 100, which have already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction
> instant is scheduled and then completes within *one* consume range.





[GitHub] [hudi] samserpoosh commented on issue #8521: [SUPPORT] Deltastreamer not recognizing config `hoodie.deltastreamer.source.kafka.value.deserializer.class` with PostgresDebeziumSource

2023-04-30 Thread via GitHub


samserpoosh commented on issue #8521:
URL: https://github.com/apache/hudi/issues/8521#issuecomment-1529090886

   @sydneyhoran You might have seen this already, but just in case, I stumbled 
upon [this 
comment](https://github.com/apache/hudi/issues/6348#issuecomment-1223742672) 
which mentioned you should **not** provide `schemaprovider-class` when dealing 
with `XXXDebeziumSource`.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.

2023-04-30 Thread via GitHub


nsivabalan commented on code in PR #8604:
URL: https://github.com/apache/hudi/pull/8604#discussion_r1181266570


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##
@@ -161,27 +161,28 @@ protected void commit(String instantTime, 
Map alreadyCompletedInstant = 
metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
-> entry.getTimestamp().equals(instantTime)).lastInstant();
-if (alreadyCompletedInstant.isPresent()) {
-  // this code path refers to a re-attempted commit that got committed 
to metadata table, but failed in datatable.
-  // for eg, lets say compaction c1 on 1st attempt succeeded in 
metadata table and failed before committing to datatable.
-  // when retried again, data table will first rollback pending 
compaction. these will be applied to metadata table, but all changes
-  // are upserts to metadata table and so only a new delta commit will 
be created.
-  // once rollback is complete, compaction will be retried again, 
which will eventually hit this code block where the respective commit is
-  // already part of completed commit. So, we have to manually remove 
the completed instant and proceed.
-  // and it is for the same reason we enabled 
withAllowMultiWriteOnSameInstant for metadata table.
-  HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
-  metadataMetaClient.reloadActiveTimeline();
+LOG.info(String.format("%s completed commit at %s being applied to 
metadata table",
+alreadyCompletedInstant.isPresent() ? "Already" : "Partially", 
instantTime));
+
+// Rollback the previous committed commit
+if (!writeClient.rollback(instantTime)) {

Review Comment:
   Trying to gauge if we really need this. 
   
   I guess in the next couple of patches you are going to add the change below:
   - Any rollback in DT will be an actual rollback in MDT as well. 
   
   Having said that, let's go through this use-case. 
   
   Compaction commit C5 is inflight in DT and succeeded in MDT, but crashed in 
DT. So on restart, a rollback is triggered in DT, which, when it gets into MDT 
territory, will roll back the succeeded commit in MDT. So, it will be 
automatically taken care of. 
   
   After the rollback of C5 is completed, C5 will be re-attempted in DT, and when 
it gets into MDT territory, there won't be any traces of DC5 at all. So, 
wondering when exactly we will hit this case?
   



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##
@@ -161,27 +161,28 @@ protected void commit(String instantTime, 
Map alreadyCompletedInstant = 
metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry 
-> entry.getTimestamp().equals(instantTime)).lastInstant();
-if (alreadyCompletedInstant.isPresent()) {
-  // this code path refers to a re-attempted commit that got committed 
to metadata table, but failed in datatable.
-  // for eg, lets say compaction c1 on 1st attempt succeeded in 
metadata table and failed before committing to datatable.
-  // when retried again, data table will first rollback pending 
compaction. these will be applied to metadata table, but all changes
-  // are upserts to metadata table and so only a new delta commit will 
be created.
-  // once rollback is complete, compaction will be retried again, 
which will eventually hit this code block where the respective commit is
-  // already part of completed commit. So, we have to manually remove 
the completed instant and proceed.
-  // and it is for the same reason we enabled 
withAllowMultiWriteOnSameInstant for metadata table.
-  HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), 
metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
-  metadataMetaClient.reloadActiveTimeline();
+LOG.info(String.format("%s completed commit at %s being applied to 
metadata table",
+alreadyCompletedInstant.isPresent() ? "Already" : "Partially", 
instantTime));
+
+// Rollback the previous committed commit
+if (!writeClient.rollback(instantTime)) {

Review Comment:
   If we are talking about a partially failed commit in MDT:
   
   Compaction commit C5 is inflight in DT and DC5 in MDT is also partially 
committed and crashed. 
   On restart, when any new operation in DT gets into MDT territory and detects a 
partial commit in MDT, a rollback will be triggered eagerly. Ref: 
https://github.com/apache/hudi/blob/04e54a6187d3aa4f0f05ff4f9ff4c1283a70208c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java#L151
   
   So, this case is also taken care of. 
   




[hudi] branch master updated (1fa9e37df92 -> 04e54a6187d)

2023-04-30 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 1fa9e37df92 [HUDI-6031] fix bug: checkpoint lost after changing cow to 
mor (#8378)
 add 04e54a6187d Revert "[MINOR] Enable Azure CI to publish test results 
(#7943)" (#8612)

No new revisions were added by this update.

Summary of changes:
 azure-pipelines.yml | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)



[GitHub] [hudi] xushiyan merged pull request #8612: Revert "[MINOR] enable publish test results"

2023-04-30 Thread via GitHub


xushiyan merged PR #8612:
URL: https://github.com/apache/hudi/pull/8612





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8493: [HUDI-6098] Use bulk insert prepped for the initial write into MDT.

2023-04-30 Thread via GitHub


nsivabalan commented on code in PR #8493:
URL: https://github.com/apache/hudi/pull/8493#discussion_r1181264755


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1004,6 +1004,57 @@ public static int mapRecordKeyToFileGroupIndex(String 
recordKey, int numFileGrou
 return Math.abs(Math.abs(h) % numFileGroups);
   }
 
+  /**
+   * Return the complete fileID for a file group within a MDT partition.
+   *
+   * MDT fileGroups have the format <fileIDPrefix>-<index>. The fileIDPrefix 
is hardcoded for each MDT partition and index is an integer.
+   *
+   * @param partitionType The type of the MDT partition
+   * @param index Index of the file group within the partition
+   * @return The fileID
+   */
+  public static String getFileIDForFileGroup(MetadataPartitionType 
metadataPartition, int index) {

Review Comment:
   do we have UTs for these.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieMetadataBulkInsertPartitioner.java:
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import scala.Tuple2;
+
+/**
+ * A {@code BulkInsertPartitioner} implementation for Metadata Table to 
improve performance of initialization of metadata
+ * table partition when a very large number of records are inserted.
+ *
+ * This partitioner requires the records to be already tagged with location.
+ */
+public class SparkHoodieMetadataBulkInsertPartitioner implements 
BulkInsertPartitioner<JavaRDD<HoodieRecord>> {
+  private class FileGroupPartitioner extends Partitioner {
+
+@Override
+public int getPartition(Object key) {
+  return ((Tuple2)key)._1;
+}
+
+@Override
+public int numPartitions() {
+  // Max number of file groups supported per partition in MDT. Refer to 
HoodieTableMetadataUtil.getFileIDForFileGroup()
+  return 1;
+}
+  }
+
+  // FileIDs for the various partitions
+  private List fileIDPfxs;
+
+  /**
+   * Partition the records by their location. The number of partitions is 
determined by the number of MDT fileGroups being updated rather than the
+   * specific value of outputSparkPartitions.
+   */
+  @Override
+  public JavaRDD repartitionRecords(JavaRDD 
records, int outputSparkPartitions) {
+Comparator> keyComparator = 
(Comparator> & Serializable)(t1, t2) -> {
+  return t1._2.compareTo(t2._2);
+};
+
+// Partition the records by their file group
+    JavaRDD<HoodieRecord> partitionedRDD = records
+        // key by <fileGroupIndex, recordKey>. The file group index is used 
to partition and the record key is used to sort within the partition.

Review Comment:
   From what I glean, partitioning is based on fileGroupIndex, disregarding the 
MDT partition. So, tell me something: 
   if we have 2 file groups in col stats and 2 file groups for RLI, does the 1st 
file group for both col stats and RLI belong to the same partition in this 
repartition call? 
   
   Should the partitioning be based on the fileId itself, with sorting within 
that based on the record keys within each partition? Or am I missing 
something?



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieMetadataBulkInsertPartitioner.java:
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is 

[GitHub] [hudi] xushiyan opened a new pull request, #8612: Revert "[MINOR] enable publish test results"

2023-04-30 Thread via GitHub


xushiyan opened a new pull request, #8612:
URL: https://github.com/apache/hudi/pull/8612

   Reverts apache/hudi#7943
   
   as this is actually costing money





[hudi] branch revert-7943-publish_test_results created (now c7f96b31612)

2023-04-30 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch revert-7943-publish_test_results
in repository https://gitbox.apache.org/repos/asf/hudi.git


  at c7f96b31612 Revert "[MINOR] Enable Azure CI to publish test results 
(#7943)"

This branch includes the following new commits:

 new c7f96b31612 Revert "[MINOR] Enable Azure CI to publish test results 
(#7943)"

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[hudi] 01/01: Revert "[MINOR] Enable Azure CI to publish test results (#7943)"

2023-04-30 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch revert-7943-publish_test_results
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c7f96b316128ff7647e8e296fde1ee90689053de
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Sun Apr 30 10:13:13 2023 -0700

Revert "[MINOR] Enable Azure CI to publish test results (#7943)"

This reverts commit af61dea6f9862538bb869839efbde08cfb838824.
---
 azure-pipelines.yml | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index 4933976dce1..c6d5aee372c 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -120,7 +120,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Punit-tests -pl 
$(JOB1_MODULES),hudi-client/hudi-spark-client
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - task: Maven@4
@@ -129,7 +129,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Pfunctional-tests -pl $(JOB1_MODULES)
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - script: |
@@ -153,7 +153,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Pfunctional-tests -pl $(JOB2_MODULES)
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - script: |
@@ -177,7 +177,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Punit-tests -pl $(JOB3_MODULES)
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - script: |
@@ -201,7 +201,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Punit-tests -pl $(JOB4_UT_MODULES)
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - task: Maven@4
@@ -210,7 +210,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Pfunctional-tests -pl 
$(JOB4_FT_MODULES)
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - script: |
@@ -234,7 +234,7 @@ stages:
   mavenPomFile: 'pom.xml'
   goals: 'test'
   options: $(MVN_OPTS_TEST) -Pintegration-tests -DskipUTs=false 
-DskipITs=true -pl hudi-integ-test
-  publishJUnitResults: true
+  publishJUnitResults: false
   jdkVersionOption: '1.8'
   mavenOptions: '-Xmx4g'
   - task: AzureCLI@2



[GitHub] [hudi] ad1happy2go commented on issue #6881: Processing time is increased with hudi metadata enable

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #6881:
URL: https://github.com/apache/hudi/issues/6881#issuecomment-1529064535

   @koochiswathiTR Can you test with the latest version of Hudi? We have made 
lots of improvements related to the metadata server. Let us know if you still see 
the error.





[GitHub] [hudi] ad1happy2go commented on issue #6750: [SUPPORT] SqlQueryBasedTransformer causes memory issues

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #6750:
URL: https://github.com/apache/hudi/issues/6750#issuecomment-1529062128

   @tzhang-fetch Couldn't reproduce this issue, as the SQL transformer is working 
fine.
   
   Are you saying that with the same executor and driver memory, the Hudi job got 
killed when using the SQL transformer? 
   Can you let us know the memory configs used?





[GitHub] [hudi] ad1happy2go commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #6596:
URL: https://github.com/apache/hudi/issues/6596#issuecomment-1529061441

   @zhengyuan-cn Are you still facing this issue with the latest Hudi version? 
Can you test with any version after 0.12 and let us know?





[GitHub] [hudi] ad1happy2go commented on issue #5687: [SUPPORT]hudi sql parser ignores all exceptions of spark sql parser

2023-04-30 Thread via GitHub


ad1happy2go commented on issue #5687:
URL: https://github.com/apache/hudi/issues/5687#issuecomment-1529058559

   @melin Closing this issue as we couldn't reproduce it.
   
   In both cases (spark-sql and spark-shell), only the Spark exception 
information is thrown. Please reopen if the issue still exists.
   
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8527: [HUDI-6117] Parallelize the initial creation of file groups for a new MDT partition.

2023-04-30 Thread via GitHub


nsivabalan commented on code in PR #8527:
URL: https://github.com/apache/hudi/pull/8527#discussion_r1181247986


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -731,21 +733,40 @@ public void 
initializeMetadataPartitions(HoodieTableMetaClient dataMetaClient, L
*/
   private void initializeFileGroups(HoodieTableMetaClient dataMetaClient, 
MetadataPartitionType metadataPartition, String instantTime,
 int fileGroupCount) throws IOException {
-final HashMap blockHeader = new HashMap<>();
-blockHeader.put(HeaderMetadataType.INSTANT_TIME, instantTime);
+// Remove all existing file groups or leftover files in the partition
+final Path partitionPath = new Path(metadataWriteConfig.getBasePath(), 
metadataPartition.getPartitionPath());
+FileSystem fs = metadataMetaClient.getFs();
+try {
+  final FileStatus[] existingFiles = fs.listStatus(partitionPath);
+  if (existingFiles.length > 0) {
+LOG.warn("Deleting all existing files found in MDT partition " + 
metadataPartition.getPartitionPath());
+fs.delete(partitionPath, true);
+ValidationUtils.checkState(!fs.exists(partitionPath), "Failed to 
delete MDT partition " + metadataPartition);
+  }
+} catch (FileNotFoundException e) {

Review Comment:
   If the deletion fails, we will throw all the way up and fail the 
ingestion thread, right?






[GitHub] [hudi] hudi-bot commented on pull request #8596: [BUG-FIX] use try with resource to close stream

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8596:
URL: https://github.com/apache/hudi/pull/8596#issuecomment-1528982268

   
   ## CI report:
   
   * 0c8c7d99fc250191a7eba156052f01371e431a30 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16768)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8611: [HUDI-6157] Fix potential data loss for flink streaming source from table with multi writer

2023-04-30 Thread via GitHub


hudi-bot commented on PR #8611:
URL: https://github.com/apache/hudi/pull/8611#issuecomment-1528970694

   
   ## CI report:
   
   * b184b111c6928408d082ce73486f5bd3ae7c6683 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16767)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch master updated: [HUDI-6031] fix bug: checkpoint lost after changing cow to mor (#8378)

2023-04-30 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1fa9e37df92 [HUDI-6031] fix bug: checkpoint lost after changing cow to 
mor (#8378)
1fa9e37df92 is described below

commit 1fa9e37df92c7a38c3909ef1e71f21bad03b3d84
Author: kongwei 
AuthorDate: Sun Apr 30 15:05:37 2023 +0800

[HUDI-6031] fix bug: checkpoint lost after changing cow to mor (#8378)

Co-authored-by: wei.kong 
---
 .../testsuite/HoodieDeltaStreamerWrapper.java  |  2 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java| 48 ++-
 .../deltastreamer/TestHoodieDeltaStreamer.java | 54 ++
 3 files changed, 82 insertions(+), 22 deletions(-)

diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
index 5ff91fa5209..632bbecf10d 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
@@ -78,7 +78,7 @@ public class HoodieDeltaStreamerWrapper extends 
HoodieDeltaStreamer {
   public Pair>> 
fetchSource() throws Exception {
 DeltaSync service = getDeltaSync();
 service.refreshTimeline();
-return service.readFromSource(service.getCommitTimelineOpt());
+return service.readFromSource(service.getCommitsTimelineOpt());
   }
 
   public DeltaSync getDeltaSync() {
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
index e2efe1a4e35..c59510f3676 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -227,11 +227,11 @@ public class DeltaSync implements Serializable, Closeable 
{
   private transient Function 
onInitializingHoodieWriteClient;
 
   /**
-   * Timeline with completed commits.
+   * Timeline with completed commits, including both .commit and .deltacommit.
*/
-  private transient Option commitTimelineOpt;
+  private transient Option commitsTimelineOpt;
 
-  // all commits timeline
+  // all commits timeline, including all (commits, delta commits, compaction, 
clean, savepoint, rollback, replace commits, index)
   private transient Option allCommitsTimelineOpt;
 
   /**
@@ -320,11 +320,9 @@ public class DeltaSync implements Serializable, Closeable {
 .build();
 switch (meta.getTableType()) {
   case COPY_ON_WRITE:
-this.commitTimelineOpt = 
Option.of(meta.getActiveTimeline().getCommitTimeline().filterCompletedInstants());
-this.allCommitsTimelineOpt = 
Option.of(meta.getActiveTimeline().getAllCommitsTimeline());
-break;
   case MERGE_ON_READ:
-this.commitTimelineOpt = 
Option.of(meta.getActiveTimeline().getDeltaCommitTimeline().filterCompletedInstants());
+// we can use getCommitsTimeline for both COW and MOR here, 
because for COW there is no deltacommit
+this.commitsTimelineOpt = 
Option.of(meta.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
 this.allCommitsTimelineOpt = 
Option.of(meta.getActiveTimeline().getAllCommitsTimeline());
 break;
   default:
@@ -361,7 +359,7 @@ public class DeltaSync implements Serializable, Closeable {
   }
 
   private void initializeEmptyTable() throws IOException {
-this.commitTimelineOpt = Option.empty();
+this.commitsTimelineOpt = Option.empty();
 this.allCommitsTimelineOpt = Option.empty();
 String partitionColumns = 
SparkKeyGenUtils.getPartitionColumns(keyGenerator, props);
 HoodieTableMetaClient.withPropertyBuilder()
@@ -398,7 +396,7 @@ public class DeltaSync implements Serializable, Closeable {
 // Refresh Timeline
 refreshTimeline();
 
-Pair>> 
srcRecordsWithCkpt = readFromSource(commitTimelineOpt);
+Pair>> 
srcRecordsWithCkpt = readFromSource(commitsTimelineOpt);
 
 if (null != srcRecordsWithCkpt) {
   // this is the first input batch. If schemaProvider not set, use it and 
register Avro Schema and start
@@ -448,16 +446,16 @@ public class DeltaSync implements Serializable, Closeable 
{
   /**
* Read from Upstream Source and apply transformation if needed.
*
-   * @param commitTimelineOpt Timeline with completed commits
+   * @param commitsTimelineOpt Timeline with completed commits, including 
.commit and .deltacommit
* @return Pair>> Input 
data read from upstream source, consists
* of schemaProvider, checkpointStr and hoodieRecord
* @throws Exception in 

[GitHub] [hudi] bvaradar merged pull request #8378: [HUDI-6031] fix bug: checkpoint lost after changing cow to mor

2023-04-30 Thread via GitHub


bvaradar merged PR #8378:
URL: https://github.com/apache/hudi/pull/8378

