[I] [SUPPORT] Rollback failed clustering 0.12.2 [hudi]

via GitHub Fri, 05 Apr 2024 10:01:08 -0700


VitoMakarevich opened a new issue, #10964:
URL: https://github.com/apache/hudi/issues/10964

**Describe the problem you faced**

Hello, this is a follow-up to https://github.com/apache/hudi/issues/10878.
We managed to run clustering, but I am concerned about the recovery plan if it
fails.
The behavior I know: when `.commit.requested` and `.commit.inflight` have been
created but `.commit` has not, the subsequent write performs a rollback. This
works for normal commits.
However, if I start clustering and the job stops before
`.replacecommit.inflight` is created, a subsequent write fails if it affects a
partition present in `.replacecommit.requested`. This is controlled by
[hoodie.clustering.updates.strategy](https://hudi.apache.org/docs/configurations/#hoodieclusteringupdatesstrategy).
In this case I can only either run clustering from the CLI or just delete the
instant (can you confirm? From the code, this looks safe as long as there is no
`.inflight` file).
But if it fails after it starts writing files (after `.replacecommit.inflight`
is created, but before `.replacecommit` is created), what options do I have?
From what I saw in the code, there appears to be no automatic rollback for
`replacecommit`, and `hudi-cli` supports rollback only for finished instants.
Given this, can you answer two questions:
1. If clustering failed after `.replacecommit.requested` was created, but
before `.replacecommit.inflight`, is it safe to just delete the commit file
itself? A recently added PR appears to do exactly this:
https://github.com/apache/hudi/pull/10645/files
2. If clustering failed after `.replacecommit.inflight` was created, but before
`.replacecommit`, what are the recovery steps?
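
To make the two failure windows above concrete, here is a minimal Python sketch that reports how far each clustering instant got, given a listing of timeline file names. It assumes the files are named `<instant_time><suffix>` with the three suffixes mentioned in this issue; the function name and the sample instant times are hypothetical, and this is not Hudi's own API.

```python
from collections import defaultdict

# Suffixes as named in this issue. The "<instant_time><suffix>" file-name
# layout is an assumption made for illustration.
SUFFIX_TO_STATE = {
    ".replacecommit.requested": "requested",
    ".replacecommit.inflight": "inflight",
    ".replacecommit": "completed",
}
STATE_ORDER = ["requested", "inflight", "completed"]


def classify_replacecommits(file_names):
    """Map each instant time to the furthest state its replacecommit reached."""
    seen = defaultdict(set)
    for name in file_names:
        for suffix, state in SUFFIX_TO_STATE.items():
            if name.endswith(suffix):
                instant = name[: -len(suffix)]
                seen[instant].add(state)
                break
    # Pick the most advanced state observed for each instant.
    return {
        instant: max(states, key=STATE_ORDER.index)
        for instant, states in seen.items()
    }


if __name__ == "__main__":
    # Hypothetical listing of a timeline directory.
    listing = [
        "20240405080000.replacecommit.requested",
        "20240405080000.replacecommit.inflight",
        "20240405080000.replacecommit",
        "20240405090000.replacecommit.requested",
        "20240405100000.replacecommit.requested",
        "20240405100000.replacecommit.inflight",
    ]
    for instant, state in sorted(classify_replacecommits(listing).items()):
        print(instant, state)
```

An instant reported as `requested` corresponds to question 1, and one reported as `inflight` corresponds to question 2. The sketch only identifies the state; it does not perform or recommend any recovery action.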

**Environment Description**

* Hudi version : 0.12.2

* Spark version : 3.3.0


