ssdong opened a new pull request #2784: URL: https://github.com/apache/hudi/pull/2784
## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request ### Summary: This pull request is to fix the archival logic within insert overwrite API against requested & inflight commit files. ### Issue: `0.7.0` throw exception `Caused by: java.lang.IllegalArgumentException: Positive number of partitions required` while `0.9.0-SNAPSHOT`(latest master branch) adds `java.util.NoSuchElementException: No value present in Option` on top of it when hudi tries to archive replace commit files(`COMPLETED`, `REQUESTED` and `INFLIGHT`) Please checkout issue https://github.com/apache/hudi/issues/2707 and ticket https://issues.apache.org/jira/browse/HUDI-1740 for further information about the above 2 exceptions. The inner causes are somewhat sophisticated, and I've tried my best to understand them and applied the fix for it, instead of giving it a one-line tricky fix or patch and see errors gone. Of course, the approaches I am taking are open to discussions. ### Fixes 1. `Positive number of partitions required` error is easier to be fixed where we just have to filter out empty `partitionToReplaceFileIds` for `COMPLETED` replace commit files within `ReplaceArchivalHelper.java`. 2. `java.util.NoSuchElementException: No value present in Option` is much more complicated and it happened due to a call to `ClusteringUtils.getRequestedReplaceMetadata()` against _both_ `REQUESTED` and `INFLIGHT` commit files for retrieving their meta body. Now, I get the idea that we are encouraged to use existing utils classes for code reuse. However, a closer inspection upon `getRequestedReplaceMetadata` shows that clustering feature retrieves the meta for `INFLIGHT` commit file through a `REQUESTED` instant. Although this is _not_ fundamentally wrong since there is no "clustering plan" for both `REQUESTED` and `INFLIGHT` replace commit files so the outcome is the same for both, as it is also being pointed out in the comment within `getRequestedReplaceMetadata`. However, since `REQUESTED` instant is empty(there is a corresponding [ticket](https://issues.apache.org/jira/browse/HUDI-1740) for it), it generates an `Option.empty()` which is being fetched by `.get()` later which t riggered the `NoSuchElementException`. What's more, it _loses_ information in `INFLIGHT` commit file when fetching via `REQUESTED` instant and so we observed in the following screenshot that: <img width="1037" alt="Screen Shot 2021-04-07 at 2 09 07" src="https://user-images.githubusercontent.com/3754011/113918516-7e937a80-981d-11eb-84b6-e2c4bec2c3b1.png"> This does not make sense to me when we pretty much _abuse_ the `REQUESTED` concept to deal with `INFLIGHT` with `REQUESTED` itself being empty. I've taken the approach to define an extra field(placeholder) for `INFLIGHT` and reuse the `HoodieCommitMetadata` for deserializing `INFLIGHT` since they share the same structure where `COMPLETED` replace commit extends `HoodieCommitMetadata` for the extra `partitionToReplaceFileIds` field. Now, here's what I gained after adopting this strategy: <img width="1065" alt="Screen Shot 2021-04-08 at 3 01 51" src="https://user-images.githubusercontent.com/3754011/113919657-cff03980-981e-11eb-928e-65c719d15ca5.png"> ### The overall outcome after the fix on latest master branch: <img width="924" alt="Screen Shot 2021-04-08 at 3 04 29" src="https://user-images.githubusercontent.com/3754011/113919768-ee563500-981e-11eb-80e8-fe62ba868709.png"> Let me know if there is anything I am missing. :) _The simplest solution to the 2nd issue is to actually having `ClusteringUtils.getRequestedReplaceMetadata` return `Option.of(new HoodieRequestedReplaceMetadata())` upon retrieving the empty `REQUESTED` replace commit file for both `REQUESTED` and `INFLIGHT`. I didn't choose to fix the problem this way fearing it's basically altering the inappropriate approach by giving it a bandage and stopping the bleeding._ ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: - *Fix insert overwrite API archival* ## Committer checklist - [x] Has a corresponding JIRA in PR title & commit - [x] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
