yihua opened a new pull request, #9711:
URL: https://github.com/apache/hudi/pull/9711
### Change Logs
This PR fixes checkpoint reading in the Hudi streaming sink for Spark
structured streaming. Previously, the checkpoint lookup also read the
requested compaction or clustering plan, which is serialized in Avro, so
deserializing it as JSON failed with the exceptions shown below. With the
fix, only the commit metadata of completed commits, which is serialized in
JSON, is scanned. A new test is added to validate that the checkpoint is
read correctly for the Hudi streaming sink in Spark structured streaming.
```
org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230913003800000__compaction__REQUESTED__20230913155321000]
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:180)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	...
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:514)
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:170)
	... 77 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"Obj\u0001\u0002\u0016avro.schema�{"type":"record","name":"HoodieCompactionPlan","namespace":"org.apache.hudi.avro.model","fields":[{"name":"operations","type":["null",{"type":"array","items":{"type":"record","name":"HoodieCompactionOperation","fields":[{"name":"baseInstantTime","type":["null",{"type":"string","avro.java.string":"String"}]},{"name":"deltaFilePaths","type":["null",{"type":"array","items":{"type":"string","avro.java.string":"String"}}],"default":null},{"name":"dataFilePath","type":["null",{"type"[truncated 1614 chars]; line: 1, column: 11]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2391)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:745)
```
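The `Unrecognized token 'Objavro'` message above comes from feeding an Avro object container file to a JSON parser: such files start with the 4-byte magic `Obj` followed by `0x01`, then the embedded `avro.schema` metadata. The following is an illustrative sketch only (the class and method names are hypothetical, not part of Hudi or this PR), showing how the Avro magic distinguishes the plan bytes from JSON commit metadata:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AvroMagicCheck {
  // Avro object container files begin with the 4-byte magic "Obj" + 0x01,
  // which is why Jackson reports the unrecognized token 'Objavro' when the
  // plan bytes are parsed as JSON.
  static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

  static boolean looksLikeAvroContainer(byte[] bytes) {
    return bytes.length >= AVRO_MAGIC.length
        && Arrays.equals(Arrays.copyOfRange(bytes, 0, AVRO_MAGIC.length), AVRO_MAGIC);
  }

  public static void main(String[] args) {
    // Mirrors the prefix seen in the exception message above.
    byte[] avroLike = "Obj\u0001\u0002\u0016avro.schema".getBytes(StandardCharsets.ISO_8859_1);
    byte[] jsonLike = "{\"partitionToWriteStats\":{}}".getBytes(StandardCharsets.UTF_8);
    System.out.println(looksLikeAvroContainer(avroLike)); // true
    System.out.println(looksLikeAvroContainer(jsonLike)); // false
  }
}
```

Note the PR's actual fix does not sniff bytes; it avoids reading these instants altogether, as described in the Change Logs.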
```
org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230913004800000__replacecommit__REQUESTED__20230913155245000]
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:180)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:514)
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:170)
	... 77 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"Obj\u0001\u0002\u0016avro.schema�&{"type":"record","name":"HoodieRequestedReplaceMetadata","namespace":"org.apache.hudi.avro.model","fields":[{"name":"operationType","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"clusteringPlan","type":["null",{"type":"record","name":"HoodieClusteringPlan","fields":[{"name":"inputGroups","type":["null",{"type":"array","items":{"type":"record","name":"HoodieClusteringGroup","fields":[{"name":"slices","type":["null",{"type":"array","items""[truncated 2037 chars]; line: 1, column: 11]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2391)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:745)
```
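The core idea of the fix can be illustrated with a minimal sketch. The `Instant` record, class name, and method name below are hypothetical simplifications, not Hudi's actual timeline API; the instant timestamps, actions, and states mirror the failing timelines in the traces above. When resolving the latest checkpoint, only completed commit-type instants (whose metadata is JSON) are considered, so requested compaction and clustering (replacecommit) plans, which are Avro-serialized, are never parsed:

```java
import java.util.List;
import java.util.Optional;

public class CheckpointScanSketch {
  // Hypothetical, simplified stand-in for a Hudi timeline instant.
  record Instant(String timestamp, String action, String state) {
    boolean isCompleted() {
      return "COMPLETED".equals(state);
    }
  }

  // Only completed commits carry JSON commit metadata that may hold the
  // streaming-sink checkpoint; requested compaction and clustering plans
  // are Avro and must be skipped rather than parsed as JSON.
  static Optional<Instant> latestJsonReadableInstant(List<Instant> timeline) {
    return timeline.stream()
        .filter(Instant::isCompleted)
        .filter(i -> List.of("commit", "deltacommit", "replacecommit").contains(i.action()))
        .reduce((first, second) -> second); // timeline is in ascending order
  }

  public static void main(String[] args) {
    List<Instant> timeline = List.of(
        new Instant("20230913003700000", "deltacommit", "COMPLETED"),
        new Instant("20230913003800000", "compaction", "REQUESTED"),    // Avro compaction plan
        new Instant("20230913004800000", "replacecommit", "REQUESTED")  // Avro clustering plan
    );
    // The requested Avro-serialized instants are filtered out, leaving the
    // completed delta commit as the latest JSON-readable instant.
    System.out.println(latestJsonReadableInstant(timeline)
        .map(Instant::timestamp).orElse("none")); // 20230913003700000
  }
}
```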
### Impact
Bug fix
### Risk level
Low
### Documentation Update
N/A
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed