yihua opened a new pull request, #9711:
URL: https://github.com/apache/hudi/pull/9711
### Change Logs
This PR fixes checkpoint reading in the Hudi streaming sink for Spark
structured streaming. Previously, the checkpoint lookup also read the
requested compaction or clustering plan, which is serialized in Avro, so
deserializing it as JSON failed with the exceptions shown below. With the
fix, only the commit metadata of completed commits, which is serialized in
JSON, is scanned. A new test is added to validate that the checkpoint is
read correctly for the Hudi streaming sink in Spark structured streaming.
```
org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230913003800000__compaction__REQUESTED__20230913155321000]
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:180)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	...
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:514)
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:170)
	... 77 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"Obj\u0001\u0002\u0016avro.schema�{"type":"record","name":"HoodieCompactionPlan","namespace":"org.apache.hudi.avro.model","fields":[{"name":"operations","type":["null",{"type":"array","items":{"type":"record","name":"HoodieCompactionOperation","fields":[{"name":"baseInstantTime","type":["null",{"type":"string","avro.java.string":"String"}]},{"name":"deltaFilePaths","type":["null",{"type":"array","items":{"type":"string","avro.java.string":"String"}}],"default":null},{"name":"dataFilePath","type":["null",{"type"[truncated 1614 chars]; line: 1, column: 11]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2391)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:745)
```
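The `Unrecognized token 'Objavro'` message above comes from feeding an Avro object container file to a JSON parser: such files start with the 4-byte magic `Obj` followed by `0x01`, then the embedded `avro.schema` metadata. The following is an illustrative sketch only (the class and method names are hypothetical, not part of Hudi or this PR), showing how the Avro magic distinguishes the plan bytes from JSON commit metadata:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AvroMagicCheck {
  // Avro object container files begin with the 4-byte magic "Obj" + 0x01,
  // which is why Jackson reports the unrecognized token 'Objavro' when the
  // plan bytes are parsed as JSON.
  static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

  static boolean looksLikeAvroContainer(byte[] bytes) {
    return bytes.length >= AVRO_MAGIC.length
        && Arrays.equals(Arrays.copyOfRange(bytes, 0, AVRO_MAGIC.length), AVRO_MAGIC);
  }

  public static void main(String[] args) {
    // Mirrors the prefix seen in the exception message above.
    byte[] avroLike = "Obj\u0001\u0002\u0016avro.schema".getBytes(StandardCharsets.ISO_8859_1);
    byte[] jsonLike = "{\"partitionToWriteStats\":{}}".getBytes(StandardCharsets.UTF_8);
    System.out.println(looksLikeAvroContainer(avroLike)); // true
    System.out.println(looksLikeAvroContainer(jsonLike)); // false
  }
}
```

Note the PR's actual fix does not sniff bytes; it avoids reading these instants altogether, as described in the Change Logs.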
```
org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230913004800000__replacecommit__REQUESTED__20230913155245000]
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:180)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:514)
	at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$3(CommitUtils.java:170)
	... 77 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"Obj\u0001\u0002\u0016avro.schema�&{"type":"record","name":"HoodieRequestedReplaceMetadata","namespace":"org.apache.hudi.avro.model","fields":[{"name":"operationType","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"clusteringPlan","type":["null",{"type":"record","name":"HoodieClusteringPlan","fields":[{"name":"inputGroups","type":["null",{"type":"array","items":{"type":"record","name":"HoodieClusteringGroup","fields":[{"name":"slices","type":["null",{"type":"array","items""[truncated 2037 chars]; line: 1, column: 11]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2391)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:745)
```
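The core idea of the fix can be illustrated with a minimal sketch. The `Instant` record, class name, and method name below are hypothetical simplifications, not Hudi's actual timeline API; the instant timestamps, actions, and states mirror the failing timelines in the traces above. When resolving the latest checkpoint, only completed commit-type instants (whose metadata is JSON) are considered, so requested compaction and clustering (replacecommit) plans, which are Avro-serialized, are never parsed:

```java
import java.util.List;
import java.util.Optional;

public class CheckpointScanSketch {
  // Hypothetical, simplified stand-in for a Hudi timeline instant.
  record Instant(String timestamp, String action, String state) {
    boolean isCompleted() {
      return "COMPLETED".equals(state);
    }
  }

  // Only completed commits carry JSON commit metadata that may hold the
  // streaming-sink checkpoint; requested compaction and clustering plans
  // are Avro and must be skipped rather than parsed as JSON.
  static Optional<Instant> latestJsonReadableInstant(List<Instant> timeline) {
    return timeline.stream()
        .filter(Instant::isCompleted)
        .filter(i -> List.of("commit", "deltacommit", "replacecommit").contains(i.action()))
        .reduce((first, second) -> second); // timeline is in ascending order
  }

  public static void main(String[] args) {
    List<Instant> timeline = List.of(
        new Instant("20230913003700000", "deltacommit", "COMPLETED"),
        new Instant("20230913003800000", "compaction", "REQUESTED"),    // Avro compaction plan
        new Instant("20230913004800000", "replacecommit", "REQUESTED")  // Avro clustering plan
    );
    // The requested Avro-serialized instants are filtered out, leaving the
    // completed delta commit as the latest JSON-readable instant.
    System.out.println(latestJsonReadableInstant(timeline)
        .map(Instant::timestamp).orElse("none")); // 20230913003700000
  }
}
```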
### Impact
Bug fix
### Risk level
Low
### Documentation Update
N/A
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed