yihua opened a new pull request, #6646:
URL: https://github.com/apache/hudi/pull/6646
### Change Logs
This PR fixes the serialization of commit metadata so that redundant fields
are no longer written to the JSON output.
The commit metadata in JSON (*.commit, *.deltacommit) written to the Hudi
timeline under `.hoodie/` contains redundant fields that can be trimmed. As
shown below, the same set of write stats is written to both
"partitionToWriteStats" and "writeStats", doubling the size and increasing the
serde overhead. Other fields like "totalRecordsDeleted",
"writePartitionPaths", "fileIdAndRelativePaths", etc., can be removed as well,
because they are derived from "partitionToWriteStats" and are not stored as
fields of the `HoodieCommitMetadata` class.
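To illustrate why these fields carry no extra information, here is a minimal sketch that recomputes them from a partition-to-write-stats map. The `DerivedCommitFields` class and its trimmed-down `WriteStat` type are hypothetical stand-ins for illustration, not the actual Hudi classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DerivedCommitFields {
    // Hypothetical, trimmed-down stand-in for one write stat entry.
    static final class WriteStat {
        final String fileId;
        final String path;
        final long numDeletes;

        WriteStat(String fileId, String path, long numDeletes) {
            this.fileId = fileId;
            this.path = path;
            this.numDeletes = numDeletes;
        }
    }

    // "writeStats": the per-partition lists flattened into one list.
    static List<WriteStat> writeStats(Map<String, List<WriteStat>> byPartition) {
        List<WriteStat> all = new ArrayList<>();
        byPartition.values().forEach(all::addAll);
        return all;
    }

    // "writePartitionPaths": simply the keys of the map.
    static Set<String> writePartitionPaths(Map<String, List<WriteStat>> byPartition) {
        return byPartition.keySet();
    }

    // "totalRecordsDeleted": the sum of numDeletes over all write stats.
    static long totalRecordsDeleted(Map<String, List<WriteStat>> byPartition) {
        long total = 0;
        for (WriteStat stat : writeStats(byPartition)) {
            total += stat.numDeletes;
        }
        return total;
    }

    // "fileIdAndRelativePaths": file ID mapped to path for every write stat.
    static Map<String, String> fileIdAndRelativePaths(Map<String, List<WriteStat>> byPartition) {
        Map<String, String> result = new LinkedHashMap<>();
        for (WriteStat stat : writeStats(byPartition)) {
            result.put(stat.fileId, stat.path);
        }
        return result;
    }
}
```

Since each value can be rebuilt on demand from the one stored map, persisting it alongside "partitionToWriteStats" only adds bytes to the timeline files.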
The root cause of the problem is that, when serializing
`HoodieCommitMetadata` and `HoodieReplaceCommitMetadata` instances to a JSON
string, the getters are also included, which introduces unnecessary fields. The
fix is to exclude getters, setters and creators in the serialization config.
After the fix, the fields are consistent with the Avro schema definitions, i.e.,
`HoodieCommitMetadata.avsc` and `HoodieReplaceCommitMetadata.avsc`.
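The effect of getter-based property discovery can be sketched with plain reflection. This is an illustration of the mechanism only, not the serializer configuration changed in this PR; `GetterVsFieldDemo` and its `Meta` class are hypothetical names, and `Meta` holds one stored field plus getters that merely derive values from it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class GetterVsFieldDemo {
    // Hypothetical stand-in: one stored field, two getters that only derive values.
    static class Meta {
        private final Map<String, List<String>> partitionToWriteStats = new TreeMap<>();

        public Map<String, List<String>> getPartitionToWriteStats() {
            return partitionToWriteStats;
        }

        public Set<String> getWritePartitionPaths() {
            return partitionToWriteStats.keySet();
        }

        public List<String> getWriteStats() {
            List<String> all = new ArrayList<>();
            partitionToWriteStats.values().forEach(all::addAll);
            return all;
        }
    }

    // Property names visible when only instance fields are considered.
    static Set<String> fromFields(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Field f : type.getDeclaredFields()) {
            if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
                names.add(f.getName());
            }
        }
        return names;
    }

    // Property names visible when public no-arg getXxx() methods are considered.
    static Set<String> fromGetters(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            String n = m.getName();
            boolean isGetter = Modifier.isPublic(m.getModifiers())
                && !Modifier.isStatic(m.getModifiers())
                && m.getParameterCount() == 0
                && n.startsWith("get")
                && n.length() > 3;
            if (isGetter) {
                names.add(Character.toLowerCase(n.charAt(3)) + n.substring(4));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fromFields(Meta.class));  // [partitionToWriteStats]
        System.out.println(fromGetters(Meta.class)); // [partitionToWriteStats, writePartitionPaths, writeStats]
    }
}
```

A serializer that discovers properties through getters emits every derived value as its own JSON field, while one restricted to fields emits only what the class actually stores.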
Sample commit metadata before this change:
```
{
"partitionToWriteStats" : {
"2022/1/31" : [ {
"fileId" : "0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0",
"path" :
"2022/1/31/0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0_0-9-38_20220410134618909.parquet",
"prevCommit" : "20220410134320333",
"numWrites" : 250175,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 50035,
"totalWriteBytes" : 90720802,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "2022/1/31",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 90720802,
"minEventTime" : null,
"maxEventTime" : null
} ],
...
},
"compacted" : false,
"extraMetadata" : {
"schema" :
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[{\"name\":\"key\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"ts\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"textField\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"decimalField\",\"type\":[\"null\",\"float\"],\"default\":null},{\"name\":\"longField\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"arrayField\",\"type\":[\"null\",{\"type\":\"array\",\"items\":[\"int\",\"null\"]}],\"default\":null},{\"name\":\"mapField\",\"type\":[\"null\",{\"type\":\"map\",\"values\":[\"int\",\"null\"]}],\"default\":null},{\"name\":\"round\",\"type\":[\"null\",\"int\"],\"default\":null}]}",
"deltastreamer.checkpoint.key" : "17"
},
"operationType" : "INSERT",
"writeStats" : [ {
"fileId" : "0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0",
"path" :
"2022/1/31/0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0_0-9-38_20220410134618909.parquet",
"prevCommit" : "20220410134320333",
"numWrites" : 250175,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 50035,
"totalWriteBytes" : 90720802,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "2022/1/31",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 90720802,
"minEventTime" : null,
"maxEventTime" : null
},
...
],
"totalRecordsDeleted" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 0,
"totalUpsertTime" : 309120,
"minAndMaxEventTime" : {
"Optional.empty" : {
"val" : null,
"present" : false
}
},
"writePartitionPaths" : [ "2022/1/31", "2022/1/30", "2022/1/28",
"2022/1/27", "2022/2/2", "2022/1/29", "2022/1/24", "2022/2/1", "2022/1/26",
"2022/1/25" ],
"fileIdAndRelativePaths" : {
"3e31414c-fb4c-4ce9-aa27-a43640d94430-0" :
"2022/1/25/3e31414c-fb4c-4ce9-aa27-a43640d94430-0_9-9-47_20220410134618909.parquet",
...
},
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0
}
```
Sample commit metadata after this change:
```
{
"partitionToWriteStats" : {
"2022/1/31" : [ {
"fileId" : "0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0",
"path" :
"2022/1/31/0cb6ac8a-ee31-4f00-a359-ba6ebfb80463-0_0-9-38_20220410134618909.parquet",
"prevCommit" : "20220410134320333",
"numWrites" : 250175,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 50035,
"totalWriteBytes" : 90720802,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "2022/1/31",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 90720802,
"minEventTime" : null,
"maxEventTime" : null
} ],
...
},
"compacted" : false,
"extraMetadata" : {
"schema" :
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[{\"name\":\"key\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"ts\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"textField\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"decimalField\",\"type\":[\"null\",\"float\"],\"default\":null},{\"name\":\"longField\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"arrayField\",\"type\":[\"null\",{\"type\":\"array\",\"items\":[\"int\",\"null\"]}],\"default\":null},{\"name\":\"mapField\",\"type\":[\"null\",{\"type\":\"map\",\"values\":[\"int\",\"null\"]}],\"default\":null},{\"name\":\"round\",\"type\":[\"null\",\"int\"],\"default\":null}]}",
"deltastreamer.checkpoint.key" : "17"
},
"operationType" : "INSERT"
}
```
### Impact
**Risk level: low**
This trims the fields in commit metadata. The existing APIs fetching the
commit information still work.
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed