cdmikechen commented on a change in pull request #1119: Fix:
HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
##########
@@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
long totalInsertRecordsWritten = 0;
for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
for (HoodieWriteStat stat : stats) {
- if (stat.getPrevCommit() != null &&
stat.getPrevCommit().equalsIgnoreCase("null")) {
Review comment:
@n3nash
I think the processing logic of this method is the same as that of
`fetchTotalFilesUpdated`
```
public long fetchTotalUpdateRecordsWritten() {
long totalUpdateRecordsWritten = 0;
for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
for (HoodieWriteStat stat : stats) {
totalUpdateRecordsWritten += stat.getNumUpdateWrites();
}
}
return totalUpdateRecordsWritten;
}
```
Let me give you an example in my test. I create a hudi data with 3 rows
first, and then insert a row, at last insert a row and delete a row. I print
all method return number and test the number to check.
```java
for (HoodieInstant commit : commits) {
HoodieCommitMetadata commitMetadata =
HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(commit).get(),
HoodieCommitMetadata.class);
JSONObject data = new JSONObject();
data.put("timestamp", commit.getTimestamp());
data.put("fetchTotalBytesWritten",
commitMetadata.fetchTotalBytesWritten());
data.put("fetchTotalFilesInsert",
commitMetadata.fetchTotalFilesInsert());
data.put("fetchTotalFilesUpdated",
commitMetadata.fetchTotalFilesUpdated());
data.put("fetchTotalPartitionsWritten",
commitMetadata.fetchTotalPartitionsWritten());
data.put("fetchTotalRecordsWritten",
commitMetadata.fetchTotalRecordsWritten());
data.put("fetchTotalUpdateRecordsWritten",
commitMetadata.fetchTotalUpdateRecordsWritten());
data.put("fetchTotalInsertRecordsWritten",
commitMetadata.fetchTotalInsertRecordsWritten());
data.put("fetchTotalWriteErrors",
commitMetadata.fetchTotalWriteErrors());
long totalDeleteRecordsWritten= 0;
for (List<HoodieWriteStat> stats :
commitMetadata.getPartitionToWriteStats().values()) {
for (HoodieWriteStat stat : stats) {
if (stat.getPrevCommit() != null) {
totalDeleteRecordsWritten += stat.getNumDeletes();
}
}
}
data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);
datas.add(data);
}
log.debug("datas = {}", datas);
```
If I use old if `(stat.getPrevCommit() != null &&
stat.getPrevCommit().equalsIgnoreCase("null"))` it will print:
```json
[
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923172508"
},
{
"fetchTotalBytesWritten": 435227,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160439"
},
{
"fetchTotalBytesWritten": 435478,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 1,
"fetchTotalFilesUpdated": 0,
"fetchTotalInsertRecordsWritten": 3,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 3,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160340"
}
]
```
If I use modified if `(stat.getPrevCommit() != null)` it will print:
```json
[
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 1,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923172508"
},
{
"fetchTotalBytesWritten": 435227,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 1,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160439"
},
{
"fetchTotalBytesWritten": 435478,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 1,
"fetchTotalFilesUpdated": 0,
"fetchTotalInsertRecordsWritten": 3,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 3,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160340"
}
]
```
I guess the person who originally wrote this method copied the
`fetchTotalFilesInsert` method at that time, but he forgot that the processing
logic of adding files is different from that adding data in files.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services