[
https://issues.apache.org/jira/browse/HUDI-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
leesf resolved HUDI-574.
------------------------
Fix Version/s: 0.5.2 (was: 0.6.0)
Resolution: Fixed
Fixed via master: 01c868ab86e33161b03d422c03af41f01947ea06
> CLI counts small file inserts as updates
> ----------------------------------------
>
> Key: HUDI-574
> URL: https://issues.apache.org/jira/browse/HUDI-574
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: CLI
> Reporter: Vinoth Chandar
> Assignee: lamber-ken
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.2
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> User report:
>
> I'm trying to understand the {{.commit}} output and how it relates to the
> output from the {{hudi-cli}} tool, and I'm finding it difficult to reconcile
> my findings. Specifically, I want to know the number of
> updates/inserts/deletes across all partitions for a given commit (an upsert).
> From the {{cli}}:
> hudi:exec_unit_ver->commit showpartitions --commit 20200108153617
> ╔════════════════╤═══════════════════╤═════════════════════╤════════════════════════╤═══════════════════════╤═════════════════════╤══════════════╗
> ║ Partition Path │ Total Files Added │ Total Files Updated │ Total Records Inserted │ Total Records Updated │ Total Bytes Written │ Total Errors ║
> ╠════════════════╪═══════════════════╪═════════════════════╪════════════════════════╪═══════════════════════╪═════════════════════╪══════════════╣
> ║ 0              │ 0                 │ 9                   │ 0                      │ 2091                  │ 983.7 MB            │ 0            ║
> ╟────────────────┼───────────────────┼─────────────────────┼────────────────────────┼───────────────────────┼─────────────────────┼──────────────╢
> But in the {{20200108153617.commit}} file for that commit, one of the files
> in partition "0" has
> "numInserts" : 44448,
> so I'm not sure why {{Total Records Inserted}} is reported as zero. I checked
> that the sum of {{numUpdateWrites}} across all files in the partition matches
> 2091. Generally, I think it would be helpful to have {{totalRecordsInserted}},
> {{totalRecordsUpdated}}, and {{totalRecordsDeleted}} in the commit metadata
> (although it's not a big issue to sum the individual numbers from each file
> in each partition).
>
> [~vinoth]
>
> On the counts, when I checked the code, it's counting the inserts as updates,
> since Hudi packed them onto existing files to honor the target file size:
> for (HoodieWriteStat stat : stats) {
>   if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
>     totalFilesAdded += 1;
>     totalRecordsInserted += stat.getNumWrites();
>   } else {
>     totalFilesUpdated += 1;
>     totalRecordsUpdated += stat.getNumUpdateWrites();
>   }
>   totalBytesWritten += stat.getTotalWriteBytes();
>   totalWriteErrors += stat.getTotalWriteErrors();
> }
>
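To make the miscount in the quoted loop concrete, here is a minimal Java sketch of one way to tally per-partition totals so that inserts packed into existing files are still counted as inserts. It is not necessarily what the fix commit does. WriteStat, Totals, and summarize are hypothetical stand-ins (the real class is HoodieWriteStat), the "null" value for NULL_COMMIT is an assumption, and the numbers in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CommitPartitionTotals {

    // Stand-in for HoodieWriteStat.NULL_COMMIT; the literal value is an assumption.
    static final String NULL_COMMIT = "null";

    /** Hypothetical stand-in holding only the per-file fields used below. */
    static final class WriteStat {
        final String prevCommit;
        final long numInserts;
        final long numUpdateWrites;

        WriteStat(String prevCommit, long numInserts, long numUpdateWrites) {
            this.prevCommit = prevCommit;
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
        }
    }

    /** Hypothetical holder for the totals of one partition. */
    static final class Totals {
        long filesAdded;
        long filesUpdated;
        long recordsInserted;
        long recordsUpdated;
    }

    static Totals summarize(List<WriteStat> stats) {
        Totals t = new Totals();
        for (WriteStat stat : stats) {
            // File-level counters still depend on whether the file existed before.
            if (NULL_COMMIT.equals(stat.prevCommit)) {
                t.filesAdded += 1;
            } else {
                t.filesUpdated += 1;
            }
            // Record-level counters are taken per file regardless of file age,
            // so inserts written into an existing (small) file are not lost.
            t.recordsInserted += stat.numInserts;
            t.recordsUpdated += stat.numUpdateWrites;
        }
        return t;
    }

    public static void main(String[] args) {
        // Made-up data: one existing file that received inserts and updates,
        // and one brand-new file that received only inserts.
        List<WriteStat> stats = Arrays.asList(
            new WriteStat("20200101000000", 40, 7),
            new WriteStat(NULL_COMMIT, 5, 0));
        Totals t = summarize(stats);
        System.out.println("filesAdded=" + t.filesAdded
            + " filesUpdated=" + t.filesUpdated
            + " recordsInserted=" + t.recordsInserted
            + " recordsUpdated=" + t.recordsUpdated);
    }
}
```

With the loop quoted above, the 40 inserts in the existing file would not appear under inserted records at all, which is consistent with the zero shown by commit showpartitions in the report.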