[
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu Li updated HBASE-15171:
--------------------------
Attachment: HBASE-15171.addendum.patch
Attached the addendum patch, which adopts [~ram_krish]'s suggestion. Thanks, Ram, for the note.
Thanks [~tedyu] and [~stack] for the review.
> Avoid counting duplicate kv and generating lots of small hfiles in
> PutSortReducer
> ---------------------------------------------------------------------------------
>
> Key: HBASE-15171
> URL: https://issues.apache.org/jira/browse/HBASE-15171
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0, 1.1.2, 0.98.17
> Reporter: Yu Li
> Assignee: Yu Li
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch,
> HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during
> bulkload, and we found that this generated lots of small hfiles and slowed
> down the whole process.
> After debugging, we found that although PutSortReducer#reduce already tries
> to handle the pathological case by setting a threshold for single-row size
> and using a TreeMap to avoid writing out duplicated kvs, it forgets to exclude
> duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for
> loop, so that duplicated kvs no longer count toward the accumulated size.
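
To illustrate the accounting problem described above, here is a minimal, self-contained sketch. It is not the attached patch: it uses plain strings as stand-ins for KeyValue and string length as a stand-in for heapSize(), and the class and method names (DedupSizeSketch, naiveSize, dedupedSize) are hypothetical. It also assumes the sorted collection behaves like java.util.TreeSet, whose add() returns false when the element is already present. Counting the size only on a successful add is one possible way to keep duplicates out of the running total.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {

    // Mirrors the problematic accounting: every cell adds to the running
    // size, even when the set already contains an equal element.
    static long naiveSize(List<String> cells) {
        TreeSet<String> set = new TreeSet<>();
        long curSize = 0;
        for (String cell : cells) {
            set.add(cell);
            curSize += cell.length(); // stand-in for kv.heapSize()
        }
        return curSize;
    }

    // Counts a cell only if it was actually inserted; TreeSet.add returns
    // false for an element that is already present.
    static long dedupedSize(List<String> cells) {
        TreeSet<String> set = new TreeSet<>();
        long curSize = 0;
        for (String cell : cells) {
            if (set.add(cell)) {
                curSize += cell.length(); // stand-in for kv.heapSize()
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three identical cells and one distinct cell (hypothetical data).
        List<String> cells = Arrays.asList(
            "row1/cf:a", "row1/cf:a", "row1/cf:a", "row1/cf:b");
        System.out.println("naive size:   " + naiveSize(cells));
        System.out.println("deduped size: " + dedupedSize(cells));
    }
}
```

With the naive accounting, a row full of duplicates reaches the threshold quickly even though very little distinct data has been collected, which would be consistent with the many small hfiles described above.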