Yu Li created HBASE-15171:
-----------------------------
Summary: Avoid counting duplicated kv and generating lots of small
hfiles in PutSortReducer
Key: HBASE-15171
URL: https://issues.apache.org/jira/browse/HBASE-15171
Project: HBase
Issue Type: Sub-task
Reporter: Yu Li
Assignee: Yu Li
One of our online users wrote a huge number of duplicated kvs during bulkload, and
we found this generated lots of small hfiles and slowed down the whole process.
After debugging, we found that although PutSortReducer#reduce already tries to
handle the pathological case by setting a threshold on single-row size and using
a TreeSet to avoid writing out duplicated kvs, it forgets to exclude the
duplicates from the accumulated size, as shown in the code segment below:
{code}
    while (iter.hasNext() && curSize < threshold) {
      Put p = iter.next();
      for (List<Cell> cells : p.getFamilyCellMap().values()) {
        for (Cell cell : cells) {
          KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
          map.add(kv);
          // counted even when map.add(kv) returned false for a duplicate
          curSize += kv.heapSize();
        }
      }
    }
{code}
We should only accumulate {{kv.heapSize()}} into {{curSize}} when {{map.add(kv)}}
returns true, i.e. when the kv is not a duplicate. Otherwise duplicated kvs push
{{curSize}} past the threshold prematurely, causing the reducer to flush many
small hfiles.
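A minimal standalone sketch of the proposed guard (not the actual HBase code; the class name, stand-in element type, and use of string length in place of {{kv.heapSize()}} are illustrative assumptions). It relies on the fact that {{TreeSet#add}} returns false for an element already present, so duplicates never contribute to the accumulated size:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo of the guarded accumulation: only count an element's
// size when TreeSet.add() reports it was actually inserted, so duplicates
// no longer inflate curSize toward the flush threshold.
public class DedupSizeDemo {
    static long countedSize(List<String> kvs) {
        TreeSet<String> map = new TreeSet<>();
        long curSize = 0;
        for (String kv : kvs) {
            if (map.add(kv)) {          // add() returns false for duplicates
                curSize += kv.length(); // stand-in for kv.heapSize()
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three copies of the same kv are counted only once.
        System.out.println(countedSize(Arrays.asList("rowA", "rowA", "rowA")));
    }
}
```

With the unguarded version, the same input would be counted three times and reach a size threshold three times faster.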
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)