[ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122196#comment-15122196 ]
Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-Trunk_matrix #665 (See [https://builds.apache.org/job/HBase-Trunk_matrix/665/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev 37ed0f6d0815389e0b368bc98b3a01dd02f193ac)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java

> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated KVs during bulkload, and we found this generated lots of small hfiles and slowed down the whole process.
> After debugging, we found that in PutSortReducer#reduce, although it already tries to handle the pathological case by setting a threshold on single-row size and keeping a TreeMap to avoid writing out duplicated KVs, it forgets to exclude those duplicated KVs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells : p.getFamilyCellMap().values()) {
>     for (Cell cell : cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for loop.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
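The accounting bug above can be sketched in isolation: a sorted-set `add` returns {{false}} for an element already present, so the heap size should only be accumulated when the insert actually happened. The snippet below is a minimal, self-contained stand-in (plain Strings with their length as a proxy for {{kv.heapSize()}}, and a hypothetical {{accumulate}} helper), not the actual PutSortReducer code:

```java
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {

    // Simplified stand-in for the reduce() loop: Strings play the role of
    // KeyValues, string length plays the role of kv.heapSize().
    static long accumulate(List<String> cells, long threshold) {
        TreeSet<String> map = new TreeSet<>();
        long curSize = 0;
        Iterator<String> iter = cells.iterator();
        while (iter.hasNext() && curSize < threshold) {
            String kv = iter.next();
            // TreeSet.add returns false for a duplicate, so only count the
            // size of cells that were actually inserted; duplicates no longer
            // inflate curSize and trigger premature flushes (small hfiles).
            if (map.add(kv)) {
                curSize += kv.length();
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three copies of "a" plus one "bb": only 1 + 2 = 3 is counted.
        System.out.println(accumulate(List.of("a", "a", "a", "bb"), 1000));
    }
}
```

With the unconditional {{curSize += ...}} from the original loop, the same input would have been counted as 5 instead of 3, so a stream of duplicates could hit the threshold and roll a new hfile without contributing any new data.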