[ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141224#comment-15141224 ]
Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1169 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1169/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (apurtell: rev 38cd179bb540f0d38c5810a17097c5727947ca73)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java
HBASE-15171 Addendum removes extra loop (Yu Li) (apurtell: rev de149d0bc4eda960e7246c79a1ad85c9cbe50de0)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in
> PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>   Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0, 0.98.18
>
>         Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch,
> HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during
> bulkload, and we found that it generated lots of small hfiles and slowed
> down the whole process.
> After debugging, we found that although PutSortReducer#reduce already tries
> to handle the pathological case by setting a threshold for single-row size
> and using a TreeMap to avoid writing out duplicated kvs, it forgets to exclude
> duplicated kvs from the accumulated size.
> As shown in the code segment below:
> {code}
>     while (iter.hasNext() && curSize < threshold) {
>       Put p = iter.next();
>       for (List<Cell> cells: p.getFamilyCellMap().values()) {
>         for (Cell cell: cells) {
>           KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>           map.add(kv);
>           curSize += kv.heapSize();
>         }
>       }
>     }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for
> loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
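
The accounting problem described in the quoted issue can be illustrated without any HBase classes. The following is a minimal sketch, not the committed patch: it assumes that `map` behaves like a `java.util.TreeSet` (as the `map.add(kv)` call suggests), it uses plain strings in place of `KeyValue`, and the helper `sizeOf` is a hypothetical stand-in for `kv.heapSize()`. It compares counting every kv against counting a kv only when the sorted set did not already contain it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    /** Hypothetical stand-in for kv.heapSize(): here simply the string length. */
    public static long sizeOf(String kv) {
        return kv.length();
    }

    /** Mirrors the loop quoted in the issue: every kv is counted, duplicates included. */
    public static long countAll(List<String> kvs) {
        TreeSet<String> set = new TreeSet<>();
        long curSize = 0;
        for (String kv : kvs) {
            set.add(kv);
            curSize += sizeOf(kv);
        }
        return curSize;
    }

    /** Counts a kv only when it was actually inserted into the sorted set. */
    public static long countUnique(List<String> kvs) {
        TreeSet<String> set = new TreeSet<>();
        long curSize = 0;
        for (String kv : kvs) {
            if (set.add(kv)) { // TreeSet.add returns false for an element already present
                curSize += sizeOf(kv);
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Illustrative input only: three copies of one kv plus one distinct kv.
        List<String> kvs = Arrays.asList("row1/cf:a", "row1/cf:a", "row1/cf:b", "row1/cf:a");
        System.out.println("counting all:    " + countAll(kvs));
        System.out.println("counting unique: " + countUnique(kvs));
    }
}
```

With many duplicates, the first variant reaches any size threshold long before the set holds that much data, which is consistent with the early flushes and small hfiles reported in the issue.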