[ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120292#comment-15120292 ]
Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-1.3 #517 (See [https://builds.apache.org/job/HBase-1.3/517/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (tedyu: rev 630ad95c923f642d006274b9b1a14397a6713412)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during
> bulkload, and we found that this generated lots of small hfiles and slowed
> down the whole process.
> After debugging, we found that PutSortReducer#reduce already tries to handle
> the pathological case by setting a threshold for single-row size and by using
> a TreeMap to avoid writing out duplicated kvs, but it forgets to exclude
> duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for
> loop
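
The size-accounting problem in the quoted description can be reproduced without HBase. The following is a minimal sketch, not the patch attached to the issue: it assumes the collection behaves like a java.util.TreeSet, whose add() returns false for an element that is already present, and it adds to the running size only when add() succeeds. The class name DedupSizeSketch, the use of String in place of KeyValue, and sizeOf() as a stand-in for heapSize() are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    // Hypothetical stand-in for KeyValue#heapSize(): here just the string length.
    static long sizeOf(String kv) {
        return kv.length();
    }

    // Accounting as in the quoted snippet: every incoming entry is counted,
    // even when the sorted set already holds an equal entry.
    static long countAll(List<String> kvs, TreeSet<String> set) {
        long curSize = 0;
        for (String kv : kvs) {
            set.add(kv);
            curSize += sizeOf(kv);
        }
        return curSize;
    }

    // Alternative accounting: count only entries the set did not already hold.
    static long countUnique(List<String> kvs, TreeSet<String> set) {
        long curSize = 0;
        for (String kv : kvs) {
            if (set.add(kv)) { // TreeSet#add returns false for a duplicate
                curSize += sizeOf(kv);
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three copies of one entry plus one distinct entry.
        List<String> kvs = Arrays.asList("row1/cf:a", "row1/cf:a", "row1/cf:a", "row1/cf:b");
        System.out.println("all entries counted:    " + countAll(kvs, new TreeSet<>()));
        System.out.println("unique entries counted: " + countUnique(kvs, new TreeSet<>()));
    }
}
```

With duplicates counted, the running size reaches the threshold after far fewer distinct entries than the set actually holds, which matches the many small hfiles reported in the issue.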