[
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119865#comment-15119865
]
ramkrishna.s.vasudevan commented on HBASE-15171:
------------------------------------------------
Instead of iterating over the map again, can we just check the return value of
map.add(kv) and, if it is false, not add to curSize?
The add() javadoc says this:
{code}
add
public boolean add(E e)
Adds the specified element to this set if it is not already present. More
formally, adds the specified element e to this set if the set contains no
element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already
contains the element, the call leaves the set unchanged and returns false.
{code}
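A minimal, self-contained sketch of that suggestion, not the actual patch. It uses plain strings in place of KeyValue, and the class name DedupSizeSketch and the helpers sizeOf() and fill() are hypothetical names made up for illustration; sizeOf() stands in for kv.heapSize(). Note that a TreeSet decides "already present" with its ordering (comparator or compareTo) rather than equals().

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {
    // Hypothetical stand-in for KeyValue.heapSize(): here simply the string length.
    static long sizeOf(String kv) {
        return kv.length();
    }

    // Adds entries to the sorted set until the input is exhausted or the
    // accumulated size reaches the threshold. Duplicates are not counted.
    static long fill(Iterator<String> iter, TreeSet<String> map, long threshold) {
        long curSize = 0;
        while (iter.hasNext() && curSize < threshold) {
            String kv = iter.next();
            // add() returns false when an equal element is already in the set,
            // so only newly added entries contribute to curSize.
            if (map.add(kv)) {
                curSize += sizeOf(kv);
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("row1/a", "row1/a", "row1/b", "row1/a");
        TreeSet<String> map = new TreeSet<>();
        long curSize = fill(input.iterator(), map, 1024);
        System.out.println(map.size() + " unique entries, curSize = " + curSize);
    }
}
```

With this shape there is no second pass over the set, and a flood of duplicates no longer pushes curSize past the threshold early.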
> Avoid counting duplicate kv and generating lots of small hfiles in
> PutSortReducer
> ---------------------------------------------------------------------------------
>
> Key: HBASE-15171
> URL: https://issues.apache.org/jira/browse/HBASE-15171
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0, 1.1.2, 0.98.17
> Reporter: Yu Li
> Assignee: Yu Li
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during
> bulkload, and we found that it generated lots of small hfiles and slowed
> down the whole process.
> After debugging, we found that PutSortReducer#reduce, although it already tries
> to handle the pathological case by setting a threshold for single-row size
> and using a TreeMap to avoid writing out duplicated kvs, forgets to exclude
> duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for
> loop