[ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122196#comment-15122196 ]
Hudson commented on HBASE-15171:
--------------------------------

FAILURE: Integrated in HBase-Trunk_matrix #665 (See [https://builds.apache.org/job/HBase-Trunk_matrix/665/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev 37ed0f6d0815389e0b368bc98b3a01dd02f193ac)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java

> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated KVs during bulkload, and we found this generated lots of small hfiles and slowed down the whole process.
> After debugging, we found that in PutSortReducer#reduce, although it already tries to handle the pathological case by setting a threshold on single-row size and keeping a TreeMap to avoid writing out duplicated KVs, it forgets to exclude those duplicated KVs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells : p.getFamilyCellMap().values()) {
>     for (Cell cell : cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for loop.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
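The accounting bug above can be sketched in isolation: a sorted-set `add` returns {{false}} for an element already present, so the heap size should only be accumulated when the insert actually happened. The snippet below is a minimal, self-contained stand-in (plain Strings with their length as a proxy for {{kv.heapSize()}}, and a hypothetical {{accumulate}} helper), not the actual PutSortReducer code:

```java
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DedupSizeSketch {

    // Simplified stand-in for the reduce() loop: Strings play the role of
    // KeyValues, string length plays the role of kv.heapSize().
    static long accumulate(List<String> cells, long threshold) {
        TreeSet<String> map = new TreeSet<>();
        long curSize = 0;
        Iterator<String> iter = cells.iterator();
        while (iter.hasNext() && curSize < threshold) {
            String kv = iter.next();
            // TreeSet.add returns false for a duplicate, so only count the
            // size of cells that were actually inserted; duplicates no longer
            // inflate curSize and trigger premature flushes (small hfiles).
            if (map.add(kv)) {
                curSize += kv.length();
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        // Three copies of "a" plus one "bb": only 1 + 2 = 3 is counted.
        System.out.println(accumulate(List.of("a", "a", "a", "bb"), 1000));
    }
}
```

With the unconditional {{curSize += ...}} from the original loop, the same input would have been counted as 5 instead of 3, so a stream of duplicates could hit the threshold and roll a new hfile without contributing any new data.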