[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-15171:
---------------------------
    Summary: Avoid counting duplicate kv and generating lots of small hfiles in 
PutSortReducer  (was: Avoid counting duplicated kv and generating lots of small 
hfiles in PutSortReducer)

> Avoid counting duplicate kv and generating lots of small hfiles in 
> PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> Once there was one of our online user writing huge number of duplicated kvs 
> during bulkload, and we found it generated lots of small hfiles and slows 
> down the whole process.
> After debugging, we found in PutSortReducer#reduce, although it already tried 
> to handle the pathological case by setting a threshold for single-row size 
> and having a TreeMap to avoid writing out duplicated kv, it forgot to exclude 
> duplicated kv from the accumulated size. As shown in below code segment:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for 
> loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to