[ 
https://issues.apache.org/jira/browse/HBASE-26398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth updated HBASE-26398:
--------------------------------
    Description: 
CellCounter dumps all cell coordinates into its output, which can become huge.

The spill can fill the local disk on the reducer. 
CellCounter hardcodes *mapreduce.job.reduces* to *1*, so it is not possible to 
use multiple reducers to get around this.

Fixing this is easy: if *mapreduce.job.reduces* is not hardcoded, it still 
defaults to 1, but can be overridden by the user. 

CellCounter also generates two extra records with constant keys for each cell, 
which have to be processed by the reducer.
Even with multiple reducers, these (1/3 of the total records) will go to the 
same reducer, which can also fill up the disk.

This can be fixed by adding a Combiner to the Mapper, which sums the counter 
records, thereby reducing the Mapper output records to 1/3 of their previous 
amount, which can be evenly distributed between the reducers.

  was:
CellCounter dumps all cell coordinates into its output, which can become huge.

The spill can fill the local disk on the reducer. 
CellCounter hardcodes *mapreduce.job.reduces* to *1*, so it is not possible to 
use multiple reducers to get around this.

Fixing this is easy: if *mapreduce.job.reduces* is not hardcoded, it still 
defaults to 1, but can be overridden by the user. 

CellCounter also generates two extra records with constant keys for each cell, 
which have to be processed by the reducer.
Even with multiple reducers, these (1/3 of the total records) will go to the 
same reducer, which can also fill up the disk.

This can be fixed by adding a Combiner to the Mapper, which sums the counter 
records, thereby reducing the Mapper output records to 1/3 of their previous 
amount.


> CellCounter fails for large tables filling up local disk
> --------------------------------------------------------
>
>                 Key: HBASE-26398
>                 URL: https://issues.apache.org/jira/browse/HBASE-26398
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: Istvan Toth
>            Assignee: Istvan Toth
>            Priority: Minor
>
> CellCounter dumps all cell coordinates into its output, which can become huge.
> The spill can fill the local disk on the reducer. 
> CellCounter hardcodes *mapreduce.job.reduces* to *1*, so it is not possible 
> to use multiple reducers to get around this.
> Fixing this is easy: if *mapreduce.job.reduces* is not hardcoded, it still 
> defaults to 1, but can be overridden by the user. 
> CellCounter also generates two extra records with constant keys for each 
> cell, which have to be processed by the reducer.
> Even with multiple reducers, these (1/3 of the total records) will go to the 
> same reducer, which can also fill up the disk.
> This can be fixed by adding a Combiner to the Mapper, which sums the counter 
> records, thereby reducing the Mapper output records to 1/3 of their previous 
> amount, which can be evenly distributed between the reducers.
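
To illustrate the Combiner idea described above, here is a minimal plain-Java sketch. It uses no Hadoop or HBase classes and is not the actual patch. The class name CombinerSketch, the two counter key names, and the coordinate strings are all hypothetical; it also assumes each cell produces one distinct coordinate key. The partition method mirrors the hash-modulo rule that Hadoop's default HashPartitioner applies, which is why every record with a constant key ends up on the same reducer.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a summing combiner; no Hadoop or HBase classes are used.
class CombinerSketch {

    // Hypothetical constant counter keys; the real CellCounter key names may differ.
    static final String TOTAL_A = "counter-A";
    static final String TOTAL_B = "counter-B";

    /** Models one mapper: one coordinate record plus two constant-key records per cell. */
    static List<Map.Entry<String, Long>> map(List<String> cellCoordinates) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String coordinate : cellCoordinates) {
            out.add(new SimpleImmutableEntry<>(coordinate, 1L));
            out.add(new SimpleImmutableEntry<>(TOTAL_A, 1L));
            out.add(new SimpleImmutableEntry<>(TOTAL_B, 1L));
        }
        return out;
    }

    /** Models the combiner: sums the values per key within one mapper's output. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> summed = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : records) {
            summed.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return summed;
    }

    /** Hash-modulo partitioning: a given key always maps to the same reducer index. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("row1;cf:q1", "row1;cf:q2", "row2;cf:q1", "row2;cf:q2");
        List<Map.Entry<String, Long>> raw = map(cells);
        Map<String, Long> combined = combine(raw);
        System.out.println("records before combiner: " + raw.size());
        System.out.println("records after combiner:  " + combined.size());
        System.out.println("reducer for " + TOTAL_A + ": " + partition(TOTAL_A, 4));
    }
}
```

Under these assumptions, a map task that sees N distinct cells emits 3N records without the combiner and N + 2 with it: the N coordinate records are unchanged, while the 2N constant-key records collapse into two summed records. For large N that is roughly 1/3 of the original volume, and only two small records per map task still go to the reducers that own the constant keys.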



