[ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Cang reassigned PHOENIX-5774:
--------------------------------

    Assignee: Xu Cang

> Phoenix Mapreduce job over hbase snapshots is extremely inefficient.
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-5774
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5774
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.13.1
>            Reporter: Rushabh Shah
>            Assignee: Xu Cang
>            Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number
> of rows each tenant occupies in the cluster. The framework launches a
> MapReduce (MR) job per table which runs the query "Select tenant_id from
> <table-name>" and counts the rows per tenant_id in the reducer phase.
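> As a rough illustration (not the actual framework code; the table name, class
> names and output path below are placeholders), the per-table job is set up
> roughly like this minimal sketch using PhoenixMapReduceUtil:
> {code:java}
> import java.io.IOException;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.sql.SQLException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.db.DBWritable;
> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
> import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil;
>
> public class TenantRowCountJob {
>
>   // Maps the single TENANT_ID column returned by the query.
>   public static class TenantIdWritable implements DBWritable {
>     private String tenantId;
>     String getTenantId() { return tenantId; }
>     @Override public void readFields(ResultSet rs) throws SQLException {
>       tenantId = rs.getString("TENANT_ID");
>     }
>     @Override public void write(PreparedStatement ps) throws SQLException {
>       ps.setString(1, tenantId); // unused: this job only reads
>     }
>   }
>
>   // Emit (tenant_id, 1) for every row returned by the Phoenix query.
>   public static class TenantMapper
>       extends Mapper<NullWritable, TenantIdWritable, Text, LongWritable> {
>     private static final LongWritable ONE = new LongWritable(1);
>     @Override protected void map(NullWritable key, TenantIdWritable value,
>         Context ctx) throws IOException, InterruptedException {
>       ctx.write(new Text(value.getTenantId()), ONE);
>     }
>   }
>
>   // Sum the row counts per tenant in the reducer phase.
>   public static class CountReducer
>       extends Reducer<Text, LongWritable, Text, LongWritable> {
>     @Override protected void reduce(Text tenant, Iterable<LongWritable> counts,
>         Context ctx) throws IOException, InterruptedException {
>       long total = 0;
>       for (LongWritable c : counts) total += c.get();
>       ctx.write(tenant, new LongWritable(total));
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     Job job = Job.getInstance(conf, "tenant-row-count");
>     job.setJarByClass(TenantRowCountJob.class);
>     // Per-table query against the live table.
>     PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class,
>         "MY_TABLE", "SELECT TENANT_ID FROM MY_TABLE");
>     job.setMapperClass(TenantMapper.class);
>     job.setReducerClass(CountReducer.class);
>     job.setMapOutputKeyClass(Text.class);
>     job.setMapOutputValueClass(LongWritable.class);
>     job.setOutputKeyClass(Text.class);
>     job.setOutputValueClass(LongWritable.class);
>     job.setOutputFormatClass(TextOutputFormat.class);
>     TextOutputFormat.setOutputPath(job, new Path(args[0]));
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }
> {code}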
>  Earlier we used to run this query against the live table, but we found that
> the meta table was getting hammered for as long as the job was running, so we
> decided to run the MR job over HBase snapshots instead of the live table,
> taking advantage of this feature:
> https://issues.apache.org/jira/browse/PHOENIX-3744
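> With that feature, the only change on our side was to point the same job at a
> snapshot instead of the live table, roughly as below (the snapshot name and
> restore directory are placeholders, and the exact setInput overload may
> differ between Phoenix versions):
> {code:java}
> // Read from an HBase snapshot of MY_TABLE instead of the live table; the
> // snapshot is restored under the given directory and its HFiles are scanned
> // directly, bypassing the region servers.
> PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class,
>     "MY_TABLE_SNAPSHOT", "MY_TABLE",
>     new Path("/tmp/phoenix-snapshot-restore"),
>     "SELECT TENANT_ID FROM MY_TABLE");
> {code}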
> When we were querying the live table, the MR job for one of the biggest
> tables in the sandbox cluster took around 2.5 hours.
>  After we started using HBase snapshots, the MR job for the same table took
> 135 hours. We cap the number of concurrently running mappers at 15 to avoid
> hammering the meta table when querying live tables, and we did not remove
> that restriction after moving to HBase snapshots, so without it the job would
> presumably not take the full 135 hours to complete.
> Some statistics about that table:
>  Size: -578 GB- 2.70 TB, number of regions in that table: -161- 670
> The average map task took 3 minutes 11 seconds when querying the live table.
>  The average map task took 5 hours 33 minutes when querying HBase snapshots.
> The issue is that we don't take the snapshot's regions into account while
> generating splits. So during the map phase, each map task has to go through
> all regions in the snapshot to determine which regions cover the start and
> end key assigned to that task. After determining those regions, it has to
> open each region and scan all of the HFiles in it. In one such map task, the
> split's start and end key were spread across 289 regions (from the snapshot,
> not the live table). Reading each region took an average of 90 seconds, so
> 289 regions took approximately 7 hours.
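> Conceptually (this is only a sketch of the per-task work described above, not
> the actual Phoenix code), every map task ends up doing something like the
> following against the full region list of the snapshot:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.hbase.HRegionInfo;
> import org.apache.hadoop.hbase.util.Bytes;
>
> // Sketch only (HBase 1.x API): find which snapshot regions overlap a split's
> // key range. Every map task walks the FULL region list to find out.
> public final class SnapshotRegionLookup {
>
>   // True if [startA, endA) overlaps [startB, endB); an empty end key means
>   // the range is unbounded on that side.
>   static boolean overlaps(byte[] startA, byte[] endA,
>       byte[] startB, byte[] endB) {
>     boolean aEndsBeforeB = endA.length != 0 && Bytes.compareTo(endA, startB) <= 0;
>     boolean bEndsBeforeA = endB.length != 0 && Bytes.compareTo(endB, startA) <= 0;
>     return !aEndsBeforeB && !bEndsBeforeA;
>   }
>
>   static List<HRegionInfo> regionsForSplit(List<HRegionInfo> snapshotRegions,
>       byte[] splitStart, byte[] splitEnd) {
>     List<HRegionInfo> overlapping = new ArrayList<>();
>     for (HRegionInfo region : snapshotRegions) { // scans ALL snapshot regions
>       if (overlaps(region.getStartKey(), region.getEndKey(),
>           splitStart, splitEnd)) {
>         overlapping.add(region);
>       }
>     }
>     // Each overlapping region is then opened and all of its HFiles scanned
>     // (~90 seconds per region above), so 289 regions is roughly 7 hours.
>     return overlapping;
>   }
> }
> {code}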



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
