[ https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xu Cang reassigned PHOENIX-5774: -------------------------------- Assignee: Xu Cang > Phoenix Mapreduce job over hbase snapshots is extremely inefficient. > -------------------------------------------------------------------- > > Key: PHOENIX-5774 > URL: https://issues.apache.org/jira/browse/PHOENIX-5774 > Project: Phoenix > Issue Type: Bug > Affects Versions: 4.13.1 > Reporter: Rushabh Shah > Assignee: Xu Cang > Priority: Major > > Internally we have tenant estimation framework which calculates the number of > rows each tenant occupy in the cluster. Basically what the framework does is > it launch MapReduce(MR) job per table and run the following query : "Select > tenant_id from <table-name>" and we do count over this tenant_id in reducer > phase. > Earlier we use to run this query against live table but we found meta table > was getting hammered over the time this job was running so we thought to run > the MR job on hbase snapshots instead of live table. Take advantage of this > feature: https://issues.apache.org/jira/browse/PHOENIX-3744 > When we were querying live table, the MR job for one of the biggest table in > sandbox cluster took around 2.5 hours. > After we started using hbase snapshots, the MR job for the same table took > 135 hours. We have maximum concurrent running mapper limit to 15 to avoid > hammering meta table when we were querying live tables. We didn't remove that > restriction after we moved to hbase snapshots.So ideally it shouldn't take > 135 hours to complete if we don't have that restriction. > Some statistics about that table: > Size: -578 GB- 2.70 TB, Num Regions in that table: -161- 670 > The average map time took 3 mins 11 seconds when querying live table. > The average map time took 5 hours 33 minutes when querying hbase snapshots. > The issue is we don't consider snapshots while generating splits. So during > map phase, each map task has to go through all regions in snapshots to > determine which region has the start and end key assigned to that task. After > determining all regions, it has to open each region to scan all hfiles in > that region. In one such map task, the start and end key from split was > distributed among 289 regions(from snapshot not live table). Reading from > each region took an average of 90 seconds, so for 289 regions it took > approximately 7 hours. -- This message was sent by Atlassian Jira (v8.3.4#803005)