[
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rushabh Shah updated PHOENIX-5774:
----------------------------------
Summary: Phoenix Mapreduce job over hbase snapshots is extremely
inefficient. (was: Phoenix Mapreduce job over hbase Snapshots is extremely
inefficient.)
> Phoenix Mapreduce job over hbase snapshots is extremely inefficient.
> --------------------------------------------------------------------
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.13.1
> Reporter: Rushabh Shah
> Priority: Major
>
> Internally we have a tenant estimation framework which calculates the number of
> rows each tenant occupies in the cluster. The framework launches a MapReduce (MR)
> job per table which runs the following query: "Select tenant_id from <table-name>",
> and the reducer phase counts the rows per tenant_id.
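> A rough sketch of how such a job is typically wired up with Phoenix's MapReduce
> integration (the class names, table name "MY_TABLE" and output path below are
> placeholders for illustration, not our actual framework code):
> {code:java}
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.sql.SQLException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.db.DBWritable;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
> import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil;
>
> public class TenantRowCountJob {
>
>     // Value class Phoenix hands to the mapper; only TENANT_ID is projected.
>     public static class TenantIdWritable implements DBWritable, Writable {
>         String tenantId;
>         public void readFields(ResultSet rs) throws SQLException { tenantId = rs.getString("TENANT_ID"); }
>         public void write(PreparedStatement ps) throws SQLException { /* read-only job */ }
>         public void readFields(DataInput in) throws IOException { tenantId = in.readUTF(); }
>         public void write(DataOutput out) throws IOException { out.writeUTF(tenantId == null ? "" : tenantId); }
>     }
>
>     // Map: emit (tenant_id, 1) for every row the query returns.
>     public static class TenantMapper
>             extends Mapper<NullWritable, TenantIdWritable, Text, LongWritable> {
>         private static final LongWritable ONE = new LongWritable(1);
>         @Override
>         protected void map(NullWritable key, TenantIdWritable value, Context ctx)
>                 throws IOException, InterruptedException {
>             ctx.write(new Text(value.tenantId == null ? "" : value.tenantId), ONE);
>         }
>     }
>
>     // Reduce: sum the per-tenant counts.
>     public static class TenantReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
>         @Override
>         protected void reduce(Text tenantId, Iterable<LongWritable> counts, Context ctx)
>                 throws IOException, InterruptedException {
>             long total = 0;
>             for (LongWritable c : counts) {
>                 total += c.get();
>             }
>             ctx.write(tenantId, new LongWritable(total));
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         Job job = Job.getInstance(conf, "tenant-row-count");
>         job.setJarByClass(TenantRowCountJob.class);
>         // Reads directly from the live table through Phoenix.
>         PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class, "MY_TABLE",
>                 "SELECT TENANT_ID FROM MY_TABLE");
>         job.setMapperClass(TenantMapper.class);
>         job.setReducerClass(TenantReducer.class);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(LongWritable.class);
>         job.setOutputKeyClass(Text.class);
>         job.setOutputValueClass(LongWritable.class);
>         job.setOutputFormatClass(TextOutputFormat.class);
>         FileOutputFormat.setOutputPath(job, new Path(args[0]));
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }
> {code}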
> Earlier we used to run this query against the live table, but we found that the
> meta table was getting hammered for the whole time this job was running, so we
> decided to run the MR job on HBase snapshots instead of the live table, taking
> advantage of this feature: https://issues.apache.org/jira/browse/PHOENIX-3744
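> The switch is essentially replacing the live-table input above with the
> snapshot-based setInput overload that PHOENIX-3744 added to PhoenixMapReduceUtil
> (signature reproduced from memory and may differ slightly; the snapshot name and
> restore directory below are placeholders):
> {code:java}
> // Before: read the live table through Phoenix.
> PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class, "MY_TABLE",
>         "SELECT TENANT_ID FROM MY_TABLE");
>
> // After: read an HBase snapshot of the table, restored into a scratch dir on HDFS,
> // so the query no longer goes through the region servers.
> PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class,
>         "MY_TABLE_SNAPSHOT",                        // snapshot name (placeholder)
>         "MY_TABLE",                                 // source table
>         new Path("/tmp/phoenix-snapshot-restore"),  // restore dir (placeholder)
>         "SELECT TENANT_ID FROM MY_TABLE");
> {code}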
> When we were querying the live table, the MR job for one of the biggest tables in
> our sandbox cluster took around 2.5 hours.
> After we started using HBase snapshots, the MR job for the same table took
> 135 hours. We cap the number of concurrently running mappers at 15 to avoid
> hammering the meta table when querying live tables, and we kept that restriction
> after moving to HBase snapshots, so without it the job presumably wouldn't need
> 135 hours to complete.
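> For reference, the 15-mapper cap is just the standard Hadoop limit on
> concurrently running map tasks, set on the job (property name as we understand it
> for Hadoop 2.7+):
> {code:java}
> // Cap concurrently running map tasks so the meta table is not hammered.
> // 0 (the default) means no limit.
> job.getConfiguration().setInt("mapreduce.job.running.map.limit", 15);
> {code}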
> Some statistics about that table:
> Size: -578 GB- 2.70 TB, Number of regions in that table: -161- 670
> The average map task took 3 minutes 11 seconds when querying the live table.
> The average map task took 5 hours 33 minutes when querying HBase snapshots.
> The issue is that we don't take the snapshot's regions into account while
> generating splits. During the map phase, each map task has to go through all the
> regions in the snapshot to determine which regions cover the start and end keys
> assigned to that task. After determining those regions, it has to open each one
> and scan all the HFiles in it. In one such map task, the split's start and end
> keys spanned 289 regions (from the snapshot, not the live table). Reading each
> region took an average of 90 seconds, so 289 regions took approximately 7 hours.
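> To make the cost concrete, here is a hypothetical sketch (not Phoenix source) of
> the per-task region lookup as we understand it: the split's key range has to be
> compared against every region in the snapshot, and every overlapping region is
> then opened and scanned in full. Generating splits from the snapshot's own region
> boundaries would avoid this, since each task would then open roughly one region
> instead of hundreds.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.hbase.HRegionInfo;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class SnapshotSplitOverlap {
>     // Returns every snapshot region whose key range overlaps [splitStart, splitEnd).
>     // An empty byte[] means "unbounded", as with HBase region boundaries.
>     static List<HRegionInfo> regionsOverlappingSplit(List<HRegionInfo> snapshotRegions,
>                                                      byte[] splitStart, byte[] splitEnd) {
>         List<HRegionInfo> overlapping = new ArrayList<HRegionInfo>();
>         for (HRegionInfo region : snapshotRegions) {           // every snapshot region is checked
>             boolean startsBeforeSplitEnd = splitEnd.length == 0
>                     || Bytes.compareTo(region.getStartKey(), splitEnd) < 0;
>             boolean endsAfterSplitStart = region.getEndKey().length == 0
>                     || Bytes.compareTo(region.getEndKey(), splitStart) > 0;
>             if (startsBeforeSplitEnd && endsAfterSplitStart) {
>                 overlapping.add(region);
>             }
>         }
>         // In one task we saw 289 overlapping regions; opening and scanning each one
>         // averaged ~90 s, i.e. 289 * 90 s ≈ 7.2 hours for a single map task.
>         return overlapping;
>     }
> }
> {code}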
--
This message was sent by Atlassian Jira
(v8.3.4#803005)