[ 
https://issues.apache.org/jira/browse/PHOENIX-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657537#comment-16657537
 ] 

Bin Shi edited comment on PHOENIX-4009 at 10/19/18 11:08 PM:
-------------------------------------------------------------

We might not need to synchronize MR jobs with "UPDATE STATISTICS ..." SQL 
commands (or even Update Statistics MR jobs with one another) through 
SYSTEM.MUTEX or a distributed lock via ZooKeeper. The reasons are:
 # Locking would add many failure scenarios and corner cases to handle. 
 # Once we use an MR job to update statistics, running the 
"UPDATE STATISTICS ..." SQL command shouldn't be that frequent, so the chance 
of both running simultaneously is low.
 # Since updating the SYSTEM.STATS table for a region happens atomically, it's 
OK for them to run simultaneously.
 # The MR job doesn't run in the region server's process space and its 
resources are controlled by YARN, so it should be OK for them to run 
simultaneously. 

Beyond the above, we might need to consider:
 # Running the MR job for a specific tenant or key range.
 # Skipping recently updated regions. If the time since the last update is 
less than a certain threshold, the job doesn't need to update that region's 
stats again. For example, if one MR job fails after some regions have already 
been updated, the rerun job only needs to update the regions that the first 
job didn't update.
 # Data locality.
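
The first two considerations above (scoping a job to a key range and skipping 
regions whose stats are still fresh) can be sketched roughly as follows. This 
is only an illustration of the selection logic, not Phoenix code: the 
RegionStats record, its fields, and the regions_to_update function are all 
hypothetical names, and the sketch assumes a last-update timestamp is 
available per region.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RegionStats:
    """Hypothetical per-region view of stats metadata (not a Phoenix type)."""
    region_name: str
    start_key: bytes                 # b"" means "from the start of the table"
    end_key: bytes                   # b"" means "to the end of the table"
    last_updated_ms: Optional[int]   # None if stats were never collected


def overlaps(region: RegionStats, range_start: bytes, range_end: bytes) -> bool:
    """True if [start_key, end_key) intersects [range_start, range_end)."""
    starts_before_range_end = range_end == b"" or region.start_key < range_end
    ends_after_range_start = region.end_key == b"" or region.end_key > range_start
    return starts_before_range_end and ends_after_range_start


def regions_to_update(
    regions: Iterable[RegionStats],
    now_ms: int,
    min_interval_ms: int,
    range_start: bytes = b"",
    range_end: bytes = b"",
) -> List[str]:
    """Pick regions in the key range whose stats are missing or stale.

    A region is skipped when its stats were updated less than
    min_interval_ms ago, so a rerun after a partial failure only
    touches the regions the earlier run did not reach.
    """
    selected = []
    for region in regions:
        if not overlaps(region, range_start, range_end):
            continue
        is_fresh = (
            region.last_updated_ms is not None
            and now_ms - region.last_updated_ms < min_interval_ms
        )
        if not is_fresh:
            selected.append(region.region_name)
    return selected


if __name__ == "__main__":
    # The first run updated r1 and then failed; r2 was never updated, r3 is old.
    regions = [
        RegionStats("r1", b"", b"g", last_updated_ms=900),
        RegionStats("r2", b"g", b"p", last_updated_ms=None),
        RegionStats("r3", b"p", b"", last_updated_ms=100),
    ]
    print(regions_to_update(regions, now_ms=1000, min_interval_ms=500))
```

With a threshold in place, a rerun needs no record of which regions the 
failed job finished; the timestamps alone decide what is left to do.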

 



> Run UPDATE STATISTICS command by using MR integration on snapshots
> ------------------------------------------------------------------
>
>                 Key: PHOENIX-4009
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4009
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Samarth Jain
>            Priority: Major
>
> Now that we have the capability to run queries against table snapshots 
> through our map reduce integration, we can utilize this capability for stats 
> collection too. This would make our stats collection more resilient, resource 
> aware and less resource intensive. The bulk of the plumbing is already in 
> place. We would need to make sure that the integration doesn't barf when the 
> query is an UPDATE STATISTICS command.



