[jira] [Commented] (PHOENIX-4009) Run UPDATE STATISTICS command by using MR integration on snapshots

Karan Mehta (JIRA) Mon, 22 Oct 2018 10:27:30 -0700


    [ 
https://issues.apache.org/jira/browse/PHOENIX-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659195#comment-16659195
 ]


Karan Mehta commented on PHOENIX-4009:
--------------------------------------

{quote}There are so many failure scenarios and corner cases to handle. Once we 
use MR job to update statistics, using "UPDATE STATISTICS ..." sql command to 
update statistics shouldn't be that frequent, so the chance of both running 
simultaneously is low.
{quote}
Yes, a distributed lock using ZK is overkill here. I suggested the 
SYSTEM.MUTEX approach since it's a simple way to do locking in such cases. 

Just a brief background on why SYSTEM.MUTEX was introduced. Phoenix upgrades 
the SYSTEM.CATALOG table whenever a client with a newer version connects to a 
server with a newer version. Since it is a small one-time task, there was never 
a need for any locking here. However, when multiple servers were upgraded in 
prod at the same time, we hit this race condition, which corrupted the table. 
Such a small thing caused issues in prod.
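
To make the locking idea concrete, here is a minimal in-memory sketch. It 
assumes the lock boils down to an atomic insert-if-absent on a row keyed by the 
resource being protected. The class and method names are hypothetical, and a 
ConcurrentHashMap stands in for the table; this is not Phoenix's actual 
SYSTEM.MUTEX code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a mutex table; NOT Phoenix's implementation.
class MutexTableSketch {
    // Maps a lock key (e.g. a table name) to the id of the current holder.
    private final ConcurrentMap<String, String> rows = new ConcurrentHashMap<>();

    // Atomically claims the lock row. Only one concurrent caller gets true.
    public boolean tryAcquire(String lockKey, String ownerId) {
        return rows.putIfAbsent(lockKey, ownerId) == null;
    }

    // Releases the lock only if it is still held by ownerId.
    public boolean release(String lockKey, String ownerId) {
        return rows.remove(lockKey, ownerId);
    }
}
```

A caller that fails tryAcquire would simply skip the upgrade (or retry later) 
instead of racing with the holder.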

However, I thought this through. My primary motivation for introducing this was 
that the data should not be mutated while statistics are being collected. That 
is the reason why we currently check whether the region is undergoing major 
compaction when stats are collected via a regular update statistics command. 
This concern does not apply in this case, since snapshots are immutable. As you 
correctly pointed out, we should add the logic to update the SYSTEM.STATS table 
only if the current statistics are not stale or are within a certain threshold.
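
One possible reading of that check, as a rough sketch: compare the time the 
snapshot was taken with the time the existing stats were last written, and 
write only when the snapshot is not older than the existing stats by more than 
some tolerance. The class name, method name, parameters and the 
millisecond-timestamp representation are all assumptions for illustration, not 
existing Phoenix code.

```java
// Hypothetical guard for the stats write; names and semantics are assumptions.
class StatsWriteGuardSketch {
    // existingStatsTs: when the stored stats were last written (null if none).
    // snapshotTs: when the snapshot used by the MR job was taken.
    // toleranceMs: how much older than the stored stats the snapshot may be.
    public static boolean shouldWrite(Long existingStatsTs, long snapshotTs, long toleranceMs) {
        if (existingStatsTs == null) {
            return true; // nothing stored yet, so any stats are an improvement
        }
        return snapshotTs >= existingStatsTs - toleranceMs;
    }
}
```

With a tolerance of zero, this only lets the job write when the snapshot is at 
least as recent as the stored stats.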
{quote}Run MR job for a specific tenant or key range
{quote}
This would be handled by the update statistics sql statement. The query plan 
should generate the appropriate scans with row key filters based on tenant-id. 
If the snapshot scanner is used, it should automatically filter out rows here. 
The only requirement that should be enforced is that the tenant id is the 
leading part of the row key (equivalent to saying it's a multi-tenant table) 
and that the key range contains either complete row keys or prefixed ones. Do 
we have a use case for the latter?
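
To illustrate why a leading tenant id matters, here is a small sketch of how a 
key prefix turns into a contiguous scan range: the start key is the prefix 
itself, and the exclusive stop key is the prefix with its last incrementable 
byte bumped by one. The class and method names are hypothetical, and the sketch 
ignores any details of Phoenix's actual row key encoding.

```java
import java.util.Arrays;

// Hypothetical helper; illustrates prefix-to-range conversion on raw bytes only.
class TenantKeyRangeSketch {
    // Returns the smallest key that sorts after every key starting with prefix.
    // An empty array means "no upper bound" (the prefix was empty or all 0xFF).
    public static byte[] exclusiveStopKey(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // bump the last remaining byte
        return stop;
    }
}
```

If the tenant id were not the leading part of the row key, the matching rows 
would be scattered across the key space and no single start/stop pair could 
cover them.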
{quote}Data Locality
{quote}
I am not sure if we can control this directly. We can leave it to the HDFS and 
MR layers, since they already optimize for this.

> Run UPDATE STATISTICS command by using MR integration on snapshots
> ------------------------------------------------------------------
>
>                 Key: PHOENIX-4009
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4009
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Samarth Jain
>            Priority: Major
>
> Now that we have the capability to run queries against table snapshots 
> through our map reduce integration, we can utilize this capability for stats 
> collection too. This would make our stats collection more resilient, resource 
> aware and less resource intensive. The bulk of the plumbing is already in 
> place. We would need to make sure that the integration doesn't barf when the 
> query is an UPDATE STATISTICS command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
