[
https://issues.apache.org/jira/browse/PHOENIX-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659195#comment-16659195
]
Karan Mehta commented on PHOENIX-4009:
--------------------------------------
{quote}There are so many failure scenarios and corner cases to handle. Once we
use MR job to update statistics, using "UPDATE STATISTICS ..." sql command to
update statistics shouldn't be that frequent, so the chance of both running
simultaneously is low.
{quote}
Yes, using a distributed lock via ZK is overkill here. I suggested the
SYSTEM.MUTEX approach since it's a simple way to do locking in such cases.
Just a brief background on why SYSTEM.MUTEX was introduced: Phoenix upgrades
the SYSTEM.CATALOG table whenever a client with a newer version connects to a
server with a newer version. Since it is a small, one-time task, there was
never a need for any locking here. However, when multiple servers were upgraded
in prod at the same time, we hit this race condition, which corrupted the
table. Such a small thing caused issues in prod.
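A minimal sketch of the locking idea described above, assuming the mutex is
just a row that can only be created atomically. The class and method names are
hypothetical, and an in-memory map stands in for the SYSTEM.MUTEX table; a real
implementation would need an atomic server-side check-and-put instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a SYSTEM.MUTEX-style lock table.
// ConcurrentMap.putIfAbsent plays the role of an atomic check-and-put.
final class MutexTableSketch {
    private final ConcurrentMap<String, String> rows = new ConcurrentHashMap<>();

    /** Returns true if the caller created the lock row, i.e. acquired the lock. */
    boolean tryAcquire(String lockKey, String owner) {
        return rows.putIfAbsent(lockKey, owner) == null;
    }

    /** Removes the lock row only if it is still held by the given owner. */
    boolean release(String lockKey, String owner) {
        return rows.remove(lockKey, owner);
    }
}
```

Only one of several concurrent callers gets `true` from `tryAcquire` for the
same key, which is what would prevent two clients from running the upgrade at
once.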
However, I thought this through. My primary motivation for introducing this was
that the data should not be mutated while statistics are being collected. That
is the reason why we currently check whether the region is undergoing major
compaction when stats are collected via a regular UPDATE STATISTICS command.
This situation is not applicable in this case since snapshots are
immutable. As you correctly pointed out, we should add the logic to update the
SYSTEM.STATS table only if the current statistics are not stale or within a
certain threshold.
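The threshold condition can be read in more than one way. The sketch below
assumes one reading: the job writes to SYSTEM.STATS only when the statistics
already stored there are older than a configured maximum age. The class name,
method name and parameters are hypothetical.

```java
// Hypothetical helper: decides whether existing statistics are old enough
// to be overwritten. All timestamps are in milliseconds since the epoch.
final class StatsFreshnessSketch {
    /**
     * Returns true when the stored statistics are older than maxAgeMillis
     * and should therefore be replaced; false when they are still fresh.
     */
    static boolean shouldUpdateStats(long lastUpdatedMillis, long nowMillis, long maxAgeMillis) {
        return nowMillis - lastUpdatedMillis > maxAgeMillis;
    }
}
```

Under this reading, a job that finishes shortly after a regular update
statistics run would skip the write instead of overwriting fresher statistics.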
{quote}Run MR job for a specific tenant or key range
{quote}
This would be handled by the UPDATE STATISTICS SQL statement. The query plan
should generate the appropriate scans with row key filters based on tenant-id.
If the snapshot scanner is used, it should automatically filter out rows here.
The only requirement that should be enforced is that the tenant id is the
leading part of the row key (equivalent to saying it's a multi-tenant table)
and that the key range contains either complete row keys or prefixes. Do we
have a use case for the latter?
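To illustrate why the tenant id has to lead the row key, here is a minimal
sketch of turning a tenant-id prefix into a half-open key range
[prefix, stop). The class and method names are hypothetical, and the byte
layout is simplified: it ignores any separator byte that follows a
variable-length tenant id.

```java
import java.util.Arrays;

// Hypothetical helper: maps a row-key prefix to a scan range [prefix, stop).
final class TenantKeyRangeSketch {
    /**
     * Smallest key strictly greater than every key that starts with prefix,
     * or null if no such key exists (the prefix is empty or all 0xFF bytes).
     */
    static byte[] exclusiveStopKey(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return null; // unbounded above
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    /** True if rowKey falls inside the range derived from tenantPrefix. */
    static boolean inTenantRange(byte[] rowKey, byte[] tenantPrefix) {
        byte[] stop = exclusiveStopKey(tenantPrefix);
        // Unsigned lexicographic comparison matches row-key sort order.
        return Arrays.compareUnsigned(rowKey, tenantPrefix) >= 0
                && (stop == null || Arrays.compareUnsigned(rowKey, stop) < 0);
    }
}
```

Because the range is contiguous only when the tenant id leads the key, a
scanner bounded by this range skips other tenants' rows without a per-row
filter.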
{quote}Data Locality
{quote}
I am not sure if we can control this directly. We can leave it to the HDFS and
MR layers, since they already optimize for this.
> Run UPDATE STATISTICS command by using MR integration on snapshots
> ------------------------------------------------------------------
>
> Key: PHOENIX-4009
> URL: https://issues.apache.org/jira/browse/PHOENIX-4009
> Project: Phoenix
> Issue Type: Bug
> Reporter: Samarth Jain
> Priority: Major
>
> Now that we have the capability to run queries against table snapshots
> through our map reduce integration, we can utilize this capability for stats
> collection too. This would make our stats collection more resilient, resource
> aware and less resource intensive. The bulk of the plumbing is already in
> place. We would need to make sure that the integration doesn't barf when the
> query is an UPDATE STATISTICS command.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)