[
https://issues.apache.org/jira/browse/PHOENIX-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656116#comment-16656116
]
Karan Mehta commented on PHOENIX-4009:
--------------------------------------
This is definitely a better way of collecting statistics, since it reduces the
load on the HRegionServer process.
I have the following concerns.
# Currently, the Phoenix MR framework only supports non-aggregate queries.
{{UPDATE STATISTICS}} is a mutation statement, although its internal
implementation boils down to running {{SELECT COUNT(*) FROM TABLENAME}}, which
has the special attribute {{_ANALYZETABLE}} set on it, to get the parallel
scans. Although this is an aggregation, it should be okay to change this logic
to {{SELECT * FROM TABLENAME}}, provided we check the implications.
# Statistics updates can be issued by multiple clients or applications at the
same time, or they can be triggered by major compaction. At present we prevent
multiple concurrent stats runs by using an HRegionServer-level singleton class
that maintains a list of the regions that are collecting statistics. Updating
the SYSTEM.STATS table for a region happens atomically. A potential solution is
to use the SYSTEM.MUTEX table to acquire a time-based lock on the region that
is collecting stats, so that other apps can check it before proceeding. If the
process dies midway, the lock has a TTL and will be removed automatically. The
only potential concern is collection taking longer than the TTL. We can set the
TTL to a generous value (like 1 hour) so that we are unlikely to run into this
case.
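To make the time-based lock in the second point more concrete, here is a minimal in-memory sketch of the intended semantics. It is not Phoenix code and does not touch SYSTEM.MUTEX: the class name {{RegionStatsLock}}, its methods, and the region and owner strings are all hypothetical, and the current time is passed in explicitly so that expiry can be shown without sleeping. A real implementation would presumably rely on an atomic check-and-put plus the cell TTL on the server side.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a TTL-based lock row per region.
// None of these names are Phoenix or HBase APIs.
final class RegionStatsLock {

    // One lock holder and the time at which its lock expires.
    private static final class Entry {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final Map<String, Entry> locks = new ConcurrentHashMap<>();

    RegionStatsLock(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Takes the lock if the region is unlocked or the previous lock has
    // expired. The lock is not reentrant: a live holder asking again gets false.
    boolean tryAcquire(String region, String owner, long nowMillis) {
        Entry candidate = new Entry(owner, nowMillis + ttlMillis);
        Entry winner = locks.compute(region, (key, current) ->
                (current == null || current.expiresAtMillis <= nowMillis)
                        ? candidate
                        : current);
        return winner == candidate;
    }

    // Releases the lock only if this owner still holds it.
    boolean release(String region, String owner) {
        Entry current = locks.get(region);
        return current != null
                && current.owner.equals(owner)
                && locks.remove(region, current);
    }

    public static void main(String[] args) {
        // 1 hour TTL, the value suggested above.
        RegionStatsLock lock = new RegionStatsLock(60L * 60L * 1000L);
        System.out.println("A acquires: " + lock.tryAcquire("region-1", "client-A", 0L));
        System.out.println("B acquires: " + lock.tryAcquire("region-1", "client-B", 1000L));
    }
}
```

The sketch also shows the concern raised above: if client A is still collecting after the TTL has elapsed, client B can take the lock, and A's later release is rejected because A no longer owns it.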
I am still exploring this and will post further thoughts here.
FYI [~Bin Shi] [~sukumaddineni]
> Run UPDATE STATISTICS command by using MR integration on snapshots
> ------------------------------------------------------------------
>
> Key: PHOENIX-4009
> URL: https://issues.apache.org/jira/browse/PHOENIX-4009
> Project: Phoenix
> Issue Type: Bug
> Reporter: Samarth Jain
> Priority: Major
>
> Now that we have the capability to run queries against table snapshots
> through our map reduce integration, we can utilize this capability for stats
> collection too. This would make our stats collection more resilient, resource
> aware and less resource intensive. The bulk of the plumbing is already in
> place. We would need to make sure that the integration doesn't barf when the
> query is an UPDATE STATISTICS command.