[
https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107099#comment-14107099
]
James Taylor commented on PHOENIX-180:
--------------------------------------
You can do a first pass of perf testing on your own local HBase by using our
bin/performance.py script:
{code}
cd phoenix/bin
./performance.py localhost 10000000
{code}
This will create a table named PERFORMANCE_10000000 with the following schema
and load it with 10M rows:
{code}
CREATE TABLE PERFORMANCE_10000000 (
HOST CHAR(2) NOT NULL,
DOMAIN VARCHAR NOT NULL,
FEATURE VARCHAR NOT NULL,
DATE DATE NOT NULL,
USAGE.CORE BIGINT,
USAGE.DB BIGINT,
STATS.ACTIVE_VISITOR INTEGER,
CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
) SPLIT ON ('CSGoogle','CSSalesforce','EUApple','EUGoogle','EUSalesforce',
'NAApple','NAGoogle','NASalesforce')
{code}
Then you can try some basic queries across a range within the same region like
this:
{code}
SELECT count(*) FROM PERFORMANCE_10000000
WHERE host='CS' AND domain='Google'
{code}
Whether that query stays within a single region depends on where the region
boundaries are; to see the most dramatic perf difference, you'll want a query
that targets a single region. Just take a look at what the region boundaries
are and add AND clauses for the other primary key columns, like this:
{code}
SELECT count(*) FROM PERFORMANCE_10000000
WHERE host='CS' AND domain='Google' AND feature >= 'A' AND feature <= 'D'
{code}
where the 'A' and 'D' correspond to the value for feature in any particular
region (they'll be the characters after the first zero byte in the region
boundary).
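You can read the boundaries off the HBase web UI, or dump them
programmatically. Here's a minimal sketch that uses the plain HBase client API
of that era; the PrintRegionBoundaries class is purely illustrative and not
part of Phoenix:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Illustrative helper (not part of Phoenix): prints each region's start/end
// key for the perf table so you can pick FEATURE bounds that stay inside a
// single region. Uses the pre-1.0 HTable API; newer clients would go through
// a RegionLocator instead.
public class PrintRegionBoundaries {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "PERFORMANCE_10000000");
        try {
            Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
            for (int i = 0; i < keys.getFirst().length; i++) {
                System.out.println("region " + i
                    + " start=" + Bytes.toStringBinary(keys.getFirst()[i])
                    + " end=" + Bytes.toStringBinary(keys.getSecond()[i]));
            }
        } finally {
            table.close();
        }
    }
}
{code}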
Try this with and without your change, and perhaps log and compare the split
points that Phoenix calculates in each case (i.e. the value of the
ParallelIterators.splits member variable when the query is executed).
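If it helps, here's the kind of logging I mean. It's only a sketch: the helper
class, the use of commons-logging, and the assumption that the splits live in
a List are illustrative rather than taken from the existing code:
{code}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustrative helper (not existing Phoenix code): call it from
// ParallelIterators right after the splits member has been computed, once
// with your change and once without, then diff the two logs.
public class SplitLogger {
    private static final Log LOG = LogFactory.getLog(SplitLogger.class);

    public static void logSplits(String label, List<?> splits) {
        LOG.info(label + ": " + splits.size() + " parallel chunks");
        for (Object split : splits) {
            LOG.info("  " + split); // relies on the key range's toString()
        }
    }
}
{code}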
> Use stats to guide query parallelization
> ----------------------------------------
>
> Key: PHOENIX-180
> URL: https://issues.apache.org/jira/browse/PHOENIX-180
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: James Taylor
> Assignee: ramkrishna.s.vasudevan
> Labels: enhancement
> Attachments: Phoenix-180_WIP.patch
>
>
> We're currently not using stats, beyond a table-wide min key/max key cached
> per client connection, to guide parallelization. If a query targets just a
> few regions, we don't know how to evenly divide the work among threads,
> because we don't know the data distribution. This other issue
> (https://github.com/forcedotcom/phoenix/issues/64) covers gathering and
> maintaining the stats, while this issue is focused on using them.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information
> (and implements the Writable interface so that it can be serialized back from
> the server).
> 2. Add a stats member variable off of PTable to hold this.
> 3. From MetaDataEndpointImpl, look up the stats row for the table in the
> stats table. If the stats have changed, return a new PTable with the updated
> stats information. We may want to cache the stats row and have the stats
> gatherer invalidate the cached row when it's updated so we don't always have
> to do a scan for it. Additionally, it would be ideal if we could use the same
> split policy on the stats table that we use on the system table to guarantee
> co-location of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to
> use this information to guide how to chunk up the scans at query time.
> This should help boost query performance, especially in cases where the data
> is highly skewed. That skew is likely the cause of the slowness reported in
> this issue: https://github.com/forcedotcom/phoenix/issues/47.
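For anyone picking this up, here's a rough illustration of the shape the
PTableStats interface from step 1 might take; every name and the guidepost
representation below are assumptions, not the contents of the attached patch:
{code}
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.Writable;

// Hypothetical sketch only: a stats holder that carries per-region guideposts
// (intermediate row keys) and extends Writable so the server can serialize it
// back to the client along with the rest of the PTable.
public interface PTableStats extends Writable {
    // Guidepost row keys for each region, keyed by region name, so that
    // ParallelIterators.getSplits() can cut a region into evenly sized
    // chunks instead of guessing at the data distribution.
    Map<String, List<byte[]>> getGuidePosts();
}
{code}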
--
This message was sent by Atlassian JIRA
(v6.2#6252)