[ 
https://issues.apache.org/jira/browse/IMPALA-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899546#comment-17899546
 ] 

ASF subversion and git services commented on IMPALA-13543:
----------------------------------------------------------

Commit f533225915d78e1a2a3d1e6606cf4f6d064151d2 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f53322591 ]

IMPALA-13543: single_node_perf_run.py must accept tpcds_partitioned

The tpcds_partitioned dataset is a fully partitioned version of the tpcds
dataset (the latter partitions only the store_sales table). It does not
have a default text-format database like the tpcds dataset does. Instead,
it relies on a pre-existing text-format tpcds database, whose tables are
used to populate the equivalent tpcds_partitioned database via INSERT
OVERWRITE. It does not have its own query set; its queries directory is
symlinked to share testdata/workloads/tpcds/queries. It also has a
slightly different schema from the tpcds dataset: the column
"c_last_review_date" in tpcds is "c_last_review_date_sk" in
tpcds_partitioned (TPC-DS v2.11.0, section 2.4.7). For these reasons,
tpcds_partitioned was ineligible for perf-AB-test
(single_node_perf_run.py).

This patch updates single_node_perf_run.py and related scripts to make
tpcds_partitioned eligible as a benchmark dataset. It adds an initial
step that loads the text database from the tpcds dataset at the selected
scale before running the load script for the tpcds_partitioned dataset.
The compute stats step is also limited to run one query at a time, to
avoid overadmitting the cluster with concurrent compute stats queries.
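
The ordering described above can be sketched roughly as follows. This is
an illustrative sketch only, not the code from the patch: the names
TEXT_SOURCE_DATASET, plan_load_steps, and compute_stats_serially, and the
run_query callback, are all hypothetical and do not appear in
single_node_perf_run.py.

```python
from typing import Callable, Dict, List

# Hypothetical mapping (an assumption for this sketch): datasets that are
# populated from another dataset's text-format database.
TEXT_SOURCE_DATASET: Dict[str, str] = {"tpcds_partitioned": "tpcds"}


def plan_load_steps(dataset: str, scale_factor: str) -> List[str]:
    """Return the load steps for a dataset, in the order they should run.

    If the dataset depends on a text-format source database, loading that
    source at the same scale comes first.
    """
    steps: List[str] = []
    source = TEXT_SOURCE_DATASET.get(dataset)
    if source is not None:
        steps.append(f"load text database for {source} at scale {scale_factor}")
    steps.append(f"load {dataset} at scale {scale_factor}")
    return steps


def compute_stats_serially(tables: List[str],
                           run_query: Callable[[str], str]) -> List[str]:
    """Issue COMPUTE STATS for one table at a time.

    Each statement finishes before the next one is submitted, so the
    cluster never sees more than one compute stats query from this loop.
    run_query is a hypothetical callback that executes a SQL string.
    """
    results: List[str] = []
    for table in tables:
        results.append(run_query(f"COMPUTE STATS {table}"))
    return results
```

With these assumptions, plan_load_steps("tpcds_partitioned", "10") yields
the tpcds text load followed by the tpcds_partitioned load, while a dataset
with no text-format dependency yields a single step.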

Created a helper function, build_replacement_params(), inside
generate-schema-statements.py to hold common logic.

Testing
- Ran perf-AB-test-ub2004 with this commit included and confirmed that the
  benchmark works with the tpcds_partitioned dataset.
- Ran normal data loading. Passed FE tests and
  query_test/test_tpcds_queries.py.

Change-Id: I4b6f435705dcf873696ffd151052ebeab35d9898
Reviewed-on: http://gerrit.cloudera.org:8080/22061
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Make tpcds_partitioned eligible for single_node_perf_run.py
> -----------------------------------------------------------
>
>                 Key: IMPALA-13543
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13543
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>
> The tpcds_partitioned dataset is a fully partitioned version of the tpcds
> dataset (the latter partitions only the store_sales table). It does not have
> a default text-format database like the tpcds dataset does. Instead, it
> relies on a pre-existing text-format tpcds database, whose tables are used
> to populate the equivalent tpcds_partitioned database via INSERT OVERWRITE.
> It does not have its own query set; its queries directory is symlinked to
> share testdata/workloads/tpcds/queries. It also has a slightly different
> schema from the tpcds dataset: the column "c_last_review_date" in tpcds is
> "c_last_review_date_sk" in tpcds_partitioned (TPC-DS v2.11.0, see related
> commit in
> [impala-tpcds-kit|https://github.com/cloudera/impala-tpcds-kit/commit/086d7113c8b4172247f83f60f4e274fe3326df11]).
> These reasons make tpcds_partitioned ineligible for perf-AB-test
> (single_node_perf_run.py), which requires the dataset to be loadable through
> bin/load-data.py in a single execution. single_node_perf_run.py and related
> scripts must be modified slightly to accept the tpcds_partitioned dataset
> for benchmarking.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
