[ 
https://issues.apache.org/jira/browse/IMPALA-12601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797973#comment-17797973
 ] 

ASF subversion and git services commented on IMPALA-12601:
----------------------------------------------------------

Commit 8661f922d3ccb21da73b9f7f8734d9113429e9bb in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8661f922d ]

IMPALA-12601: Add a fully partitioned TPC-DS database

The current tpcds dataset has only the store_sales table fully partitioned
and leaves the other fact tables unpartitioned. This is intended for
faster data loading during tests. However, it is not an accurate
reflection of the larger scale TPC-DS dataset, where all fact tables are
partitioned. The Impala planner may change the details of the query plan
if a partition column exists.

This patch adds a new dataset, tpcds_partitioned, which loads a fully
partitioned TPC-DS db in Parquet format named
tpcds_partitioned_parquet_snap. This dataset cannot be loaded
independently: it requires the base 'tpcds' db from the tpcds dataset to
be preloaded first. An example of how to load this dataset can be seen
in the load-tpcds-data function in testdata/bin/create-load-data.sh.
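
The load-order requirement above can be pictured with a minimal Python
sketch. It is an illustration only: the PREREQUISITES map and the
load_order helper are hypothetical names and are not part of Impala's
loading scripts, which express this ordering in shell.

```python
# Hypothetical prerequisite map. The dataset names mirror this patch, but the
# structure is an illustration and is not taken from Impala's scripts.
PREREQUISITES = {
    "tpcds": [],
    "tpcds_partitioned": ["tpcds"],
}


def load_order(targets, prerequisites=PREREQUISITES):
    """Return datasets in an order that loads every prerequisite first.

    Assumes the prerequisite map contains no cycles.
    """
    ordered = []

    def visit(name):
        if name in ordered:
            return
        # Load everything this dataset depends on before the dataset itself.
        for dep in prerequisites.get(name, []):
            visit(dep)
        ordered.append(name)

    for target in targets:
        visit(target)
    return ordered
```

Asking for only tpcds_partitioned still places tpcds first in the
returned order, which matches the constraint described above.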

This patch also changes PlannerTest#testProcessingCost from targeting
tpcds_parquet to tpcds_partitioned_parquet_snap. Other planner tests
that currently target tpcds_parquet will be gradually changed to test
against tpcds_partitioned_parquet_snap in follow-up patches.

This patch adds a couple of seconds to the "Computing table stats"
step, but the added loading time itself is negligible since loading is
parallelized with TPC-H and functional-query. The total loading time for
the three datasets remains similar after this patch.

This patch also adds several improvements in the following files:

bin/load-data.py:
- Log elapsed time on serial steps.
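
One common way to log elapsed time around a serial step is a small
context manager. The sketch below assumes Python 3. The log_elapsed
helper and the logger name are hypothetical and do not reproduce the
code in bin/load-data.py.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen for this sketch only.
LOG = logging.getLogger("load-data-sketch")


@contextmanager
def log_elapsed(step_name, log=LOG):
    """Log how long the wrapped block took, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        log.info("%s took %.2f seconds", step_name, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with log_elapsed("Computing table stats"):
        pass  # A real serial step would run here.
```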

testdata/bin/create-load-data.sh:
- Rename MSG to LOAD_MSG to avoid a collision with the variable of the
  same name in ./testdata/bin/run-step.sh

testdata/bin/generate-schema-statements.py:
- Remove redundant FILE_FORMAT_MAP.
- Add build_partitioned_load to simplify expressing partitioned insert
  queries in the SQL template.

testdata/datasets/tpcds/tpcds_schema_template.sql:
- Reorder schema template to load all dimension tables before fact tables.

Testing:
- Pass core tests.

Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Reviewed-on: http://gerrit.cloudera.org:8080/20756
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Add a fully partitioned TPC-DS dataset for planner tests
> --------------------------------------------------------
>
>                 Key: IMPALA-12601
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12601
>             Project: IMPALA
>          Issue Type: Test
>          Components: Infrastructure
>    Affects Versions: Impala 4.3.0
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>
> testdata/datasets/tpcds has only the store_sales table fully partitioned and
> leaves the other fact tables unpartitioned. This is probably intended for
> faster data loading during tests. However, it is not an accurate reflection of
> the larger scale TPC-DS dataset, where all fact tables are partitioned. The
> Impala planner may change the details of the query plan if a partition column
> exists.
> Today, Impala can load a small scale, fully partitioned TPC-DS dataset in
> Parquet format into the minicluster at a reasonable speed. We should consider
> adding such a test db and changing PlannerTests to test against that fully
> partitioned db instead of the current tpcds/tpcds_parquet test db.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
