[Impala-ASF-CR] IMPALA-12601: Add a fully partitioned TPC-DS database

Riza Suminto (Code Review) Wed, 06 Dec 2023 10:44:24 -0800

Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20756 )

Change subject: IMPALA-12601: Add a fully partitioned TPC-DS database
......................................................................

Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql
File testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql@483
PS2, Line 483: -- The following query options are set to optimize small scale
fact tables with 3 impalads.
: -- Small MAX_SCAN_RANGE_LENGTH will split 1 file into few scan
ranges that can be
: -- distributed across 3 impalads in minicluster. Without this,
only 1 fragment at 1
: -- impalad do
> What do these settings do? Do we still need them?
This set of options optimizes loading of small-scale TPC-DS. The Impala
nightly tests load only the 1GB scale of TPC-DS, and dsdgen produces only
1 file per table at this scale.

A small MAX_SCAN_RANGE_LENGTH splits 1 file into a few scan ranges that can be
distributed across the 3 impalads of the minicluster. Without this, only 1
scanner node at 1 impalad does the reading.

MT_DOP=4 increases the number of scanners and writers to at most 12 (3
impalads x 4) for a single insert overwrite query.
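
To make the arithmetic behind these two options concrete, here is a rough
sketch. It is a simplified model, not Impala's scheduler: it ignores HDFS
block boundaries, the function names are hypothetical, and the 256 MB file
size and 64 MB limit are made-up illustration values, not the ones in the
patch. Only the 3 impalads and MT_DOP=4 come from the discussion above.

```python
import math


def scan_range_count(file_bytes: int, max_scan_range_length: int) -> int:
    """Estimate how many scan ranges one file is split into.

    Simplification: a limit of 0 (or less) is treated as "no limit", so the
    whole file becomes a single scan range. Block boundaries are ignored.
    """
    if max_scan_range_length <= 0:
        return 1
    return max(1, math.ceil(file_bytes / max_scan_range_length))


def max_fragment_instances(num_impalads: int, mt_dop: int) -> int:
    """Upper bound on parallel scanner/writer instances for one query."""
    return num_impalads * max(1, mt_dop)


MB = 1024 * 1024

# Hypothetical 256 MB file with a hypothetical 64 MB limit.
ranges_with_limit = scan_range_count(256 * MB, 64 * MB)
# Same file without a limit: a single range, so a single reader.
ranges_without_limit = scan_range_count(256 * MB, 0)
# 3 impalads in the minicluster, MT_DOP=4.
instances = max_fragment_instances(3, 4)

print(ranges_with_limit, ranges_without_limit, instances)
```

The point is only that one file yields work for more than one impalad once
the range limit is smaller than the file, and that MT_DOP multiplies the
per-impalad parallelism.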

I think SORT_RUN_BYTES_LIMIT can be dropped, as it is primarily needed to
trigger more frequent sort-and-spill during larger scale TPC-DS loading.
https://github.com/cloudera/impala-tpcds-kit/blob/d829fc392a70df8300a8d9fd265977fa078a2dab/scripts/impala-insert.sql#L8
This is now removed.

http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql@715
PS2, Line 715: foreign key (ss_store_sk) references {db_name}{db_suffix}.store
(s_store_sk) DISABLE NOVALIDATE RELY
: foreign key (ss_promo_sk) references
{db_name}{db_suffix}.promotion (p_promo_sk) DISABLE NOVALIDATE RELY
: ---- PARTITION_COLUMNS
> We don't specify the "partitioned_insert:" multi-statement load here. Is th
Correct. We don't need a multi-statement load here because, at small scale,
with Parquet and the query option tuning above, the Impala minicluster can
handle loading store_sales in a single run. I think I saw around 32 MB peak
memory usage per impalad while loading store_sales.

Given the same query option tuning, it might be possible to eliminate the
multi-statement load in the original tpcds dataset. I kept it as a
multi-statement load to preserve the same baseline when measuring the overhead
of this extra dataset load. I'm not sure whether Hive loading text is as
efficient as Impala loading Parquet in a single query.

Historically, I think the multi-statement load in the tpcds dataset has also
been the better choice when people need the same script to load a larger
TPC-DS scale. I'd encourage people to move to impala-tpcds-kit instead.

--
To view, visit http://gerrit.cloudera.org:8080/20756
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Gerrit-Change-Number: 20756
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Laszlo Gaal <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 06 Dec 2023 18:40:12 +0000
Gerrit-HasComments: Yes
