Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20756 )
Change subject: IMPALA-12601: Add a fully partitioned TPC-DS database
......................................................................

Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql
File testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql@483
PS2, Line 483: -- The following query options are set to optimize small scale fact tables with 3 impalads.
             : -- Small MAX_SCAN_RANGE_LENGTH will split 1 file into few scan ranges that can be
             : -- distributed across 3 impalads in minicluster. Without this, only 1 fragment at 1
             : -- impalad do
> What do these settings do? Do we still need them?
These options optimize loading of small-scale TPC-DS. The Impala nightly tests load only the 1 GB scale of TPC-DS, and dsdgen produces only 1 file per table at this scale. A small MAX_SCAN_RANGE_LENGTH splits 1 file into a few scan ranges that can be distributed across the 3 impalads of the minicluster. Without it, only 1 scanner node on 1 impalad does the reading. MT_DOP=4 increases the number of scanners and writers to at most 12 for a single insert overwrite query.

I think SORT_RUN_BYTES_LIMIT can be dropped, as it is primarily needed to trigger more frequent sort-and-spill during larger-scale TPC-DS loading.
https://github.com/cloudera/impala-tpcds-kit/blob/d829fc392a70df8300a8d9fd265977fa078a2dab/scripts/impala-insert.sql#L8
This is now removed.

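As a rough illustration of the arithmetic behind these options, the sketch below estimates how many scan ranges one file yields and how many scanner/writer instances a single query can reach. The function names are hypothetical, and the 64 MiB file size and 8 MiB MAX_SCAN_RANGE_LENGTH are made-up values for illustration, not the values used in the template. The split estimate is a simplification: actual scan range generation in Impala also depends on block boundaries and the file format.

```python
import math


def scan_range_count(file_bytes: int, max_scan_range_length: int) -> int:
    """Approximate number of scan ranges produced from a single file.

    Simplified model: the file is cut into chunks of at most
    max_scan_range_length bytes. Both arguments must be positive.
    """
    if file_bytes <= 0 or max_scan_range_length <= 0:
        raise ValueError("file_bytes and max_scan_range_length must be positive")
    return math.ceil(file_bytes / max_scan_range_length)


def max_fragment_instances(num_impalads: int, mt_dop: int) -> int:
    """Upper bound on scanner/writer instances for one query.

    Each impalad can run up to mt_dop instances of a fragment.
    """
    if num_impalads <= 0 or mt_dop <= 0:
        raise ValueError("num_impalads and mt_dop must be positive")
    return num_impalads * mt_dop


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Assumed values for illustration only.
    ranges = scan_range_count(file_bytes=64 * MIB, max_scan_range_length=8 * MIB)
    instances = max_fragment_instances(num_impalads=3, mt_dop=4)
    print(f"scan ranges: {ranges}, max instances: {instances}")
```

With 3 impalads and MT_DOP=4 the upper bound is 12 instances, matching the number quoted above. The single file has to split into at least that many scan ranges for all of them to get work.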
http://gerrit.cloudera.org:8080/#/c/20756/2/testdata/datasets/tpcds_partitioned/tpcds_partitioned_schema_template.sql@715
PS2, Line 715: foreign key (ss_store_sk) references {db_name}{db_suffix}.store (s_store_sk) DISABLE NOVALIDATE RELY
             : foreign key (ss_promo_sk) references {db_name}{db_suffix}.promotion (p_promo_sk) DISABLE NOVALIDATE RELY
             : ---- PARTITION_COLUMNS
> We don't specify the "partitioned_insert:" multi-statement load here. Is th
Correct. We don't need a multi-statement load here because, at small scale with Parquet and the query option tuning above, the Impala minicluster can load store_sales in a single run. I think I see around 32 MB peak memory usage per impalad during store_sales loading.

Given the same query option tuning, it might be possible to eliminate the multi-statement load in the original tpcds dataset as well. I kept it as a multi-statement load there to preserve the same baseline when measuring the overhead of this extra dataset load. I am not sure whether Hive loading text is as efficient as Impala loading Parquet in a single query. Historically, I think the multi-statement load in the tpcds dataset is also better if people need to use the same script to load a larger TPC-DS scale. I'd encourage people to move to impala-tpcds-kit instead.

--
To view, visit http://gerrit.cloudera.org:8080/20756
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a2e66c405639554f325ae78c66628d464f6c453
Gerrit-Change-Number: 20756
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Laszlo Gaal <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 06 Dec 2023 18:40:12 +0000
Gerrit-HasComments: Yes
