[
https://issues.apache.org/jira/browse/IMPALA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814178#comment-17814178
]
ASF subversion and git services commented on IMPALA-12726:
----------------------------------------------------------
Commit f87c20800de9f7dc74e47aa9a8c0dc878f4f0840 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f87c20800 ]
IMPALA-12726: Simulate large-scale query in TpcdsCpuCostPlannerTest
Querying against large-scale databases is a good way to test Impala, but
it is impractical on a single-node development machine. Frontend testing
does not run the test query in the backend executor and can therefore
benefit from simulated large-scale test cases. This patch attempts to do
that by instrumenting the CatalogD metadata-loading code to scale
tpcds_partitioned_parquet_snap, injecting column stats from a 3TB TPC-DS
dataset in TpcdsCpuCostPlannerTest.
The large-scale column stats are stored in stats-3TB.json, produced by
running "SHOW COLUMN STATS" and "DESCRIBE FORMATTED" queries on a 3TB
dataset loaded with impala-tpcds-kit. The file is parsed and then
piggy-backed through RuntimeEnv. Code that populates stats metadata
(callers of FeCatalogUtils.getRowCount(), FeCatalogUtils.getTotalSize(),
and FeCatalogUtils.injectColumnStats()) is instrumented to populate
stats from RuntimeEnv instead of the Metastore.
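The injection path can be pictured with a small sketch (class, field, and
table names here are hypothetical stand-ins; Impala's actual RuntimeEnv and
FeCatalogUtils APIs differ):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of stat injection via a process-wide test holder,
// in the spirit of Impala's RuntimeEnv (the real API differs).
public class StatsInjectionSketch {
  // "table.column" -> injected count, e.g. parsed from a stats JSON dump
  // such as stats-3TB.json.
  static final Map<String, Long> injectedStats = new HashMap<>();

  // Callers that populate stats metadata consult the injected map first
  // and fall back to the value the Metastore reported otherwise.
  static long lookupStat(String tableCol, long metastoreValue) {
    Long injected = injectedStats.get(tableCol);
    return injected != null ? injected : metastoreValue;
  }

  public static void main(String[] args) {
    // Simulate loading one entry from a large-scale stats dump.
    injectedStats.put("customer.c_customer_sk", 30_000_000L);
    System.out.println(lookupStat("customer.c_customer_sk", 100_000L)); // 30000000
    System.out.println(lookupStat("customer.c_first_name", 5_000L));    // 5000
  }
}
```

The point of routing through a single process-wide holder is that planner
code paths need only one extra lookup, and tests can swap stats in and out
without touching the Metastore.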
Scaled-up tables are invalidated before a test run so that they reload
with the new high-scale stats. This patch also adds scan range limit
injection to force a ScanNode over a single-file table to act as if it
scans a multi-file table.
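The scan-range-limit idea can be illustrated schematically (names are
hypothetical; Impala's actual ScanNode logic is more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: splitting one file's byte range into several scan ranges so a
// single-file table plans like a multi-file table. Hypothetical names.
public class ScanRangeSketch {
  // Split [0, fileLen) into {offset, length} ranges of at most maxLen bytes.
  static List<long[]> splitIntoRanges(long fileLen, long maxLen) {
    List<long[]> ranges = new ArrayList<>();
    for (long off = 0; off < fileLen; off += maxLen) {
      ranges.add(new long[] { off, Math.min(maxLen, fileLen - off) });
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A single 1 GiB file with a 256 MiB limit yields 4 scan ranges, so
    // the planner sees parallelism similar to a 4-file table.
    List<long[]> r = splitIntoRanges(1L << 30, 256L << 20);
    System.out.println(r.size()); // prints 4
  }
}
```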
tpcds_partitioned_schema_template.sql is modified to match column names
and types from impala-tpcds-kit. The test files under
PlannerTest/tpcds_cpu_cost/ are replaced with queries that are
specifically generated to run against the 3TB scale factor of the TPC-DS
dataset
(https://github.com/cloudera/impala-tpcds-kit/blob/separate_queries_per_scale_factor/queries/sf3000/).
All query plans match the plans obtained from real query runs on a large
cluster, except for a few mismatches caused by the hard limit on the
number of files per table. The 3 queries out of 103 whose plan shapes
still do not match, and the reasons, are listed below.
+-----+----------------------------------------------+
| Q | Reason |
+-----+----------------------------------------------+
| 10a | different num files in customer_demographics |
| 34 | different num files in customer |
| 69 | different num files in customer |
+-----+----------------------------------------------+
Testing:
- Scaled tables of tpcds_partitioned_parquet_snap in
  TpcdsCpuCostPlannerTest to simulate a 3TB TPC-DS dataset. The number of
  executors is raised from 3 to 10, and REPLICA_PREFERENCE=REMOTE is set
  to ignore data locality.
- Pass core tests.
Change-Id: Iaffddd70c2da8376ca6c40f65606bbac46c34de7
Reviewed-on: http://gerrit.cloudera.org:8080/20922
Reviewed-by: Wenzhe Zhou <[email protected]>
Reviewed-by: Quanlong Huang <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Simulate large scale query planning in TpcdsCpuCostPlannerTest
> --------------------------------------------------------------
>
> Key: IMPALA-12726
> URL: https://issues.apache.org/jira/browse/IMPALA-12726
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
>
> Querying against a large-scale database is a good way to test Impala.
> However, it is impractical on a single-node development machine.
> Frontend testing does not actually run the test query in the backend
> executor and can benefit from simulated large-scale test cases. This can
> be done by either instrumenting the CatalogD metadata loading code or
> the COMPUTE STATS query to multiply numRows, numNull, numTrues, and
> numFalses by a scale constant. We can start by hacking
> TpcdsCpuCostPlannerTest to simulate 1TB TPC-DS query planning.
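The multiply-by-a-constant idea from the description can be sketched as
follows (a hypothetical stats struct, not Impala's actual Thrift type;
the sample counts are made up):

```java
// Minimal sketch of scaling per-column stats by a constant factor, as the
// issue description proposes. Field names mirror the ones mentioned there
// (numRows is the analogous table-level stat).
public class ColumnStatsSketch {
  long numNulls, numTrues, numFalses;

  ColumnStatsSketch(long nulls, long trues, long falses) {
    numNulls = nulls; numTrues = trues; numFalses = falses;
  }

  // Multiply the counters by a scale constant to simulate a larger dataset.
  ColumnStatsSketch scaled(long scale) {
    return new ColumnStatsSketch(
        numNulls * scale, numTrues * scale, numFalses * scale);
  }

  public static void main(String[] args) {
    ColumnStatsSketch small = new ColumnStatsSketch(10, 600, 400);
    ColumnStatsSketch big = small.scaled(1000); // simulate ~1000x the data
    System.out.println(big.numNulls + " " + big.numTrues + " " + big.numFalses);
    // prints: 10000 600000 400000
  }
}
```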
--
This message was sent by Atlassian Jira
(v8.20.10#820010)