[jira] [Commented] (IMPALA-12726) Simulate large scale query planning in TpcdsCpuCostPlannerTest

ASF subversion and git services (Jira) Mon, 09 Dec 2024 10:47:57 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904247#comment-17904247
 ]


ASF subversion and git services commented on IMPALA-12726:
----------------------------------------------------------

Commit 8e71f5ec8609cc046cf35eb044d91bf34ae9f9c7 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8e71f5ec8 ]

IMPALA-13535: Add script to restore stats on PlannerTest

Impala has several PlannerTests that validate against the EXTENDED
profile and validate cardinality. At the EXTENDED level, the profile
displays stored table stats from HMS, such as 'numRows' and 'totalSize',
which can vary between data loads. PlannerTest does not validate these
lines, but frequent changes to them disturb the code review process
because they are mostly noise.

This patch provides a Python script, restore-stats-on-planner-tests.py,
to fix the table stats information in selected .test files. The test
files to check and the fixed table stats are declared inside the script.
It currently focuses on tests under
functional-planner/queries/PlannerTest/tpcds/ and some tests that run
against the tpcds_partitioned_parquet_snap table.
critique-gerrit-review.py is updated to run with python3, trigger
restore-stats-on-planner-tests.py, and warn if any unnecessary table
stats change is detected.
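
As an illustration of the kind of rewrite described above, here is a
minimal sketch, not the actual restore-stats-on-planner-tests.py. It
assumes the stats appear as a "table: rows=... size=..." line below a
"SCAN HDFS [db.table ...]" line; the CANONICAL_STATS values and the
function name are hypothetical and exist only for this example.

```python
import re

# Hypothetical canonical stats; the real script declares its own values.
CANONICAL_STATS = {
    "tpcds.store_sales": ("2.88M", "200.96MB"),
}

# Assumed plan layout: a scan line names the table, and a later
# "table: rows=... size=..." line carries the stored stats.
SCAN_RE = re.compile(r"SCAN HDFS \[([\w.]+)")
STATS_RE = re.compile(r"^(\s*table: rows=)(\S+)( size=)(\S+)\s*$")


def restore_stats(text, canonical=CANONICAL_STATS):
    """Return text with stats lines of known tables reset to canonical values."""
    current_table = None
    out = []
    for line in text.splitlines():
        scan = SCAN_RE.search(line)
        if scan:
            # Remember which table the following stats line belongs to.
            current_table = scan.group(1)
        stats = STATS_RE.match(line)
        if stats and current_table in canonical:
            rows, size = canonical[current_table]
            line = f"{stats.group(1)}{rows}{stats.group(3)}{size}"
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Lines for tables that are not listed are left untouched, so a reviewer
only sees stats changes for tables the sketch does not pin.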

This patch also fixes the table size for tests under
functional-planner/queries/PlannerTest/tpcds_cpu_cost/ because all tests
there run with synthetic stats declared in stats-3TB.json. Before the
patch, the table stats printed in the plan were the real stats from HMS.
After this patch, the table stats displayed are calculated from
stats-3TB.json. See IMPALA-12726 for more detail on large scale planner
test simulation.

Testing:
- Manually ran the script and confirmed that stats lines are replaced
  correctly.
- Ran the affected PlannerTests; all passed.

Change-Id: I27bab7cee93880cd59f01b9c2d1614dfcabdc682
Reviewed-on: http://gerrit.cloudera.org:8080/22045
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Simulate large scale query planning in TpcdsCpuCostPlannerTest
> --------------------------------------------------------------
>
>                 Key: IMPALA-12726
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12726
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.4.0
>
>
> Querying against a large scale database is a good way to test Impala. However,
> it is impractical to do on a single node development machine.
> Frontend testing does not actually run the test query in the backend executor and
> can benefit from simulated large scale test cases. This can be done by either
> instrumenting the CatalogD metadata loading code or the COMPUTE STATS query
> to multiply numRows, numNull, numTrues, and numFalses by a scale constant. We
> can start by hacking TpcdsCpuCostPlannerTest to simulate 1TB TPC-DS query
> planning.
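
The scaling idea in the description can be sketched as follows. This is
only an illustration, not Impala code: the field names come from the
issue text, while the dict layout, the scale_stats name, and the choice
to leave negative values untouched (treating them as "unknown") are
assumptions of this sketch.

```python
# Count-like stats named in the issue description.
SCALED_FIELDS = ("numRows", "numNull", "numTrues", "numFalses")


def scale_stats(stats, scale):
    """Return a copy of stats with the count-like fields multiplied by scale."""
    scaled = dict(stats)
    for field in SCALED_FIELDS:
        value = scaled.get(field)
        # Assumption: a negative value means "unknown" and is not scaled.
        if value is not None and value >= 0:
            scaled[field] = value * scale
    return scaled
```

Fields outside SCALED_FIELDS pass through unchanged, so only the
row-count style stats grow with the scale constant.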



