Stamatis Zampetakis created HIVE-29476:
------------------------------------------
Summary: Add tests for TPC-DS 30TB metastore content
Key: HIVE-29476
URL: https://issues.apache.org/jira/browse/HIVE-29476
Project: Hive
Issue Type: Test
Components: Test
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
The [TPC-DS 30TB plan regression
suite|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezTPCDS30TBPerfCliDriver.java]
is based on a pre-built database dump that is loaded via a dockerized [Postgres
database|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/dbinstall/rules/PostgresTPCDS.java].
The content of the dump is not validated anywhere, so we can only verify
what's inside by manually inspecting the dump or by inferring it indirectly
from the query plans. The dump has been updated a few times already, and
another update is imminent in HIVE-26830. Because the dump is created
manually, it would be helpful to have a basic set of tests that verify the
state of the metastore and track how the dump evolves.
Interesting information that we would like to capture includes:
* table and column data types
* constraints (FK, NOT NULL)
* basic table stats such as numRows, numPartitions, etc.
* basic column stats such as min, max, NDV, num_nulls, etc.
The above can be captured by adding DESCRIBE FORMATTED qtests for each TPC-DS
table and column. As an added bonus, this will increase the coverage for
DESCRIBE statements.
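
As a rough illustration of what such qtests might contain, the sketch below
generates one table-level DESCRIBE FORMATTED statement per table and one
statement per column. The helper name describe_statements and the SCHEMA
mapping are hypothetical, and the schema is a small illustrative subset
rather than the full TPC-DS schema; how the statements would be split across
.q files is left open.

```python
# Hypothetical helper: emit DESCRIBE FORMATTED statements for each table
# and each of its columns. SCHEMA is an illustrative subset only, not the
# full TPC-DS schema and not read from the actual metastore dump.
SCHEMA = {
    "store_sales": ["ss_sold_date_sk", "ss_item_sk", "ss_quantity"],
    "item": ["i_item_sk", "i_item_id"],
}


def describe_statements(schema):
    """Return one table-level statement per table, then one per column."""
    statements = []
    for table, columns in schema.items():
        # Table-level output covers data types, constraints and table stats.
        statements.append(f"DESCRIBE FORMATTED {table};")
        for column in columns:
            # Column-level output covers column stats (min, max, NDV, nulls).
            statements.append(f"DESCRIBE FORMATTED {table} {column};")
    return statements


if __name__ == "__main__":
    print("\n".join(describe_statements(SCHEMA)))
```

The printed statements could be pasted into a qtest file, with the recorded
output then serving as the baseline that shows how the dump changes between
updates.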