Csaba Ringhofer created IMPALA-14367:
----------------------------------------

             Summary: Reduce/rationalize test vector set for compressed file formats
                 Key: IMPALA-14367
                 URL: https://issues.apache.org/jira/browse/IMPALA-14367
             Project: IMPALA
          Issue Type: Test
          Components: Infrastructure, Test
            Reporter: Csaba Ringhofer


During exhaustive tests a lot of test vectors are created for some rarely used file formats (e.g. rc, sequence), because these files can also be compressed and each file format/compression pair is considered a new item in the file_format dimension. Block vs record level compression can be an extra dimension (e.g. seq/gzip/record).

Meanwhile the most commonly used file format, Parquet, can also use several compression types at page level, but only snappy compression is heavily tested.

As an example, https://gerrit.cloudera.org/#/c/23342/ fixed pairwise test vector generation, bumping exhaustive EE/custom cluster tests from 11000 to 17000, and restricting some tests to use only a single compression per file format (single_compression_constraint()) reduced it to 16000.

A few questions arise:
1. What is the priority of testing different file formats? This depends IMO both on the frequency of usage and on the development activity in that area.
2. What tests should have a file_format dimension at all?
3. What tests should consider compression in the file format dimension?
4. Is it possible to also remove some vectors from test data generation, or are all of them needed to get good coverage? It is possible that some tables are created but never touched by tests.
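
To illustrate why format/compression pairs inflate the vector count, here is a minimal Python sketch. It is not Impala's test framework: the format/compression matrix, the `batch_sizes` dimension and the helper names (`file_format_dimension`, `keep_one_compression_per_format`, `exhaustive_vectors`) are all hypothetical, and the real single_compression_constraint() may behave differently. It only shows that every extra compression variant multiplies with each other dimension in an exhaustive run.

```python
from itertools import product

# Hypothetical format -> compression variants, for illustration only.
# This is NOT the list Impala actually generates test data for.
COMPRESSIONS = {
    "text": ["none"],
    "parquet": ["none", "snappy"],
    "rc": ["none", "gzip", "snappy"],
    "seq": ["none", "gzip/block", "gzip/record", "snappy/block", "snappy/record"],
}


def file_format_dimension(compressions):
    """Flatten the matrix: each (format, compression) pair is one dimension item."""
    return [(fmt, codec) for fmt, codecs in compressions.items() for codec in codecs]


def keep_one_compression_per_format(items):
    """Hypothetical constraint: keep the first compression listed per format."""
    seen = set()
    kept = []
    for fmt, codec in items:
        if fmt not in seen:
            seen.add(fmt)
            kept.append((fmt, codec))
    return kept


def exhaustive_vectors(file_formats, other_dimension):
    """Exhaustive generation is the full cross product of the dimensions."""
    return list(product(file_formats, other_dimension))


all_items = file_format_dimension(COMPRESSIONS)
reduced_items = keep_one_compression_per_format(all_items)
batch_sizes = [0, 1, 16]  # hypothetical second dimension

print(len(all_items), len(exhaustive_vectors(all_items, batch_sizes)))
print(len(reduced_items), len(exhaustive_vectors(reduced_items, batch_sizes)))
```

In this toy matrix the sequence format alone contributes 5 of the 11 file_format items, so a test that does not depend on compression runs almost three times as many vectors as it would with one compression per format.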