[jira] [Created] (IMPALA-14367) Reduce/rationalize test vector set for compressed file formats

Csaba Ringhofer (Jira) Thu, 28 Aug 2025 03:09:13 -0700

Csaba Ringhofer created IMPALA-14367:
----------------------------------------


             Summary: Reduce/rationalize test vector set for compressed file formats
                 Key: IMPALA-14367
                 URL: https://issues.apache.org/jira/browse/IMPALA-14367
             Project: IMPALA
          Issue Type: Test
          Components: Infrastructure, Test
            Reporter: Csaba Ringhofer


During exhaustive tests, a large number of test vectors are created for some 
rarely used file formats (e.g. rc, sequence), because these files can also be 
compressed and each file format/compression pair is considered a new item in 
the file_format dimension. Block vs. record level compression can be an extra 
dimension (e.g. seq/gzip/record). Meanwhile, the most commonly used file 
format, Parquet, can also use several compression types at page level, but only 
snappy compression is heavily tested.
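
To illustrate how quickly a flattened file_format dimension grows, here is a 
minimal sketch. The format, codec and compression type lists are hypothetical 
placeholders chosen for illustration, not Impala's actual test configuration, 
and file_format_dimension() is not a function from the test framework.

```python
from itertools import product

# Hypothetical, illustrative values; NOT Impala's real test configuration.
COMPRESSIBLE_FORMATS = ["seq", "rc"]
CODECS = ["gzip", "snappy", "bzip"]
COMPRESSION_TYPES = ["block", "record"]


def file_format_dimension():
    """Return every (format, codec, compression_type) entry of a flattened dimension."""
    # Assumption: Parquet contributes a single snappy-compressed entry.
    entries = [("parquet", "snappy", "page")]
    for fmt in COMPRESSIBLE_FORMATS:
        # One uncompressed entry per format ...
        entries.append((fmt, "none", "none"))
        # ... plus one entry per codec/compression type pair.
        entries.extend(product([fmt], CODECS, COMPRESSION_TYPES))
    return entries


entries = file_format_dimension()
print(len(entries))
```

With these placeholder lists, the two rarely used formats contribute 14 of the 
15 entries, while Parquet contributes one.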

As an example, https://gerrit.cloudera.org/#/c/23342/ fixed pairwise test 
vector generation, bumping the number of exhaustive EE/custom cluster tests 
from 11000 to 17000; restricting some tests to use only a single compression 
per file format (single_compression_constraint()) reduced it to 16000.
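
The idea of keeping a single compression per file format can be sketched as 
follows. This is not the implementation of single_compression_constraint(); 
keep_single_compression() and the sample entries are hypothetical, and the 
sketch simply assumes that the first entry listed for each format is the one 
to keep.

```python
def keep_single_compression(entries):
    """Keep only the first (format, codec, compression_type) entry seen per format."""
    chosen = {}
    for fmt, codec, compression_type in entries:
        # setdefault() ignores later entries for a format that was already seen.
        chosen.setdefault(fmt, (fmt, codec, compression_type))
    # Dicts preserve insertion order, so the original format order is kept.
    return list(chosen.values())


# Hypothetical sample of a flattened file_format dimension.
sample_entries = [
    ("parquet", "snappy", "page"),
    ("seq", "none", "none"),
    ("seq", "gzip", "block"),
    ("seq", "gzip", "record"),
    ("rc", "none", "none"),
    ("rc", "snappy", "block"),
]

reduced = keep_single_compression(sample_entries)
print(reduced)
```

Which entry to keep per format is a design choice: a real constraint might 
prefer the uncompressed variant or the most commonly used codec.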

A few questions arise:
1. What is the priority of testing different file formats? IMO this depends 
both on the frequency of usage and the development activity in that area.
2. Which tests should have a file_format dimension at all?
3. Which tests should consider compression in the file_format dimension?
4. Is it possible to also remove some vectors from test data generation, or are 
all of them needed to get good coverage? It is possible that some tables are 
created but never touched by tests.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
