Github user dilipbiswal commented on the issue:
https://github.com/apache/spark/pull/22641
@mgaido91
Thanks for your input.
I took another look at the test case. Let me outline my
understanding first.
- The test validates the precedence rules that determine the resulting
compression when both session-level and table-level codecs are configured.
- It verifies that the correct compression is picked by reading the
Parquet/ORC file metadata.
- The accepted codecs for Parquet are: none, uncompressed, snappy,
gzip, lzo, brotli, lz4, zstd
- The accepted codecs for ORC are: none, uncompressed, snappy,
zlib, lzo
- The test case in question uses only a SUBSET of the allowed Parquet
codecs: uncompressed, snappy, gzip
- The test case in question uses only a SUBSET of the allowed ORC codecs:
none, snappy, zlib
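To make the precedence rule above concrete, here is a minimal Python sketch of what the test is checking; the helper name and shape are mine (not Spark's API), and it assumes the table-level codec wins over the session-level one when both are set:

```python
# Illustrative sketch only: the function name and structure are hypothetical,
# not Spark code. It assumes a table-level compression property, when present,
# takes precedence over the session-level codec.

# Accepted codec sets, as listed above.
PARQUET_CODECS = {"none", "uncompressed", "snappy", "gzip",
                  "lzo", "brotli", "lz4", "zstd"}
ORC_CODECS = {"none", "uncompressed", "snappy", "zlib", "lzo"}

def effective_codec(table_codec, session_codec, allowed):
    """Return the codec the writer should use: table-level first,
    falling back to the session-level setting."""
    codec = table_codec if table_codec is not None else session_codec
    codec = codec.lower()
    if codec not in allowed:
        raise ValueError(f"unsupported codec: {codec}")
    return codec

# Table-level setting overrides the session-level one.
print(effective_codec("GZIP", "snappy", PARQUET_CODECS))  # gzip
# With no table-level setting, the session codec is used.
print(effective_codec(None, "zlib", ORC_CODECS))          # zlib
```

The real test asserts the same thing end to end, by writing data and then reading the compression codec back out of the Parquet/ORC file footers.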
One thing to note is that the codecs being tested are not exhaustive;
we pick a subset (perhaps the most popular ones). Another thing is that we
have a 3-way loop over 1) isPartitioned 2) convertMetastore 3) useCTAS on top
of the codec loop, so we will be calling the codec loop 6 times in a test,
once for each unique combination of (isPartitioned, convertMetastore,
useCTAS). And we have changed the codec loop to randomly pick one combination
of table-level and session-level codecs.
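The random selection described above can be sketched as follows; this is an illustrative Python stand-in for the Scala test change, using the Parquet codec subset from the list above:

```python
import random

# Sketch of the change described above: instead of iterating over every
# (table codec, session codec) pair, pick one pair at random per test run.
# The codec subset mirrors the one the test uses for Parquet.
table_codecs = ["uncompressed", "snappy", "gzip"]
session_codecs = ["uncompressed", "snappy", "gzip"]

# Full cross product the old test iterated over: 3 * 3 = 9 pairs.
all_pairs = [(t, s) for t in table_codecs for s in session_codecs]

# One randomly chosen combination per run; over many Jenkins runs,
# all pairs are eventually exercised.
picked = random.choice(all_pairs)
print(picked)
```

Each individual run is cheaper, at the cost of relying on repeated CI runs to cover the whole cross product.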
Given this, I feel we are getting decent coverage, and I also feel we
should be able to catch regressions, since a regression would surface in one
Jenkins run or another. If you still feel uncomfortable, should we take 2
codecs as opposed to 1? It would generate 24 iterations (4 * 6) as opposed
to 54 (9 * 6).
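For reference, the iteration counts quoted above work out as follows (a trivial check, with 6 combinations of the boolean loop on the outside):

```python
# 3 table codecs x 3 session codecs x 6 boolean combinations = 54 iterations;
# sampling 2 codecs per dimension gives 2 x 2 x 6 = 24.
full = 3 * 3 * 6
reduced = 2 * 2 * 6
print(full, reduced)  # 54 24
```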