Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
I tested this with the addition of some changes to ParquetOptions.scala, but
that alone does not allow writing or reading zstd-compressed parquet files,
because parquet uses reflection to load hadoop compression codec classes that
are not in the supplied dependencies.
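For context, the change I tested is roughly the following (a sketch only; the
real map lives in ParquetOptions.scala and the exact shape there may differ):

```scala
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Sketch of the ParquetOptions.scala change I tested: extending the existing
// short-name -> codec lookup map with the codecs that (as I understand it)
// arrive with parquet 1.10. Object name is made up for illustration.
object ParquetCodecShortNames {
  val shortParquetCompressionCodecNames: Map[String, CompressionCodecName] = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO,
    // newly available codecs:
    "lz4" -> CompressionCodecName.LZ4,
    "brotli" -> CompressionCodecName.BROTLI,
    "zstd" -> CompressionCodecName.ZSTD)
}
```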
From what I can see, anyone who wants to use the new compression codecs is
going to have to build their own custom version of spark, and probably with
modified versions of the hadoop libraries as well, including changes to how
the native bindings are built, because that would be easier than updating the
whole thing to hadoop-common 3.0, where the required compressors exist.
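A quick way to see the gap (a sketch; if I am reading the parquet 1.10 code
right, ZSTD is resolved by reflection to hadoop's ZStandardCodec, which only
ships with newer hadoop-common):

```scala
// Check whether the hadoop codec class that parquet loads reflectively for
// ZSTD is actually on the classpath. On the stock spark classpath it is not.
object ZstdCodecCheck {
  def main(args: Array[String]): Unit = {
    // Assumption on my part about the exact class name parquet resolves to.
    val hadoopZstd = "org.apache.hadoop.io.compress.ZStandardCodec"
    val present =
      try { Class.forName(hadoopZstd); true }
      catch { case _: ClassNotFoundException => false }
    println(s"$hadoopZstd on classpath: $present")
  }
}
```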
Alternatively, spark + parquet should avoid the hadoop dependencies like the
plague for compression/decompression. They bring in a steaming heap of
dependencies and possible library conflicts, and users often have versions (or
CDH versions) that don't exactly match.
In my mind, parquet should handle the compression itself, or with a
lightweight dependency.
Perhaps it could use the hadoop flavor if it is found, otherwise another
implementation, or even a user-supplied one, so that it works stand-alone or
inside hadoop without issue.
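Something along these lines is what I have in mind (purely a sketch with
made-up names, using zstd-jni only as an example of a lightweight non-hadoop
binding):

```scala
import scala.util.Try

// Hypothetical backend selection: prefer a user-supplied implementation,
// then the hadoop codec if it is on the classpath, then a lightweight
// binding (zstd-jni here, just as an example). All names are made up.
sealed trait ZstdBackend
case object HadoopZstd extends ZstdBackend
case object ZstdJni extends ZstdBackend
final case class UserSupplied(className: String) extends ZstdBackend

object ZstdBackendChooser {
  private def onClasspath(name: String): Boolean =
    Try(Class.forName(name)).isSuccess

  def choose(userClass: Option[String]): ZstdBackend = userClass match {
    case Some(cls) => UserSupplied(cls)
    case None if onClasspath("org.apache.hadoop.io.compress.ZStandardCodec") => HadoopZstd
    case None if onClasspath("com.github.luben.zstd.Zstd") => ZstdJni
    case None => throw new IllegalStateException("no zstd implementation on the classpath")
  }
}
```

That way the same build works stand-alone or on a hadoop cluster, and there is
a documented place to plug in a replacement when the bundled versions don't
match.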
Right now it is bound together with reflection and an awkward stack of
brittle dependencies with no escape hatch.
Or am I missing something here, and it is possible to read/write with the
new codecs if I configure it differently?
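For reference, this is roughly what I am trying (a sketch; the path and data
are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Roughly what I run against my patched build. With the ParquetOptions change
// the option is accepted, but the write still fails for me because the hadoop
// ZStandardCodec class is not on the classpath.
object ZstdWriteTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("zstd-write-test").getOrCreate()
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

    val df = spark.range(1000).toDF("id")
    df.write.option("compression", "zstd").parquet("/tmp/zstd-test")

    spark.stop()
  }
}
```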