Github user scottcarey commented on the issue:
https://github.com/apache/spark/pull/21070
I tested this with the addition of some changes to ParquetOptions.scala, but
that alone does not allow writing or reading zstd-compressed parquet files,
because parquet uses reflection to load hadoop compression codec classes that
are not in the supplied dependencies.
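For context, the change I tested is roughly the following (a sketch only; the
real map lives in ParquetOptions.scala and the exact shape there may differ):

```scala
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Sketch of the ParquetOptions.scala change I tested: extending the existing
// short-name -> codec lookup map with the codecs that (as I understand it)
// arrive with parquet 1.10. Object name is made up for illustration.
object ParquetCodecShortNames {
  val shortParquetCompressionCodecNames: Map[String, CompressionCodecName] = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO,
    // newly available codecs:
    "lz4" -> CompressionCodecName.LZ4,
    "brotli" -> CompressionCodecName.BROTLI,
    "zstd" -> CompressionCodecName.ZSTD)
}
```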
From what I can see, anyone who wants to use the new compression codecs is
going to have to build their own custom version of spark, and probably with
modified versions of the hadoop libraries as well, including changes to how
the native bindings are built, because that would be easier than updating the
whole thing to hadoop-common 3.0, where the required compressors exist.
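A quick way to see the gap (a sketch; if I am reading the parquet 1.10 code
right, ZSTD is resolved by reflection to hadoop's ZStandardCodec, which only
ships with newer hadoop-common):

```scala
// Check whether the hadoop codec class that parquet loads reflectively for
// ZSTD is actually on the classpath. On the stock spark classpath it is not.
object ZstdCodecCheck {
  def main(args: Array[String]): Unit = {
    // Assumption on my part about the exact class name parquet resolves to.
    val hadoopZstd = "org.apache.hadoop.io.compress.ZStandardCodec"
    val present =
      try { Class.forName(hadoopZstd); true }
      catch { case _: ClassNotFoundException => false }
    println(s"$hadoopZstd on classpath: $present")
  }
}
```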
Alternatively, spark + parquet should avoid the hadoop dependencies like the
plague for compression/decompression. They bring in a steaming heap of
dependencies and possible library conflicts, and users often have versions (or
CDH versions) that don't exactly match.
In my mind, parquet should handle the compression itself, or with a
lightweight dependency.
Perhaps it could use the hadoop flavor if it is found, otherwise another
implementation, or even a user-supplied one, so that it works stand-alone or
inside hadoop without issue.
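Something along these lines is what I have in mind (purely a sketch with
made-up names, using zstd-jni only as an example of a lightweight non-hadoop
binding):

```scala
import scala.util.Try

// Hypothetical backend selection: prefer a user-supplied implementation,
// then the hadoop codec if it is on the classpath, then a lightweight
// binding (zstd-jni here, just as an example). All names are made up.
sealed trait ZstdBackend
case object HadoopZstd extends ZstdBackend
case object ZstdJni extends ZstdBackend
final case class UserSupplied(className: String) extends ZstdBackend

object ZstdBackendChooser {
  private def onClasspath(name: String): Boolean =
    Try(Class.forName(name)).isSuccess

  def choose(userClass: Option[String]): ZstdBackend = userClass match {
    case Some(cls) => UserSupplied(cls)
    case None if onClasspath("org.apache.hadoop.io.compress.ZStandardCodec") => HadoopZstd
    case None if onClasspath("com.github.luben.zstd.Zstd") => ZstdJni
    case None => throw new IllegalStateException("no zstd implementation on the classpath")
  }
}
```

That way the same build works stand-alone or on a hadoop cluster, and there is
a documented place to plug in a replacement when the bundled versions don't
match.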
Right now it is bound together with reflection and an awkward stack of
brittle dependencies with no escape hatch.
Or am I missing something here, and it is possible to read/write with the
new codecs if I configure it differently?
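For reference, this is roughly what I am trying (a sketch; the path and data
are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Roughly what I run against my patched build. With the ParquetOptions change
// the option is accepted, but the write still fails for me because the hadoop
// ZStandardCodec class is not on the classpath.
object ZstdWriteTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("zstd-write-test").getOrCreate()
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

    val df = spark.range(1000).toDF("id")
    df.write.option("compression", "zstd").parquet("/tmp/zstd-test")

    spark.stop()
  }
}
```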