Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20087#discussion_r163132078
  
    --- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
    @@ -55,18 +55,28 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
           customPartitionLocations: Map[TablePartitionSpec, String] = 
Map.empty,
           partitionAttributes: Seq[Attribute] = Nil): Set[String] = {
     
    -    val isCompressed = hadoopConf.get("hive.exec.compress.output", 
"false").toBoolean
    +    val isCompressed =
    +      
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) 
match {
    +        case formatName if formatName.endsWith("orcoutputformat") =>
    +          // For ORC,"mapreduce.output.fileoutputformat.compress",
    +          // "mapreduce.output.fileoutputformat.compress.codec", and
    +          // "mapreduce.output.fileoutputformat.compress.type"
    +          // have no impact because it uses table properties to store 
compression information.
    --- End diff --
    
    For Parquet, when using a Hive client, `parquet.compression` takes priority 
over `mapreduce.output.fileoutputformat.compress`, and table-level compression 
(set via `TBLPROPERTIES`) has the highest priority. `parquet.compression` set 
from the CLI also takes priority over 
`mapreduce.output.fileoutputformat.compress`.
    
    After this PR, the priority order is unchanged. If table-level compression 
is set, other compression settings do not take effect even if 
`mapreduce.output....` is set, which matches Hive's behavior. But 
`parquet.compression` set from the Spark CLI does not take effect unless 
`hive.exec.compress.output` is set to true. This may be because we do not read 
`parquet.compression` from the session, and I wonder whether that is necessary, 
since we have `spark.sql.parquet.compression` instead.
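    To make the priority order described above concrete, here is a minimal, 
hedged Scala sketch. The object and method names are hypothetical (this is 
not the actual `SaveAsHiveFile` code), and it only models the resolution 
order: table property, then session `parquet.compression`, then the 
`mapreduce.output.fileoutputformat.compress*` fallback.

```scala
// Illustrative sketch only; names are hypothetical, not Spark's real code.
object CompressionPriority {
  // Resolve the effective Parquet codec from table properties and the
  // session configuration, following the priority order discussed above.
  def resolveParquetCodec(
      tableProps: Map[String, String],
      sessionConf: Map[String, String]): String = {
    tableProps.get("parquet.compression")               // highest: TBLPROPERTIES
      .orElse(sessionConf.get("parquet.compression"))   // next: set via CLI/session
      .getOrElse {                                      // fallback: mapreduce.output.* settings
        val compress = sessionConf
          .getOrElse("mapreduce.output.fileoutputformat.compress", "false")
          .toBoolean
        if (compress) {
          sessionConf.getOrElse(
            "mapreduce.output.fileoutputformat.compress.codec", "UNCOMPRESSED")
        } else {
          "UNCOMPRESSED"
        }
      }
  }
}
```

    For example, with `parquet.compression` set in both `TBLPROPERTIES` and 
the session, the table property wins; with neither set, the `mapreduce.output.*` 
pair decides.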
    
    For ORC, `hive.exec.compress.output` and `mapreduce.output....` really have 
no impact, but table-level compression (set via `TBLPROPERTIES`) always takes 
effect. `orc.compression` set from the Spark CLI does not take effect either, 
even with `hive.exec.compress.output` set to true, which is different from 
Parquet.
    Another question: the comment says `it uses table properties to store 
compression information`, but actually, from manual testing, I found that ORC 
tables can also contain files with mixed compressions, and the data can still 
be read together correctly.
    
    My Hive version for this test is 1.1.0. Actually, it is a little difficult 
for me to get a newer runnable Hive client.

