Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20087#discussion_r163132078
--- Diff:
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
---
@@ -55,18 +55,28 @@ private[hive] trait SaveAsHiveFile extends
DataWritingCommand {
customPartitionLocations: Map[TablePartitionSpec, String] =
Map.empty,
partitionAttributes: Seq[Attribute] = Nil): Set[String] = {
- val isCompressed = hadoopConf.get("hive.exec.compress.output",
"false").toBoolean
+ val isCompressed =
+
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT)
match {
+ case formatName if formatName.endsWith("orcoutputformat") =>
+ // For ORC,"mapreduce.output.fileoutputformat.compress",
+ // "mapreduce.output.fileoutputformat.compress.codec", and
+ // "mapreduce.output.fileoutputformat.compress.type"
+ // have no impact because it uses table properties to store
compression information.
--- End diff --
For Parquet, using a Hive client, `parquet.compression` has a higher
priority than `mapreduce.output.fileoutputformat.compress`, and table-level
compression (set by TBLPROPERTIES) has the highest priority.
`parquet.compression` set by the CLI also has a higher priority than
`mapreduce.output.fileoutputformat.compress`.
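The priority order described above can be sketched as follows. This is an
illustrative sketch only, not Spark or Hive source code; the function and
parameter names (`resolve_parquet_codec`, `table_properties`,
`session_conf`) are hypothetical, and the fallback codec is an assumption:

```python
def resolve_parquet_codec(table_properties, session_conf):
    """Sketch of the effective Parquet codec resolution, highest
    priority first: table properties, then `parquet.compression`
    from the session/CLI, then the MapReduce output flags."""
    # 1. Table-level compression (TBLPROPERTIES) wins over everything.
    if "parquet.compression" in table_properties:
        return table_properties["parquet.compression"]
    # 2. A session-level `parquet.compression` (e.g. set via the CLI).
    if "parquet.compression" in session_conf:
        return session_conf["parquet.compression"]
    # 3. Fall back to the generic MapReduce output compression flag.
    #    (Default codec here is an assumption for illustration.)
    if session_conf.get("mapreduce.output.fileoutputformat.compress") == "true":
        return session_conf.get(
            "mapreduce.output.fileoutputformat.compress.codec", "SNAPPY")
    return "UNCOMPRESSED"
```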
After this PR, the priority is unchanged. If table-level compression was
set, other compression settings do not take effect, even if
`mapreduce.output....` is set, which is the same as Hive. But
`parquet.compression` set by the Spark CLI does not take effect unless
`hive.exec.compress.output` is set to true. This may be because we do not
get `parquet.compression` from the session, and I wonder if that is
necessary, because we have `spark.sql.parquet.compression.codec` instead.
For ORC, `hive.exec.compress.output` and `mapreduce.output....` really have
no impact, but table-level compression (set by TBLPROPERTIES) always takes
effect. `orc.compression` set by the Spark CLI does not take effect either,
even with `hive.exec.compress.output` set to true, which is different from
Parquet.
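For contrast, the ORC behaviour observed above reduces to a much simpler
rule: the table-level property always applies and the generic flags are
ignored. Again a hypothetical sketch, not real Spark/Hive code; the
`orc.compress` key is the standard ORC table property, and the ZLIB
fallback reflects ORC's documented default:

```python
def resolve_orc_codec(table_properties, session_conf):
    """Sketch of the effective ORC codec resolution per the
    observation above: table-level `orc.compress` always takes
    effect; `hive.exec.compress.output` and the
    `mapreduce.output.fileoutputformat.compress*` flags are ignored."""
    if "orc.compress" in table_properties:
        return table_properties["orc.compress"]
    # Generic compression flags have no effect for ORC;
    # fall back to ORC's default codec.
    return "ZLIB"
```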
Another question: the comment says `it uses table properties to store
compression information`. Actually, by manual testing, I found ORC tables
can also have mixed compressions, and the data can be read back together
correctly.
My Hive version for this test is 1.1.0. Actually, it's a little difficult
for me to get a runnable Hive client of a higher version.
---