AngersZhuuuu opened a new pull request #35528:
URL: https://github.com/apache/spark/pull/35528
### What changes were proposed in this pull request?
Currently, the Spark SQL statement
```
INSERT OVERWRITE DIRECTORY 'path'
STORED AS PARQUET
query
```
can't be converted to use InsertIntoDataSourceCommand and still uses the Hive SerDe
to write data. As a result, features provided by newer Parquet/ORC versions,
such as zstd compression, are unavailable:
```
spark-sql> INSERT OVERWRITE DIRECTORY
'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
> stored as parquet
> select 1 as id;
[Stage 5:>                                                          (0 + 1) / 1]
22/02/15 16:49:31 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5,
ip-xx-xx-xx-xx, executor 21): org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IllegalArgumentException: No enum constant
parquet.hadoop.metadata.CompressionCodecName.ZSTD
at
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at
org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
at
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:269)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:203)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:202)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
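For comparison, the data source variant of the same statement (`USING` instead of `STORED AS`) already goes through the data source write path and accepts newer codecs. A sketch, assuming a Spark version that supports `INSERT OVERWRITE DIRECTORY ... USING` (the target path is the same illustrative one as above):

```sql
-- Data source syntax: written by the built-in Parquet data source,
-- so newer codecs such as zstd are available.
INSERT OVERWRITE DIRECTORY 'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
USING PARQUET
OPTIONS ('compression' 'zstd')
SELECT 1 AS id;
```

This PR aims to give the Hive `STORED AS` form the same behavior by converting it to the data source command.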
### Why are the changes needed?
Converting InsertIntoHiveDirCommand to InsertIntoDataSourceCommand lets this
statement use more features of Parquet/ORC, such as zstd compression.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]