AngersZhuuuu opened a new pull request #35528:
URL: https://github.com/apache/spark/pull/35528


   ### What changes were proposed in this pull request?
   Currently, the Spark SQL statement
   ```
   INSERT OVERWRITE DIRECTORY 'path'
   STORED AS PARQUET
   query
   ```
   is not converted to `InsertIntoDataSourceCommand`; it still uses the Hive SerDe 
to write data. As a result, we cannot use features provided by newer parquet/orc 
versions, such as zstd compression.
   
   ```
   spark-sql> INSERT OVERWRITE DIRECTORY 'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
            > stored as parquet
            > select 1 as id;
   [Stage 5:>                                                          (0 + 1) / 1]
   22/02/15 16:49:31 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, ip-xx-xx-xx-xx, executor 21): org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
        at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
        at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:269)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:203)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:202)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
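   With the conversion to the data source write path, the Parquet writer honors Spark's own compression configuration instead of the Hive SerDe's codec list. A minimal sketch of the intended usage (the directory path is illustrative; `spark.sql.parquet.compression.codec` is the standard Spark SQL config, and `zstd` is among its supported values on recent Spark/Parquet versions):
   ```sql
   -- Select zstd compression for the built-in Parquet data source writer
   SET spark.sql.parquet.compression.codec=zstd;

   -- After this PR, this write goes through the data source path,
   -- so the codec above takes effect instead of failing in the Hive SerDe
   INSERT OVERWRITE DIRECTORY 'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
   STORED AS PARQUET
   SELECT 1 AS id;
   ```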
   
   
   ### Why are the changes needed?
   Converting `InsertIntoHiveDirCommand` to `InsertIntoDataSourceCommand` lets Spark 
support more features of the built-in parquet/orc data sources, such as zstd compression.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


