[GitHub] [spark] wangqia0309 opened a new pull request #32331: [SPARK-35216][SQL] a general auto merge output files feature for datasource api

GitBox Sun, 25 Apr 2021 01:10:38 -0700


wangqia0309 opened a new pull request #32331:
URL: https://github.com/apache/spark/pull/32331



   In most cases, users write data to a Hive table or an HDFS directory with Spark SQL.
Since the release of Spark 3.0, the project has discouraged using the hive module to
read/write Hive tables and prefers switching from the Hive strategy rule to the
datasource API, so as to centralize I/O operations in one module.
   
   So providing a general auto-merge ability for output files in the datasource API would
resolve the small-files problem that many users face in production. It can be bound tightly
to the datasource write framework, so that the auto-merge process is transparent
to users, and it can handle all kinds of write methods, such as writing to
an HDFS directory, a non-partitioned Hive table, or a dynamically partitioned Hive table.
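
   To make the small-files idea concrete, here is a minimal sketch of how output files might be grouped for merging. It is not the implementation in this pull request, which this message does not show. The function name `plan_merge_groups`, the `(path, size)` input shape, and the greedy packing strategy are all assumptions made for illustration only.

```python
from typing import List, Tuple


def plan_merge_groups(
    files: List[Tuple[str, int]], target_bytes: int
) -> List[List[str]]:
    """Greedily pack (path, size_in_bytes) pairs into merge groups.

    Hypothetical helper: each returned group is a list of paths that a
    later step could rewrite as one file. A group is closed as soon as
    adding the next file would push it past target_bytes.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    # Visit the largest files first so oversized files end up alone.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size

    if current:
        groups.append(current)
    return groups
```

   In a real write path the sizes would come from the files a job has just produced, and each group would be rewritten as a single file before the output is committed. The target size would presumably be configurable.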
   
   This is my own implementation of the functionality, and it has been stable
in my company's production environment.
   


