[GitHub] Deegue opened a new pull request #23506: [SPARK-26577][SQL] Add input optimizer when reading Hive table by SparkSQL

GitBox Wed, 09 Jan 2019 22:32:14 -0800

Deegue opened a new pull request #23506: [SPARK-26577][SQL] Add input optimizer 
when reading Hive table by SparkSQL
URL: https://github.com/apache/spark/pull/23506
 
 
   ## What changes were proposed in this pull request?
   
   When using SparkSQL (for example, through the ThriftServer), setting
   
   `spark.sql.hive.fileInputFormat.enabled=true`
   
   automatically switches the InputFormat to CombineTextInputFormat if the table uses TextInputFormat. The maximum and minimum input split sizes can also be configured, for example:
   
   `spark.sql.hive.fileInputFormat.split.maxsize=268435456`
   `spark.sql.hive.fileInputFormat.split.minsize=134217728`
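
As a rough illustration of how these settings might be passed on the command line, the sketch below builds `--conf` arguments for `spark-submit`. The property names are the ones proposed in this pull request and may not exist in any released Spark version. The helper name `build_conf_args` is hypothetical and only serves the example.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def build_conf_args(max_split_mib, min_split_mib):
    """Return spark-submit --conf arguments for the settings proposed in this PR.

    The property names come from the PR description and are an assumption:
    they are not guaranteed to exist in a released Spark version.
    """
    if min_split_mib > max_split_mib:
        raise ValueError("min split size must not exceed max split size")
    settings = {
        "spark.sql.hive.fileInputFormat.enabled": "true",
        "spark.sql.hive.fileInputFormat.split.maxsize": str(max_split_mib * MIB),
        "spark.sql.hive.fileInputFormat.split.minsize": str(min_split_mib * MIB),
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # 256 MiB and 128 MiB correspond to the byte values shown above.
    print(" ".join(build_conf_args(256, 128)))
```

Expressing the sizes in MiB and converting to bytes makes it easier to see that the example values above are 256 MiB and 128 MiB.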
   
   Without this option, we would have to modify the Hive configuration and the table structure.
   
   We tested this with a Hive table backed by many small files in HDFS that had not been combined:
   
   Before the change:
   
![image](https://user-images.githubusercontent.com/25916266/50877374-85e43780-140c-11e9-9724-31d367739552.png)
   
   
   After the change:
   
![image](https://user-images.githubusercontent.com/25916266/50877387-9694ad80-140c-11e9-99e2-f55a3c7285e0.png)
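
To give a feel for why combining helps with many small files, here is a deliberately simplified model that packs files greedily into splits up to a maximum size. It is not the actual CombineTextInputFormat algorithm, which also takes node and rack locality into account, and the file sizes are made-up numbers rather than measurements from this test. The function name `count_combined_splits` is hypothetical.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def count_combined_splits(file_sizes, max_split_size):
    """Greedily pack files, in order, into splits of at most max_split_size bytes.

    Simplified model only: real combine input formats also consider data
    locality. A file larger than max_split_size gets a split of its own here.
    """
    splits = 0
    current = 0  # bytes accumulated in the split being built
    for size in file_sizes:
        if current > 0 and current + size > max_split_size:
            splits += 1  # close the current split and start a new one
            current = 0
        current += size
    if current > 0:
        splits += 1  # count the final, partially filled split
    return splits


if __name__ == "__main__":
    # Hypothetical workload: 1000 files of 1 MiB each.
    files = [1 * MIB] * 1000
    # Without combining, each non-empty file yields at least one split.
    print("uncombined splits:", len(files))
    print("combined splits:", count_combined_splits(files, 256 * MIB))
```

Under these assumed numbers the split count, and therefore the number of tasks, drops from one per file to a handful.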
   
   
   
   ## How was this patch tested?
   
   Added a test.



