[jira] [Updated] (SPARK-22233) filter out empty InputSplit in HadoopRDD

Lijia Liu (JIRA) Mon, 09 Oct 2017 23:51:34 -0700

     [ https://issues.apache.org/jira/browse/SPARK-22233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Lijia Liu updated SPARK-22233:
------------------------------
    Description: 
Sometimes Hive creates an empty table backed by many empty files. Spark uses 
the InputFormat stored in the Hive Metastore, which does not combine the empty 
files, and therefore generates many tasks to handle these empty files.
Hive uses CombineHiveInputFormat (hive.input.format) by default.
So, in this case, Spark spends far more resources than Hive.

2 suggestions:
1. Add a configuration option to filter out empty InputSplits in HadoopRDD.
2. Add a configuration option that lets the user customize the InputFormat 
class in HadoopTableReader.

  was:
Sometimes Hive creates an empty table backed by many empty files. Spark uses 
the InputFormat stored in the Hive Metastore, which does not combine the empty 
files, and therefore generates many tasks to handle these empty files.
Hive uses CombineHiveInputFormat (hive.input.format) by default.
So, in this case, Spark spends far more resources than Hive.

3 suggestions:
1. Add a configuration option to filter out empty InputSplits in HadoopRDD.
2. Add a configuration option that lets the user customize the InputFormat 
class in HadoopTableReader.


> filter out empty InputSplit in HadoopRDD
> ----------------------------------------
>
>                 Key: SPARK-22233
>                 URL: https://issues.apache.org/jira/browse/SPARK-22233
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: spark version:Spark 2.2
> master: yarn
> deploy-mode: cluster
>            Reporter: Lijia Liu
>
> Sometimes Hive creates an empty table backed by many empty files. Spark uses 
> the InputFormat stored in the Hive Metastore, which does not combine the 
> empty files, and therefore generates many tasks to handle these empty files.
> Hive uses CombineHiveInputFormat (hive.input.format) by default.
> So, in this case, Spark spends far more resources than Hive.
> 2 suggestions:
> 1. Add a configuration option to filter out empty InputSplits in HadoopRDD.
> 2. Add a configuration option that lets the user customize the InputFormat 
> class in HadoopTableReader.
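
The first suggestion can be sketched roughly as follows. This is only an illustration of the idea, not the actual HadoopRDD change: the `Split` interface is a hypothetical stand-in for Hadoop's `org.apache.hadoop.mapred.InputSplit` (whose real `getLength()` may throw `IOException`), and `FileSplitStub`, `filterEmpty`, and the `ignoreEmptySplits` flag are names assumed here for the example. Since one task is created per split, dropping zero-length splits before partitions are built would avoid the extra tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptySplitFilter {

    // Hypothetical stand-in for org.apache.hadoop.mapred.InputSplit.
    // The real interface also exposes getLength(), but declares IOException.
    public interface Split {
        long getLength();
    }

    // Minimal split implementation used only for this illustration.
    public static final class FileSplitStub implements Split {
        private final String path;
        private final long length;

        public FileSplitStub(String path, long length) {
            this.path = path;
            this.length = length;
        }

        @Override
        public long getLength() {
            return length;
        }

        @Override
        public String toString() {
            return path + " (" + length + " bytes)";
        }
    }

    // Returns the splits that should become tasks. When ignoreEmptySplits
    // (an assumed configuration flag) is false, the input is returned unchanged.
    public static List<Split> filterEmpty(List<Split> splits, boolean ignoreEmptySplits) {
        if (!ignoreEmptySplits) {
            return splits;
        }
        List<Split> nonEmpty = new ArrayList<>();
        for (Split split : splits) {
            if (split.getLength() > 0) {
                nonEmpty.add(split);
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        // Two empty files and one file with data, as in an almost-empty Hive table.
        List<Split> splits = Arrays.asList(
            new FileSplitStub("part-00000", 0L),
            new FileSplitStub("part-00001", 0L),
            new FileSplitStub("part-00002", 128L));

        System.out.println("tasks without filtering: " + filterEmpty(splits, false).size());
        System.out.println("tasks with filtering:    " + filterEmpty(splits, true).size());
    }
}
```

Keeping the behaviour behind a flag, as the suggestion proposes, leaves the existing one-task-per-file behaviour as the default for jobs that may rely on the current partition count.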



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
