GitHub user chenghao-intel reopened a pull request:

    https://github.com/apache/spark/pull/2589

    [SPARK-3739] [SQL] Update the split number based on block size for table scanning

    Input splits for source files are probably better based on the HDFS block
    size, or on the 'mapred.map.tasks' setting, when scanning a Hive table,
    rather than on 'defaultMinPartitions'. See
    http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203
    for how FileInputFormat computes input splits by default.

    Currently a small table file is scanned as 2 splits, which can cause a
    performance problem or query results that differ from Hive's.
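
    For reference, here is a minimal Scala sketch of the split sizing the
    linked FileInputFormat source performs. The 64 MB file and 128 MB block
    size are hypothetical, and computeSplitSize mirrors the Hadoop method of
    the same name rather than being Spark API:

        // goalSize is totalSize divided by the requested number of splits
        // (e.g. from 'mapred.map.tasks' or Spark's minSplits argument); the
        // split size is then clamped between minSize and the block size.
        def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
          math.max(minSize, math.min(goalSize, blockSize))

        val totalSize = 64L * 1024 * 1024    // hypothetical small table file
        val blockSize = 128L * 1024 * 1024   // hypothetical HDFS block size
        val minSize   = 1L

        // defaultMinPartitions = 2 gives goalSize = 32 MB -> two 32 MB splits.
        println(computeSplitSize(totalSize / 2, minSize, blockSize))
        // numSplits = 1 gives goalSize = 64 MB -> one split for the whole file.
        println(computeSplitSize(totalSize / 1, minSize, blockSize))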

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark source_split

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2589.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2589
    
----
commit c78a0450abde75355b901d5adb78a5a2f73aec64
Author: Cheng Hao <[email protected]>
Date:   2014-10-10T05:26:09Z

    Keep 1 split for small file in table scanning

----
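
As a rough illustration of the commit above, one way to keep a small file in a
single split is to derive the minSplits hint for the Hive table scan from
'mapred.map.tasks' rather than passing 'defaultMinPartitions' unconditionally.
The sketch below is an assumption about the approach, not a quote of the
patch; minSplitsForTableScan is a hypothetical helper:

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.spark.SparkContext

    // Hypothetical helper computing the minSplits hint handed to hadoopRDD.
    def minSplitsForTableScan(sc: SparkContext, hiveConf: HiveConf): Int =
      if (sc.isLocal) {
        // Hadoop ignores 'mapred.map.tasks' in local mode; passing 0 lets
        // FileInputFormat split purely by block size, so a file smaller
        // than one block stays in a single split.
        0
      } else {
        // Otherwise honor an explicit 'mapred.map.tasks' when it asks for
        // more splits than Spark's default.
        math.max(hiveConf.getInt("mapred.map.tasks", 1), sc.defaultMinPartitions)
      }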

