GitHub user chenghao-intel reopened a pull request:
https://github.com/apache/spark/pull/2589
[SPARK-3739] [SQL] Update the split num based on block size for table scanning
While scanning a Hive table, the input splits of the source files are
probably better based on the HDFS block size, or on the 'mapred.map.tasks'
setting, rather than on 'defaultMinPartitions'.
Check out
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203
for how FileInputFormat computes the input splits by default.
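For reference, that default computation boils down to roughly the following
(a simplified Scala transcription of the linked getSplits; it ignores the
per-file iteration and the 10% slop factor the real code applies):

    // Sketch of org.apache.hadoop.mapred.FileInputFormat's split math.
    // goalSize is the input size divided evenly across the requested
    // number of splits; minSize comes from 'mapred.min.split.size'.
    def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
      math.max(minSize, math.min(goalSize, blockSize))

    def numSplits(totalSize: Long, requestedSplits: Int,
                  minSize: Long, blockSize: Long): Long = {
      val goalSize  = totalSize / math.max(requestedSplits, 1)
      val splitSize = computeSplitSize(goalSize, minSize, blockSize)
      (totalSize + splitSize - 1) / splitSize  // ceiling division
    }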
Currently, even a small table file is scanned with 2 splits, which can
cause a performance issue, or query results that differ from Hive's.
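Plugging numbers into the sketch above makes the symptom concrete: a 10 MB
file on 128 MB HDFS blocks with requestedSplits = 2 gives goalSize = 5 MB,
hence splitSize = 5 MB and 2 splits; with requestedSplits = 1 the whole file
stays in one split, as Hive would have it. One hypothetical way for the
table reader to pick its minimum split count (the helper name and wiring
are illustrative, not this patch's exact code):

    import org.apache.hadoop.mapred.JobConf

    // Honour an explicit 'mapred.map.tasks' if the user set one; otherwise
    // request a single split so FileInputFormat falls back to splitting by
    // the HDFS block size alone. Assumes the key is absent unless the user
    // set it (Hadoop's own default resources may supply 2).
    def minSplitsForTableScan(conf: JobConf): Int =
      math.max(conf.getInt("mapred.map.tasks", 1), 1)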
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chenghao-intel/spark source_split
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2589.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2589
----
commit c78a0450abde75355b901d5adb78a5a2f73aec64
Author: Cheng Hao <[email protected]>
Date: 2014-10-10T05:26:09Z
Keep 1 split for small file in table scanning
----