[
https://issues.apache.org/jira/browse/HIVE-16758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070198#comment-16070198
]
BELUGA BEHR commented on HIVE-16758:
------------------------------------
[~aihuaxu]
The HDFS default block replication factor is different from the
{{mapreduce.client.submit.file.replication}} setting. Files like job.xml and
the JARs stored in the distributed cache are written with a higher replication
factor so that it is more likely that the blocks end up on the nodes assigned
to run the Map and Reduce tasks. The same scenario applies here: distribute the
HashSink output file to a sufficient number of nodes so that there is a good
chance the blocks end up on the nodes running the other Spark Executors.
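For illustration, here is a minimal sketch of writing a small file with an elevated replication factor, in the spirit of what the MR client does for submit files; the helper class is an assumption for this example, and the default of 10 mirrors the {{mapreduce.client.submit.file.replication}} default in mapred-default.xml, not code from the attached patch:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedUpload {
  // Same key the MR client uses for job.xml and dist-cache JARs.
  private static final String SUBMIT_REPLICATION_KEY =
      "mapreduce.client.submit.file.replication";

  public static void upload(Configuration conf, Path dest, byte[] payload)
      throws IOException {
    FileSystem fs = dest.getFileSystem(conf);
    // 10 is the mapred-default.xml default for the submit-file replication.
    short replication = (short) conf.getInt(SUBMIT_REPLICATION_KEY, 10);
    // Create the file with the elevated replication so its blocks are spread
    // across many DataNodes before the tasks start reading it.
    try (FSDataOutputStream out = fs.create(
        dest,
        true,                                     // overwrite
        conf.getInt("io.file.buffer.size", 4096), // buffer size
        replication,
        fs.getDefaultBlockSize(dest))) {
      out.write(payload);
    }
  }
}
{code}
Spreading the blocks this way trades a little extra write traffic for a much better chance that any given executor finds a local replica.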
> Better Select Number of Replications
> ------------------------------------
>
> Key: HIVE-16758
> URL: https://issues.apache.org/jira/browse/HIVE-16758
> Project: Hive
> Issue Type: Improvement
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Minor
> Attachments: HIVE-16758.1.patch
>
>
> {{org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.java}}
> We should be smarter about how we pick a replication number. We should add a
> new configuration equivalent to {{mapreduce.client.submit.file.replication}}.
> The value should be around the square root of the number of nodes in the
> cluster rather than hard-coded.
> {code}
> public static final String DFS_REPLICATION_MAX = "dfs.replication.max";
> private int minReplication = 10;
>
> @Override
> protected void initializeOp(Configuration hconf) throws HiveException {
>   ...
>   int dfsMaxReplication = hconf.getInt(DFS_REPLICATION_MAX, minReplication);
>   // minReplication value should not cross the value of dfs.replication.max
>   minReplication = Math.min(minReplication, dfsMaxReplication);
> }
> {code}
> https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
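A rough sketch of what the proposed selection could look like; the config key {{hive.spark.hashtable.file.replication}} and the executor-node count parameter are hypothetical placeholders for illustration, not the names used in the attached patch:
{code}
import org.apache.hadoop.conf.Configuration;

public final class ReplicationChooser {
  public static final String DFS_REPLICATION_MAX = "dfs.replication.max";
  // Hypothetical key, named here only for illustration.
  public static final String HASHSINK_REPLICATION =
      "hive.spark.hashtable.file.replication";

  /** Pick a replication near sqrt(cluster size), capped by dfs.replication.max. */
  public static short chooseReplication(Configuration conf, int numExecutorNodes) {
    // An explicit user setting wins; otherwise roughly sqrt(number of nodes).
    int requested = conf.getInt(HASHSINK_REPLICATION,
        (int) Math.ceil(Math.sqrt(numExecutorNodes)));
    // Never ask for more than HDFS allows; 512 is the hdfs-default.xml default.
    int dfsMax = conf.getInt(DFS_REPLICATION_MAX, 512);
    return (short) Math.max(1, Math.min(requested, dfsMax));
  }
}
{code}
Capping at {{dfs.replication.max}} keeps the request within what the NameNode will accept, while the square-root heuristic scales the copy count with cluster size instead of a fixed constant.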