Milind Bhandarkar commented on PIG-856:


Setting dfs.replication for the whole Pig script will result in all output
(including output from store) having the same replication. I believe you want
to internally set the replication to 2 for temporary data only.

Secondly, setting dfs.replication to 2 has minimal effect on the namenode,
because it results in only one RPC to the namenode regardless of the
replication factor. On the datanodes, it may have some effect in reducing raw
disk space (but it's temporary data only). In terms of network usage, it also
has a really small effect on performance. (The first replica is local, the
second is on a different rack, and the third is within the same rack as the
second, so a replication of 2 will only eliminate the intra-rack network
traffic between the second and third replicas.)
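
To make the network argument concrete, here is a rough sketch that counts
block transfers per written block under the placement described above (first
replica local, second on a different rack, third on the same rack as the
second). The function name and the returned keys are made up for illustration;
this is a simplified model, not Hadoop code, and it only covers replication
factors 1 through 3.

```python
def replica_transfers(replication: int) -> dict:
    """Count transfers needed to write one block, per the placement above.

    Hypothetical helper for illustration only; it models the placement
    described in this comment and ignores everything else.
    """
    if replication not in (1, 2, 3):
        raise ValueError("this sketch only models replication factors 1 to 3")
    return {
        # The first replica is written on the local datanode: no network hop.
        "local_writes": 1,
        # The second replica goes to a datanode on a different rack.
        "inter_rack": 1 if replication >= 2 else 0,
        # The third replica goes to another datanode on the second's rack.
        "intra_rack": 1 if replication >= 3 else 0,
    }


if __name__ == "__main__":
    for factor in (3, 2, 1):
        print(factor, replica_transfers(factor))
```

In this model, going from 3 to 2 removes only the intra-rack transfer, while
going from 2 to 1 removes the inter-rack transfer as well.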

Have you considered failure scenarios if replication for temporary data is set
to 1? (That would be a nicer tradeoff between re-executing a job and
inter-rack network usage.)

> PERFORMANCE: reduce number of replicas
> --------------------------------------
>                 Key: PIG-856
>                 URL: https://issues.apache.org/jira/browse/PIG-856
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Olga Natkovich
> Currently Pig uses the default number of replicas between MR jobs, which
> is 3. Given the temp nature of the data, we should never need more
> than 2 and should explicitly set it to improve performance and to be nicer
> to the name node.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
