[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721565#action_12721565 ]
Milind Bhandarkar commented on PIG-856:
---------------------------------------

Olga, setting dfs.replication for the whole Pig script will result in all output (including output from store) having the same replication. I believe you want to internally set the replication to 2 for temporary data only.

Secondly, setting the DFS replication to 2 has minimal effect on the namenode, because it results in only one RPC to the namenode regardless of the replication factor. In terms of the datanodes, it may have some effect in reducing raw disk space on the datanodes (but it's temporary data only). In terms of network usage, too, it has a really small effect on performance. (The first replica is local, the second is on a different rack, and the third is in the same rack as the second, so a replication of 2 will only remove the intra-rack network usage between the second and third replicas.)

Have you considered failure scenarios if replication for temporary data is set to 1? (That would be a nicer tradeoff between re-executing a job and inter-rack network usage.)

> PERFORMANCE: reduce number of replicas
> --------------------------------------
>
>                 Key: PIG-856
>                 URL: https://issues.apache.org/jira/browse/PIG-856
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Olga Natkovich
>
> Currently Pig uses the default number of replicas between MR jobs. Currently,
> the number is 3. Given the temp nature of the data, we should never need more
> than 2 and should explicitly set it to improve performance and to be nicer
> to the name node.
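
The placement argument in the comment above can be checked with a small count. The following is only a toy model, assuming exactly the placement described there (first replica on the writer's node, second on another rack, third in the same rack as the second); the function name transfers_per_block is hypothetical and is not part of Pig or Hadoop.

```python
# Toy model of the replica placement described in the comment:
#   replica 1: on the writer's own node (no network copy)
#   replica 2: on a node in a different rack (one inter-rack copy)
#   replica 3: on another node in the same rack as replica 2 (one intra-rack copy)
# Illustrative arithmetic only; this does not call any Hadoop API.

def transfers_per_block(replication):
    """Return (inter_rack, intra_rack) network copies needed to write one block."""
    if replication < 1 or replication > 3:
        raise ValueError("this model only covers replication factors 1 to 3")
    inter_rack = 1 if replication >= 2 else 0  # writer's node -> remote rack
    intra_rack = 1 if replication >= 3 else 0  # second replica -> third replica
    return inter_rack, intra_rack


if __name__ == "__main__":
    for r in (1, 2, 3):
        inter, intra = transfers_per_block(r)
        print(f"replication={r}: inter-rack copies={inter}, "
              f"intra-rack copies={intra}, disk copies={r}")
```

Under this model, going from 3 to 2 removes only the intra-rack copy, while going from 2 to 1 removes the inter-rack copy, which is the tradeoff against re-executing a job that the comment raises.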