[ https://issues.apache.org/jira/browse/CRUNCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875038#comment-15875038 ]
Attila Sasvari commented on CRUNCH-636:
---------------------------------------

I have a PoC suggesting that the approach I previously recommended is fragile: I ran a sample dataflow three times, and the replication settings were not set deterministically.

[~joshwills] What is your opinion about this ticket/feature?

If we allow users to set a different replication factor for intermediate files and they set it to 1, then a failure of a disk that stores that data before the pipeline finishes would crash the whole Crunch pipeline. If a job has both temporary and non-temporary output, then the replication factor should be the one used for the non-temporary output. I don't know all the possible cases, but it doesn't seem trivial to me.

> Make replication factor for temporary files configurable
> --------------------------------------------------------
>
>                 Key: CRUNCH-636
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-636
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>            Assignee: Attila Sasvari
>
> As of now, Crunch does not allow having different replication factors for
> temporary files and non-temporary files (e.g. final output data of leaf
> nodes) at the same time. If a user has a large amount of data (say hundreds
> of gigabytes) to process, they might want to have a lower replication factor
> for large temporary files between Crunch jobs.
> We could make this configurable via a new setting (e.g.
> {{crunch.tmp.dir.replication}}).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
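
The rule discussed in the comment above (a job that writes any non-temporary output keeps the regular replication factor, and only a purely temporary job may use the lower one) could be sketched roughly as follows. This is a minimal illustration and not Crunch code: the class and method names are hypothetical, a plain `Map` stands in for a Hadoop `Configuration`, and `crunch.tmp.dir.replication` is only the setting proposed in the ticket, not an existing option.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Crunch.
class TmpReplicationSketch {
    // Setting proposed in CRUNCH-636 (an assumption, not an existing option).
    static final String TMP_REPLICATION_KEY = "crunch.tmp.dir.replication";
    // Standard HDFS property for the default block replication.
    static final String DFS_REPLICATION_KEY = "dfs.replication";

    // Picks the replication factor for a single job's output.
    static int chooseReplication(Map<String, String> conf,
                                 boolean hasTemporaryOutput,
                                 boolean hasFinalOutput) {
        int regular = Integer.parseInt(conf.getOrDefault(DFS_REPLICATION_KEY, "3"));
        // Any non-temporary output forces the regular replication factor.
        if (hasFinalOutput || !hasTemporaryOutput) {
            return regular;
        }
        // Purely temporary output: use the proposed setting if it is present.
        String tmp = conf.get(TMP_REPLICATION_KEY);
        return tmp == null ? regular : Integer.parseInt(tmp);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DFS_REPLICATION_KEY, "3");
        conf.put(TMP_REPLICATION_KEY, "1");
        System.out.println(chooseReplication(conf, true, false)); // temporary only
        System.out.println(chooseReplication(conf, true, true));  // mixed outputs
        System.out.println(chooseReplication(conf, false, true)); // final only
    }
}
```

The sketch covers only the choice of value. It does not address the fragility seen in the PoC, where the setting was not applied deterministically across runs, nor the risk of losing intermediate data when the factor is set to 1.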