[
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Harsh J updated HDFS-2936:
--------------------------
Description:
If an admin wishes to enforce a minimum replication factor today for all the users of their
cluster, they may set {{dfs.namenode.replication.min}}. This property prevents
users from creating files with a replication factor lower than that minimum.
However, the minimum replication value set above is also checked at several
other points, especially during completeFile (close) operations. If a write's
pipeline ends up with fewer than the minimum number of nodes, the completeFile
operation does not successfully close the file and the client hangs waiting for
the NN to replicate the last bad block in the background. This form of hard
guarantee can, for example, bring down HBase clusters during high xceiver load
on the DNs, disk fill-ups on many of them, etc.
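For reference, a minimal hdfs-site.xml sketch of how this is enforced today; the
value of 2 is purely illustrative:
{code:xml}
<!-- hdfs-site.xml: cluster-wide minimum replication factor.
     Files requested with a lower replication factor are rejected, and
     today a block must also reach this many live replicas before
     completeFile (close) can succeed. -->
<property>
  <name>dfs.namenode.replication.min</name>
  <value>2</value>
</property>
{code}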
I propose we split the property into two parts (a configuration sketch of the
result follows the list):
* dfs.namenode.replication.min
** Keeps the same name, but is only checked against the replication factor given
at file creation time and during adjustments made via setrep, etc.
* dfs.namenode.replication.min.for.write
** New property that takes over the rest of the checks from the above property,
such as the checks done during block commit, file complete/close, safemode
checks for block availability, etc.
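A minimal hdfs-site.xml sketch of the proposed split; {{dfs.namenode.replication.min.for.write}}
is only the name proposed above and does not exist yet, and the values are illustrative:
{code:xml}
<!-- Checked only at file creation time and on setrep adjustments. -->
<property>
  <name>dfs.namenode.replication.min</name>
  <value>2</value>
</property>

<!-- Proposed: checked at block commit, file complete/close and safemode
     block-availability time, so a degraded pipeline can still close files. -->
<property>
  <name>dfs.namenode.replication.min.for.write</name>
  <value>1</value>
</property>
{code}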
Alternatively, we may also choose to remove the client-side hang of
completeFile/close calls by bounding them with a set number of retries. This
would require further discussion about how a file handle that fails to close
ought to be handled.
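A rough Java sketch of that alternative follows; it is hypothetical and not taken
from DFSClient, and the completion attempt is abstracted behind a supplier because
the exact {{ClientProtocol#complete}} signature varies across versions:
{code:java}
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: bound the client-side completeFile wait with retries. */
class BoundedClose {
  /**
   * Invokes completeAttempt (a stand-in for the real completeFile RPC, which
   * returns false while the last block has fewer live replicas than the
   * minimum replication) up to maxRetries times. Returns false if the file
   * still cannot be closed, leaving the handling policy to the caller.
   */
  static boolean completeWithRetries(BooleanSupplier completeAttempt,
      int maxRetries, long retryIntervalMs) throws InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      if (completeAttempt.getAsBoolean()) {
        return true; // NN accepted the close.
      }
      Thread.sleep(retryIntervalMs); // Wait before asking the NN again.
    }
    return false; // Retry budget exhausted; file remains open.
  }
}
{code}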
was:
Currently, an admin who would like to enforce a replication factor for all files
on their HDFS has no good way to do so. They may arguably set dfs.replication.min,
but that is a very hard guarantee: if the pipeline cannot provide that many nodes
due to some reason/failure, close() does not succeed on the file being written,
which leads to several issues.
After discussing with Todd, we feel it would make sense to introduce a second
config (defaulting to ${dfs.replication.min}) which would act as the minimum
replication a user may specify for files. This is different from
dfs.replication.min, which also ensures that many replicas are recorded before
completeFile() returns... perhaps something like ${dfs.replication.min.user}.
We can leave dfs.replication.min alone for hard guarantees and add
${dfs.replication.min.for.block.completion}, which could be left at 1 even if
dfs.replication.min is >1, and let files complete normally even though they may
end up with a low replication factor (so they can be monitored and accounted for later).
I prefer the second option myself. Will post a patch with tests soon.
Summary: File close()-ing hangs indefinitely if the number of live
blocks does not match the minimum replication (was: Provide a better way to
specify a HDFS-wide minimum replication requirement)
> File close()-ing hangs indefinitely if the number of live blocks does not
> match the minimum replication
> -------------------------------------------------------------------------------------------------------
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Affects Versions: 0.23.0
> Reporter: Harsh J
> Assignee: Harsh J
> Attachments: HDFS-2936.patch
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira