[
https://issues.apache.org/jira/browse/HDFS-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879232#comment-13879232
]
Eric Sirianni commented on HDFS-5318:
-------------------------------------
Arpit - thanks for the quick feedback.
Unless otherwise specified, I have made the changes as you suggested.
bq. I think it would be a good idea to rename READ_ONLY to READ_ONLY_SHARED
everywhere (including comments) and document the meaning. What do you think?
I agree, and have made the change in my new patch.
bq. Remove lines with whitespace-only changes.
I only saw one such instance (in {{SimulatedFsDataset}}) and have corrected it.
bq. I am not sure if this change affects lease recovery. It would be good if
someone familiar with how lease recovery works can take a quick look at the
approach. We don't want read-only replicas to participate in recovery. What if
all read-write replicas are unavailable?
In our {{FsDatasetSpi}} plugin implementation, replicas are not exposed from
{{READ_ONLY_SHARED}} storages until they are _finalized_ (by "exposed" I mean
returned from incremental and full block reports). This skirts many issues
around lease recovery. However:
* The NameNode should not rely on this implementation detail of our
{{FsDatasetSpi}} plugin
* Finalized replicas can still participate in certain lease recovery scenarios
I looked a bit into how the recovery locations are chosen. The locations used
in lease recovery come from
{{BlockInfoUnderConstruction.getExpectedStorageLocations()}}. That list is
potentially updated with {{READ_ONLY_SHARED}} locations when a
{{BLOCK_RECEIVED}} is processed from a {{READ_ONLY_SHARED}} storage.
It is fairly simple to exclude read-only replicas when {{DatanodeManager}}
builds the {{BlockRecoveryCommand}} (much like stale locations are handled).
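For illustration, a rough sketch of that filtering step (the accessor names and the {{State}} enum placement here are my assumptions, not the actual patch):
{code}
// Sketch only: exclude READ_ONLY_SHARED storages when selecting recovery
// locations, analogous to how stale locations are skipped today.
List<DatanodeStorageInfo> recoveryLocations =
    new ArrayList<DatanodeStorageInfo>();
for (DatanodeStorageInfo storage : expectedStorages) {
  if (storage.getState() != DatanodeStorage.State.READ_ONLY_SHARED) {
    recoveryLocations.add(storage);
  }
}
// recoveryLocations would then be used to build the BlockRecoveryCommand
{code}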
However (as you point out), it's unclear what to do here if there are *zero*
read-write replicas available. I believe the desired behavior is for recovery
to fail and the block to be discarded (we can't make any stronger guarantees,
given the lack of fsync()'d data on read-only nodes). I think we would need to
short-circuit recovery earlier in this case (perhaps in
{{FSNamesystem.internalReleaseLease()}}), since it's probably too late by the
time {{DatanodeManager}} builds the {{BlockRecoveryCommand}}. Am I wrong here?
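For concreteness, a guard along these lines (sketch only; identifiers are assumed):
{code}
// Sketch: short-circuit recovery when no read-write replica exists, rather
// than attempting recovery from un-fsync()'d read-only data.
boolean hasReadWriteReplica = false;
for (DatanodeStorageInfo storage : uc.getExpectedStorageLocations()) {
  if (storage.getState() != DatanodeStorage.State.READ_ONLY_SHARED) {
    hasReadWriteReplica = true;
    break;
  }
}
if (!hasReadWriteReplica) {
  // fail recovery and discard the block here
}
{code}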
Another alternative is to skip read-only storages in
{{BlockInfoUnderConstruction.addReplicaIfNotPresent()}} (though this may have
the undesired effect of no-opping block-received messages from read-only
storages). Thoughts?
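If we went that route, it would look something like this (sketch; the caveat is as noted above):
{code}
// Alternative sketch: ignore read-only storages entirely when tracking
// expected locations. Note this silently no-ops block-received messages
// from READ_ONLY_SHARED storages.
void addReplicaIfNotPresent(DatanodeStorageInfo storage,
    Block block, ReplicaState rState) {
  if (storage.getState() == DatanodeStorage.State.READ_ONLY_SHARED) {
    return;
  }
  // ... existing logic unchanged ...
}
{code}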
> Support read-only and read-write paths to shared replicas
> ---------------------------------------------------------
>
> Key: HDFS-5318
> URL: https://issues.apache.org/jira/browse/HDFS-5318
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.4.0
> Reporter: Eric Sirianni
> Attachments: HDFS-5318.patch, HDFS-5318a-branch-2.patch,
> HDFS-5318b-branch-2.patch, hdfs-5318.pdf
>
>
> There are several use cases for shared storage as datanode block storage in
> an HDFS environment (e.g., storing cold blocks on a NAS device, Amazon S3,
> etc.).
> With shared storage, there is a distinction between:
> # a distinct physical copy of a block
> # an access path to that block via a datanode.
> A single 'replication count' metric cannot accurately capture both aspects.
> However, for most of the current uses of 'replication count' in the Namenode,
> the "number of physical copies" aspect seems to be the appropriate semantic.
> I propose altering the replication counting algorithm in the Namenode to
> accurately infer distinct physical copies in a shared storage environment.
> With HDFS-5115, a {{StorageID}} is a UUID. I propose associating some minor
> additional semantics to the {{StorageID}} - namely that multiple datanodes
> attaching to the same physical shared storage pool should report the same
> {{StorageID}} for that pool. A minor modification would be required in the
> DataNode to enable the generation of {{StorageID}} s to be pluggable behind
> the {{FsDatasetSpi}} interface.
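> As a sketch, the pluggable piece might look like this (the method name is
> hypothetical; only the idea of delegating {{StorageID}} generation is
> proposed):
> {code}
> // Hypothetical sketch (method name assumed): let the dataset implementation
> // choose the StorageID it reports, so datanodes attached to the same
> // physical pool can report the same UUID.
> public interface FsDatasetSpi<V extends FsVolumeSpi> {
>   /** Returns the StorageID to report for the given volume. */
>   String getStorageID(V volume);
>   // ... existing methods unchanged ...
> }
> {code}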
> With those semantics in place, the number of physical copies of a block in a
> shared storage environment can be calculated as the number of _distinct_
> {{StorageID}} s associated with that block.
> Consider the following combinations for two {{(DataNode ID, Storage ID)}}
> pairs {{(DN_A, S_A) (DN_B, S_B)}} for a given block B:
> * {{DN_A != DN_B && S_A != S_B}} - *different* access paths to *different*
> physical replicas (i.e. the traditional HDFS case with local disks)
> ** → Block B has {{ReplicationCount == 2}}
> * {{DN_A != DN_B && S_A == S_B}} - *different* access paths to the *same*
> physical replica (e.g. HDFS datanodes mounting the same NAS share)
> ** → Block B has {{ReplicationCount == 1}}
> For example, if block B has the following location tuples:
> * {{DN_1, STORAGE_A}}
> * {{DN_2, STORAGE_A}}
> * {{DN_3, STORAGE_B}}
> * {{DN_4, STORAGE_B}},
> the effect of this proposed change would be to calculate the replication
> factor in the namenode as *2* instead of *4*.
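> For illustration, the proposed counting boils down to this self-contained
> sketch (hypothetical helper, not part of the patch):
> {code}
> import java.util.HashSet;
> import java.util.Set;
>
> public class ReplicationCounter {
>   /** Each location is a (datanodeId, storageId) pair. */
>   static int distinctPhysicalCopies(String[][] locations) {
>     Set<String> storageIds = new HashSet<String>();
>     for (String[] loc : locations) {
>       storageIds.add(loc[1]); // count distinct StorageIDs only
>     }
>     return storageIds.size();
>   }
>
>   public static void main(String[] args) {
>     String[][] locations = {
>       {"DN_1", "STORAGE_A"}, {"DN_2", "STORAGE_A"},
>       {"DN_3", "STORAGE_B"}, {"DN_4", "STORAGE_B"}
>     };
>     System.out.println(distinctPhysicalCopies(locations)); // prints 2
>   }
> }
> {code}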