[
https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791865#comment-16791865
]
Erik Krogen commented on HDFS-14211:
------------------------------------
I agree that there is potential issue with that [~csun], but I think allowing
this to be configurable lets different clients make the decision of where on
the performance vs. consistency spectrum they want to sit.
Thanks for the reviews, [~shv], I have uploaded a v001 patch making it
configurable over a time period.
> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>
> Key: HDFS-14211
> URL: https://issues.apache.org/jira/browse/HDFS-14211
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Major
> Attachments: HDFS-14211.000.patch, HDFS-14211.001.patch
>
>
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a
> consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a
> client to fetch the latest transaction ID from the Active NN, thereby
> ensuring that subsequent reads from the ObserverNode will be up-to-date with
> the current state of the Active.
> Using this properly, however, requires application-side changes: for
> examples, a NodeManager should call {{msync}} before localizing the resources
> for a client, since it received notification of the existence of those
> resources via communicate which is out-of-band to HDFS and thus could
> potentially attempt to localize them prior to the availability of those
> resources on the ObserverNode.
> Until such application-side changes can be made, which will be a longer-term
> effort, we need to provide a mechanism for unchanged clients to utilize the
> ObserverNode without exposing such a client to inconsistencies. This is
> essentially phase 3 of the roadmap outlined in the [design
> document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
> for HDFS-12943.
> The design document proposes some heuristics based on understanding of how
> common applications (e.g. MR) use HDFS for resources. As an initial pass, we
> can simply have a flag which tells a client to call {{msync}} before _every
> single_ read operation. This may seem counterintuitive, as it turns every
> read operation into two RPCs: {{msync}} to the Active following by an actual
> read operation to the Observer. However, the {{msync}} operation is extremely
> lightweight, as it does not acquire the {{FSNamesystemLock}}, and in
> experiments we have found that this approach can easily scale to well over
> 100,000 {{msync}} operations per second on the Active (while still servicing
> approx. 10,000 write op/s). Combined with the fast-path edit log tailing for
> standby/observer nodes (HDFS-13150), this "always msync" approach should
> introduce only a few ms of extra latency to each read call.
> Below are some experimental results collected from experiments which convert
> a normal RPC workload into one in which all read operations are turned into
> an {{msync}}. The baseline is a workload of 1.5k write op/s and 25k read op/s.
> ||Rate Multiplier|2|4|6|8||
> ||RPC Queue Avg Time (ms)|14|53|110|125||
> ||RPC Queue NumOps Avg (k)|51|102|147|177||
> ||RPC Queue NumOps Max (k)|148|269|306|312||
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to between 4x and 6x of the baseline workload, which
> is approx. 100-150k read op/s.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]