[
https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794545#comment-16794545
]
Ayush Saxena commented on HDFS-14211:
-------------------------------------
Hi [~xkrogen]
Had a doubt regarding this. if you could help me up.
What I got from here. This intents to bring up a configuration that ensures the
state that the client has isn't older then the time specified.(msync is not
older than say x milliseconds?)
A negative value, makes this disabled. A zero means always sync up and a value
means check if last implicit mysnc isn't older then the value defined.if it is
older then mysc else proceed.Is there some place we document these configs and
the defaults. As we do for other HDFS configs?
The time that we are comparing with is the implicit interval. What say if the
client had an explicit call just 1 millisecond before? and the interval was 25
ms. Will we be still do an msync again? or is somewhere the explicit mysync
scenario is there, I didn't catch up.
If it isn't there as up Chao said: that {{msync}} still has to go through the
RPC queue on the active NN. That could have been prevented. If the idea is just
msync shouldn't be more than x m.seconds old, rather than implicitly try msync
at least every x m.seconds.
If a client works in a manor say it shall require for most calls an x ms of
msync, but for some it shall require immediate, can't it explicitly call for
the purpose?
Apologies if got completely off track and created noise.:)
> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>
> Key: HDFS-14211
> URL: https://issues.apache.org/jira/browse/HDFS-14211
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Major
> Attachments: HDFS-14211.000.patch, HDFS-14211.001.patch
>
>
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a
> consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a
> client to fetch the latest transaction ID from the Active NN, thereby
> ensuring that subsequent reads from the ObserverNode will be up-to-date with
> the current state of the Active.
> Using this properly, however, requires application-side changes: for
> examples, a NodeManager should call {{msync}} before localizing the resources
> for a client, since it received notification of the existence of those
> resources via communicate which is out-of-band to HDFS and thus could
> potentially attempt to localize them prior to the availability of those
> resources on the ObserverNode.
> Until such application-side changes can be made, which will be a longer-term
> effort, we need to provide a mechanism for unchanged clients to utilize the
> ObserverNode without exposing such a client to inconsistencies. This is
> essentially phase 3 of the roadmap outlined in the [design
> document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
> for HDFS-12943.
> The design document proposes some heuristics based on understanding of how
> common applications (e.g. MR) use HDFS for resources. As an initial pass, we
> can simply have a flag which tells a client to call {{msync}} before _every
> single_ read operation. This may seem counterintuitive, as it turns every
> read operation into two RPCs: {{msync}} to the Active following by an actual
> read operation to the Observer. However, the {{msync}} operation is extremely
> lightweight, as it does not acquire the {{FSNamesystemLock}}, and in
> experiments we have found that this approach can easily scale to well over
> 100,000 {{msync}} operations per second on the Active (while still servicing
> approx. 10,000 write op/s). Combined with the fast-path edit log tailing for
> standby/observer nodes (HDFS-13150), this "always msync" approach should
> introduce only a few ms of extra latency to each read call.
> Below are some experimental results collected from experiments which convert
> a normal RPC workload into one in which all read operations are turned into
> an {{msync}}. The baseline is a workload of 1.5k write op/s and 25k read op/s.
> ||Rate Multiplier|2|4|6|8||
> ||RPC Queue Avg Time (ms)|14|53|110|125||
> ||RPC Queue NumOps Avg (k)|51|102|147|177||
> ||RPC Queue NumOps Max (k)|148|269|306|312||
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to between 4x and 6x of the baseline workload, which
> is approx. 100-150k read op/s.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]