Erik Krogen created HDFS-14211: ---------------------------------- Summary: [Consistent Observer Reads] Allow for configurable "always msync" mode Key: HDFS-14211 URL: https://issues.apache.org/jira/browse/HDFS-14211 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: Erik Krogen
To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a client to fetch the latest transaction ID from the Active NN, thereby ensuring that subsequent reads from the ObserverNode will be up-to-date with the current state of the Active. Using this properly, however, requires application-side changes: for examples, a NodeManager should call {{msync}} before localizing the resources for a client, since it received notification of the existence of those resources via communicate which is out-of-band to HDFS and thus could potentially attempt to localize them prior to the availability of those resources on the ObserverNode. Until such application-side changes can be made, which will be a longer-term effort, we need to provide a mechanism for unchanged clients to utilize the ObserverNode without exposing such a client to inconsistencies. This is essentially phase 3 of the roadmap outlined in the [design document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf] for HDFS-12943. The design document proposes some heuristics based on understanding of how common applications (e.g. MR) use HDFS for resources. As an initial pass, we can simply have a flag which tells a client to call {{msync}} before _every single_ read operation. This may seem counterintuitive, as it turns every read operation into two RPCs: {{msync}} to the Active following by an actual read operation to the Observer. However, the {{msync}} operation is extremely lightweight, as it does not acquire the {{FSNamesystemLock}}, and in experiments we have found that this approach can easily scale to well over 100,000 {{msync}} operations per second on the Active (while still servicing approx. 10,000 write op/s). Combined with the fast-path edit log tailing for standby/observer nodes (HDFS-13150), this "always msync" approach should introduce only a few ms of extra latency to each read call. Below are some experimental results collected from experiments which convert a normal RPC workload into one in which all read operations are turned into an {{msync}}. The baseline is a workload of 1.5k write op/s and 25k read op/s. ||Rate Multiplier|2|4|6|8|| ||RPC Queue Avg Time (ms)|14.2|53.2|110.4|125.3|| ||RPC Queue NumOps Avg (k)|51.4|102.3|147.8|177.9|| ||RPC Queue NumOps Max (k)|148.8|269.5|306.3|312.4|| Results are promising up to between 4x and 6x of the baseline workload, which is approx. 100-150k read op/s. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org