[ https://issues.apache.org/jira/browse/AMBARI-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Hurley updated AMBARI-16913:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Web Client Requests Handled By Jetty Should Not Be Blocked By JMX Property Providers
> ------------------------------------------------------------------------------------
>
>                 Key: AMBARI-16913
>                 URL: https://issues.apache.org/jira/browse/AMBARI-16913
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-16913.patch
>
>
> Incoming requests from the web client (or from any REST API) will eventually be routed to the property provider / subresource framework. It is here where any JMX data is queried within the context of the REST request. In large clusters, these requests can back up quite easily (even with a massive thread pool), causing UX degradation in the web client:
> {code}
> Thread [qtp-ambari-client-38]
>   JMXPropertyProvider(ThreadPoolEnabledPropertyProvider).populateResources(Set<Resource>, Request, Predicate) line: 168
>   JMXPropertyProvider.populateResources(Set<Resource>, Request, Predicate) line: 156
>   StackDefinedPropertyProvider.populateResources(Set<Resource>, Request, Predicate) line: 200
>   ClusterControllerImpl.populateResources(Type, Set<Resource>, Request, Predicate) line: 155
>   QueryImpl.queryForResources() line: 407
>   QueryImpl.execute() line: 217
>   ReadHandler.handleRequest(Request) line: 69
>   GetRequest(BaseRequest).process() line: 145
> {code}
> Consider one of the calls made by the web client:
> {code}
> GET api/v1/clusters/c1/components/?ServiceComponentInfo/category=MASTER&fields=
>   ServiceComponentInfo/service_name,
>   host_components/HostRoles/display_name,
>   host_components/HostRoles/host_name,
>   host_components/HostRoles/state,
>   host_components/HostRoles/maintenance_state,
>   host_components/HostRoles/stale_configs,
>   host_components/HostRoles/ha_state,
>   host_components/HostRoles/desired_admin_state,
>   host_components/metrics/jvm/memHeapUsedM,
>   host_components/metrics/jvm/HeapMemoryMax,
>   host_components/metrics/jvm/HeapMemoryUsed,
>   host_components/metrics/jvm/memHeapCommittedM,
>   host_components/metrics/mapred/jobtracker/trackers_decommissioned,
>   host_components/metrics/cpu/cpu_wio,
>   host_components/metrics/rpc/client/RpcQueueTime_avg_time,
>   host_components/metrics/dfs/FSNamesystem/*,
>   host_components/metrics/dfs/namenode/Version,
>   host_components/metrics/dfs/namenode/LiveNodes,
>   host_components/metrics/dfs/namenode/DeadNodes,
>   host_components/metrics/dfs/namenode/DecomNodes,
>   host_components/metrics/dfs/namenode/TotalFiles,
>   host_components/metrics/dfs/namenode/UpgradeFinalized,
>   host_components/metrics/dfs/namenode/Safemode,
>   host_components/metrics/runtime/StartTime
> {code}
> This query essentially says: for every {{MASTER}}, get metrics. The problem is that a large cluster could have 100 masters, yet the metrics being asked for apply only to the NameNode. As a result, the JMX endpoints of all 100 masters are queried - *live* - as part of the request.
> There are two inherent flaws with this approach:
> - Even with millisecond JMX response times, multiplying that by hundreds of hosts and then adding parsing overhead causes a noticeable delay in the web client, since the federated requests block the main UX request.
> - Although there is a thread pool which scales up to service these requests, that only really works for one user. With multiple users logged in, you'd need hundreds upon hundreds of threads pulling in the same JMX data.
> This data should never be queried for directly as part of the incoming REST requests. Instead, an autonomous pool of threads should be constantly retrieving these point-in-time metrics and updating a cache. The cache is then used to service all live REST requests.
> - On the first request to a resource, a cache miss occurs and no data is returned. I think this is acceptable since metrics take a few moments to populate anyway right now. As the web client polls, the next request should pick up the newly cached metrics.
> - Only URLs which are being asked for by incoming REST requests should be considered for retrieval. After some time, if they haven't been requested, the headless thread pool can stop trying to update their data.
> - All JMX data will be parsed and stored in-memory, in an expiring cache.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
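The autonomous-cache design this issue proposes could be sketched roughly as below. This is a hypothetical illustration in plain Java, not code from the attached patch: the class name {{JmxMetricCache}}, the {{fetchJmx}} stand-in, the 60-second interest TTL, and the 5-second refresh interval are all invented assumptions, not Ambari's actual values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only -- not Ambari's actual classes. A headless pool
// polls JMX-style endpoints on its own schedule; REST threads read from the
// cache and never block on a live JMX call. URLs that no REST caller has
// asked about recently are dropped from the refresh set.
class JmxMetricCache {
    private static final long INTEREST_TTL_MS = 60_000;  // assumed: stop refreshing after 60s of no requests

    private static final class Entry {
        volatile String metrics;        // last fetched JMX payload (null until first refresh)
        volatile long lastRequestedMs;  // last time a REST caller asked for this URL
        Entry(long now) { lastRequestedMs = now; }
    }

    private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher = Executors.newScheduledThreadPool(4);

    JmxMetricCache() {
        // Autonomous refresh loop: runs on its own threads, never inside a REST request.
        refresher.scheduleAtFixedRate(this::refreshNow, 5, 5, TimeUnit.SECONDS);
    }

    /** Called by the REST layer: a non-blocking, cache-only read. */
    String getMetrics(String jmxUrl) {
        long now = System.currentTimeMillis();
        Entry e = cache.computeIfAbsent(jmxUrl, k -> new Entry(now));
        e.lastRequestedMs = now;
        return e.metrics;  // null on the very first request = acceptable cache miss
    }

    /** One refresh pass: evict URLs nobody asked about, then re-fetch the rest. */
    void refreshNow() {
        long now = System.currentTimeMillis();
        cache.entrySet().removeIf(en -> now - en.getValue().lastRequestedMs > INTEREST_TTL_MS);
        for (Map.Entry<String, Entry> en : cache.entrySet()) {
            en.getValue().metrics = fetchJmx(en.getKey());
        }
    }

    // Stand-in for a real HTTP GET against the component's /jmx endpoint.
    String fetchJmx(String jmxUrl) {
        return "{\"beans\":[]}";
    }

    void shutdown() { refresher.shutdownNow(); }
}
```

Note how this matches the acceptable-miss behavior described above: the very first read of a URL returns nothing, and the web client's next poll picks up whatever the background pool has since fetched.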