[ 
https://issues.apache.org/jira/browse/AMBARI-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hurley updated AMBARI-16913:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Web Client Requests Handled By Jetty Should Not Be Blocked By JMX Property 
> Providers
> ------------------------------------------------------------------------------------
>
>                 Key: AMBARI-16913
>                 URL: https://issues.apache.org/jira/browse/AMBARI-16913
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-16913.patch
>
>
> Incoming requests from the web client (or from any REST API) will eventually 
> be routed to the property provider / subresource framework. It is here where 
> any JMX data is queried within the context of the REST request. In large 
> clusters, these requests can back up quite easily (even with a massive 
> thread pool), causing UX degradation in the web client:
> {code}
> Thread [qtp-ambari-client-38]
>       JMXPropertyProvider(ThreadPoolEnabledPropertyProvider).populateResources(Set<Resource>, Request, Predicate) line: 168
>       JMXPropertyProvider.populateResources(Set<Resource>, Request, Predicate) line: 156
>       StackDefinedPropertyProvider.populateResources(Set<Resource>, Request, Predicate) line: 200
>       ClusterControllerImpl.populateResources(Type, Set<Resource>, Request, Predicate) line: 155
>       QueryImpl.queryForResources() line: 407
>       QueryImpl.execute() line: 217
>       ReadHandler.handleRequest(Request) line: 69
>       GetRequest(BaseRequest).process() line: 145
> {code}
> Consider one of the calls made by the web client:
> {code}
> GET api/v1/clusters/c1/components/?
> ServiceComponentInfo/category=MASTER&
> fields=
> ServiceComponentInfo/service_name,
> host_components/HostRoles/display_name,
> host_components/HostRoles/host_name,
> host_components/HostRoles/state,
> host_components/HostRoles/maintenance_state,
> host_components/HostRoles/stale_configs,
> host_components/HostRoles/ha_state,
> host_components/HostRoles/desired_admin_state,
> host_components/metrics/jvm/memHeapUsedM,
> host_components/metrics/jvm/HeapMemoryMax,
> host_components/metrics/jvm/HeapMemoryUsed,
> host_components/metrics/jvm/memHeapCommittedM,
> host_components/metrics/mapred/jobtracker/trackers_decommissioned,
> host_components/metrics/cpu/cpu_wio,
> host_components/metrics/rpc/client/RpcQueueTime_avg_time,
> host_components/metrics/dfs/FSNamesystem/*,
> host_components/metrics/dfs/namenode/Version,
> host_components/metrics/dfs/namenode/LiveNodes,
> host_components/metrics/dfs/namenode/DeadNodes,
> host_components/metrics/dfs/namenode/DecomNodes,
> host_components/metrics/dfs/namenode/TotalFiles,
> host_components/metrics/dfs/namenode/UpgradeFinalized,
> host_components/metrics/dfs/namenode/Safemode,
> host_components/metrics/runtime/StartTime
> {code}
> This query essentially says: for every {{MASTER}} component, retrieve its 
> metrics. The problem is that a large cluster could have 100 masters, yet the 
> metrics being asked for only apply to the NameNode. As a result, the JMX 
> endpoints for all 100 masters are queried - *live* - as part of the request.
> There are two inherent flaws with this approach:
> - Even with millisecond JMX response times, multiplying that by hundreds of 
> endpoints and adding parsing overhead causes a noticeable delay in the web 
> client, since the federated requests block the main UX request.
> - Although there is a thread pool which scales up to service these requests, 
> that only really works for one user. With multiple users logged in, you would 
> need hundreds upon hundreds of threads pulling in the same JMX data.
> This data should never be queried directly as part of incoming REST requests. 
> Instead, an autonomous pool of threads should constantly retrieve these 
> point-in-time metrics and update a cache. The cache is then used to service 
> all live REST requests (see the sketch after this list):
> - On the first request to a resource, a cache miss occurs and no data is 
> returned. I think this is acceptable since metrics take a few moments to 
> populate anyway right now. As the web client polls, the next request should 
> pick up the newly cached metrics.
> - Only URLs which are being asked for by incoming REST requests should be 
> considered for retrieval. After some time, if they haven't been requested, 
> the headless thread pool can stop trying to update their data.
> - All JMX data will be parsed and stored in memory, in an expiring cache.
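> A minimal sketch of the intended shape of this change, assuming a scheduled 
> executor and a simple in-memory map; the class and method names below are 
> illustrative only and are not Ambari's actual API:
> {code}
> import java.util.Collections;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> 
> // Hypothetical sketch: a headless refresh pool keeps an expiring JMX cache
> // warm so that REST request threads never perform live JMX calls.
> public class JmxMetricCacheSketch {
>   // JMX endpoint URL -> parsed metric values, maintained by the refresh pool
>   private final ConcurrentMap<String, Map<String, Object>> cache =
>       new ConcurrentHashMap<>();
> 
>   // endpoint URL -> timestamp of the last REST request that asked for it
>   private final ConcurrentMap<String, Long> lastRequested =
>       new ConcurrentHashMap<>();
> 
>   // illustrative values; real retention/refresh intervals would be configurable
>   private static final long RETENTION_MS = TimeUnit.MINUTES.toMillis(5);
> 
>   private final ScheduledExecutorService refreshPool =
>       Executors.newScheduledThreadPool(4);
> 
>   public JmxMetricCacheSketch() {
>     refreshPool.scheduleWithFixedDelay(this::refreshAll, 0, 10, TimeUnit.SECONDS);
>   }
> 
>   /** Called from the property provider inside a REST request; never blocks on JMX. */
>   public Map<String, Object> getMetrics(String jmxUrl) {
>     lastRequested.put(jmxUrl, System.currentTimeMillis());
>     // first request is a cache miss and returns no data; a later poll picks it up
>     return cache.get(jmxUrl);
>   }
> 
>   private void refreshAll() {
>     long now = System.currentTimeMillis();
>     for (Map.Entry<String, Long> entry : lastRequested.entrySet()) {
>       String url = entry.getKey();
>       if (now - entry.getValue() > RETENTION_MS) {
>         // nobody has asked for this endpoint recently; stop refreshing and expire it
>         lastRequested.remove(url);
>         cache.remove(url);
>       } else {
>         cache.put(url, fetchAndParseJmx(url));
>       }
>     }
>   }
> 
>   private Map<String, Object> fetchAndParseJmx(String jmxUrl) {
>     // placeholder for the real HTTP GET + JSON parse of the JMX endpoint
>     return Collections.emptyMap();
>   }
> }
> {code}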



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)