Thanks for the clarifying information. Setting nifi.cluster.node.read.timeout=30 sec seems to have alleviated the problem.
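For reference, the relevant cluster timeout entries in nifi.properties now look like this (the replication claim timeout is unchanged from what was quoted below):

    # nifi.properties - cluster timeouts after the change
    nifi.cluster.node.read.timeout=30 sec
    nifi.cluster.request.replication.claim.timeout=30 sec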
It was determined that a relatively long time is spent performing the authorizations for each Provenance Event after choosing Global Menu -> Data Provenance. In this case, the Provenance Query Thread authorizes "Data for ..." for each processor, and each such authorization takes approximately 0.5-0.6 ms. (Timing was taken with the custom authorization logic disabled.) I have not yet determined whether this authorization is performed for ALL Provenance Events, or only for the 1,000 events to which the UI limits the display.

I have also noted that all authorizations are handled by a single Provenance Query Thread despite the property nifi.provenance.repository.query.threads=2. I assume this property allows more threads for simultaneous client requests, but each individual request uses only a single thread.

Finally, it was determined that GC is not a significant factor. The JVM is spending approximately 10% of its time performing GC, but none of it in full GCs, and the duration of any single collection is reasonable (approx. 0.5 sec).
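As a rough back-of-the-envelope check (the 100,000-event count below is purely hypothetical; only the ~0.5-0.6 ms per-event cost and the 1,000-event UI limit come from the measurements above), the per-event authorization cost only becomes a timeout problem if it is applied to far more events than the UI displays:

    // Back-of-envelope estimate only. The per-event cost is the ~0.6 ms upper
    // bound observed with the custom authorization logic disabled; the
    // 100,000-event case is hypothetical.
    public class ProvenanceAuthEstimate {
        public static void main(String[] args) {
            final double msPerEvent = 0.6;      // observed cost per authorization
            final int uiLimit = 1_000;          // events the UI displays
            final int allEvents = 100_000;      // hypothetical count if ALL events are authorized

            System.out.printf("Authorize %,d events: ~%.1f s%n", uiLimit, uiLimit * msPerEvent / 1000.0);
            System.out.printf("Authorize %,d events: ~%.1f s%n", allEvents, allEvents * msPerEvent / 1000.0);
            // ~0.6 s vs. ~60 s: only the latter would blow past a 10-30 sec read
            // timeout, which is why it matters whether authorization covers all
            // query hits or just the 1,000 shown.
        }
    }

If only the displayed 1,000 events are authorized, the base authorization cost by itself would not come close to a 10 sec timeout, which is why I want to pin down which of the two cases applies (and how much the custom authorization logic adds on top of the ~0.6 ms baseline).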
On Mon, Nov 20, 2017 at 4:11 PM, Mark Payne <[email protected]> wrote:

> Mark,
>
> By and large, when you run into issues with timeouts on cluster
> replication, in my experience, the culprit is usually Garbage Collection.
> So it may be that you are not thread-limited or CPU-limited, or
> resource-limited at all, just that garbage collection is kicking in at an
> inopportune time. In such a situation, my suggestion would be to use a
> nifi.cluster.node.read.timeout of say 30 seconds instead of 10, and to
> look into how the garbage collection is performing on your system.
>
> I have answered specific questions below, though, in case they are helpful.
>
> Thanks
> -Mark
>
>
> > On Nov 20, 2017, at 3:25 PM, Mark Bean <[email protected]> wrote:
> >
> > We are seeing cases where a user attempts to query provenance on a cluster.
> > One or more Nodes may not respond to the request in a timely manner, and is
> > then subsequently disconnected from the cluster. The nifi-app.log shows log
> > messages similar to:
> >
> > ThreadPoolRequestReplicator Failed to replicate request POST
> > /nifi-api/provenance to {host:port} due to
> > com.sun.jersey.api.client.ClientHandlerException:
> > java.net.SocketTimeoutException: Read timed out
> > NodeClusterCoordinator The following nodes failed to process URI
> > /nifi-api/provenance '{list of one or more nodes}'. Requesting each node
> > disconnect from cluster.
> >
> > We have implemented a custom authorizer. For certain policies, additional
> > authorization checking is performed. Provenance is one such policy which
> > performs additional checking. It is surprising that the process is taking
> > so long as to time out the request. Currently, timeouts are set as:
> > nifi.cluster.node.read.timeout=10 sec
> > nifi.cluster.request.replication.claim.timeout=30 sec
> >
> > This leads me to believe we are thread-limited, not CPU-limited.
> >
> > In this scenario, what threads are involved? Would
> > nifi.cluster.node.protocol.threads (or .max.threads) be limiting the
> > processing of such api calls?
>
> >>> These are the jetty threads that are involved on the 'receiving' side,
> and the nifi.cluster.node.protocol.threads on the client side.
>
> > Is the api provenance request(s) limited by
> > nifi.provenance.repository.query.threads?
>
> >>> These query threads are background threads that are used to populate
> the results of the query. Client requests will not block on those results.
>
> > Are there other thread-related properties we should be looking at?
>
> >>> I don't think so. I can't think of any off of the top of my head, anyway.
>
> > Are thread properties (such as nifi.provenance.repository.query.threads)
> > counted against the total threads given by nifi.web.jetty.threads?
>
> >>> No, these are separate thread pools.
>
> > Thanks,
> > Mark
