Thanks for the clarifying information. Setting
nifi.cluster.node.read.timeout=30 sec seems to have alleviated the problem.
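
For reference, this is the change as applied in conf/nifi.properties
(previously 10 sec):

    # conf/nifi.properties
    # How long this node waits on a response from another node when
    # replicating a request before considering the request failed.
    nifi.cluster.node.read.timeout=30 sec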

It was determined that a relatively long time is spent performing
authorization for each Provenance Event after choosing Global Menu ->
Data Provenance. In this case, the Provenance Query Thread authorizes "Data
for ..." for each processor. Each such authorization takes approximately
0.5-0.6 ms. (Timing was taken with the custom authorization logic disabled.)
I have not yet determined whether this authorization proceeds for ALL
Provenance Events, or only for the 1,000 events the UI limits the display
to. I have also noted that all authorizations are handled by a single
Provenance Query Thread despite the property
nifi.provenance.repository.query.threads=2.
I assume this property allows more threads for simultaneous client
requests, but each individual request uses only a single thread.
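
For a rough sense of scale, assuming the per-event cost above is
representative: if authorization only covers the 1,000 displayed events,
that is about 1,000 x ~0.6 ms = ~0.6 sec per query, comfortably inside even
a 10 sec read timeout. If instead it runs against every matching event
(say 100,000 of them, purely as a hypothetical), the same per-event cost
works out to ~60 sec, which would blow past the 30 sec timeout as well.
That difference is why I want to pin down which of the two is happening.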

Also, it was determined that GC was not a significant factor. The JVM is
spending approximately 10% of its time performing GC, but none of it in
full GCs, and the duration of any single GC pause is reasonable
(approx. 0.5 sec).
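
For anyone wanting to check the same thing, a quick way to watch GC on a
running node (assuming Java 8, with <pid> being the NiFi JVM's process id)
is jstat:

    # sample GC utilization once per second; FGC is the full-GC count,
    # GCT is the cumulative GC time in seconds
    jstat -gcutil <pid> 1000

Alternatively, GC logging can be turned on by adding arguments in
conf/bootstrap.conf, e.g. java.arg.N=-verbose:gc and
java.arg.N=-XX:+PrintGCDetails (using java.arg indexes that aren't already
taken).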


On Mon, Nov 20, 2017 at 4:11 PM, Mark Payne <[email protected]> wrote:

> Mark,
>
> By and large, when you run into issues with timeouts on cluster
> replication, in my experience, the culprit
> is usually Garbage Collection. So it may be that you are not
> thread-limited or CPU-limited,
> or resource limited at all, just that garbage collection is kicking in at
> an inopportune time. In such a situation,
> my suggestion would be to use a nifi.cluster.node.read.timeout of say 30
> seconds instead of 10, and to
> look into how the garbage collection is performing on your system.
>
> I have answered specific questions below, though, in case they are helpful.
>
> Thanks
> -Mark
>
>
> > On Nov 20, 2017, at 3:25 PM, Mark Bean <[email protected]> wrote:
> >
> > We are seeing cases where a user attempts to query provenance on a
> > cluster. One or more nodes may not respond to the request in a timely
> > manner and are then disconnected from the cluster. The nifi-app.log
> > shows log messages similar to:
> >
> > ThreadPoolRequestReplicator Failed to replicate request POST
> > /nifi-api/provenance to {host:port} due to
> > com.sun.jersey.api.client.ClientHandlerException:
> > java.net.SocketTimeoutException: Read timed out
> > NodeClusterCoordinator The following nodes failed to process URI
> > /nifi-api/provenance '{list of one or more nodes}'. Requesting each node
> > disconnect from cluster.
> >
> > We have implemented a custom authorizer. For certain policies, additional
> > authorization checking is performed. Provenance is one such policy which
> > performs additional checking. It is surprising that the process is taking
> > so long as to time out the request. Currently, timeouts are set as:
> > nifi.cluster.node.read.timeout=10 sec
> > nifi.cluster.request.replication.claim.timeout=30 sec
> >
> > This leads me to believe we are thread-limited, not CPU-limited.
> >
> > In this scenario, what threads are involved? Would
> > nifi.cluster.node.protocol.threads (or .max.threads) be limiting the
> > processing of such api calls?
>
> >>> These are the Jetty threads that are involved on the 'receiving' side,
> and the nifi.cluster.node.protocol.threads on the client side.
>
> >
> > Are the provenance API request(s) limited by
> > nifi.provenance.repository.query.threads?
>
> >>> These query threads are background threads that are used to populate
> the results of the query. Client requests will not block on those results.
>
> >
> > Are there other thread-related properties we should be looking at?
> >
>
> >>> I don't think so. I can't think of any off the top of my head, anyway.
>
> > Are thread properties (such as nifi.provenance.repository.query.threads)
> > counted against the total threads given by nifi.web.jetty.threads?
>
> >>> No, these are separate thread pools.
>
> >
> > Thanks,
> > Mark
>
>
