I started by looking at the Remote Process Group side of the connection, which
is reporting "Pipe closed". Once that happens, the site-to-site connection
appears to stay in this state permanently. An error of this type should
trigger a restart of the connection to the remote port, but I'm not sure where
the best place to do that is.

Would this be in the StandardRemoteGroupPort? Or maybe within the transaction 
itself (AbstractTransaction)? I want to avoid constantly recreating a 
connection when not required since that could cause a performance impact.
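
To make the question concrete, here is a minimal sketch of the behavior I have
in mind. It is not NiFi code: ReconnectingSender, Connection, and the
minimum reconnect interval are all hypothetical names used for illustration
only. The idea is to reuse a healthy connection, discard it only after an
IOException such as "Pipe closed", and throttle how often a replacement is
created.

```java
import java.io.IOException;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types are NiFi classes. */
final class ReconnectingSender {

    /** Stand-in for a site-to-site connection (assumed shape). */
    interface Connection {
        void send(byte[] payload) throws IOException;
        void close();
    }

    private final Supplier<Connection> factory;
    private final long minReconnectIntervalMillis;
    private final LongSupplier clock;

    private Connection current;
    private long lastConnectMillis;
    private boolean hasConnectedBefore;

    ReconnectingSender(Supplier<Connection> factory,
                       long minReconnectIntervalMillis,
                       LongSupplier clock) {
        this.factory = factory;
        this.minReconnectIntervalMillis = minReconnectIntervalMillis;
        this.clock = clock;
    }

    /** Sends on the existing connection, creating one only if none is held. */
    void send(byte[] payload) throws IOException {
        if (current == null) {
            connectIfAllowed();
        }
        try {
            current.send(payload);
        } catch (IOException e) {
            // Treat any I/O failure (e.g. "Pipe closed") as fatal for this
            // connection: drop it so the next call builds a fresh one.
            current.close();
            current = null;
            throw e;
        }
    }

    /** Creates a connection unless the previous one was created too recently. */
    private void connectIfAllowed() throws IOException {
        long now = clock.getAsLong();
        if (hasConnectedBefore
                && now - lastConnectMillis < minReconnectIntervalMillis) {
            throw new IOException("Reconnect throttled; retry later");
        }
        current = factory.get();
        lastConnectMillis = now;
        hasConnectedBefore = true;
    }
}
```

With this shape, the cost of creating a connection is paid only after a
failure, and at most once per interval, which is the trade-off I am asking
about.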

Thanks,
Mark

On 2021/07/08 18:04:14, Mark Bean <mark.o.b...@gmail.com> wrote: 
> We're seeing some odd behavior using site-to-site. The input port on a
> 3-node cluster will eventually stop receiving new data. In the log, I see
> the following:
> 
> 2021-07-08 13:13:14,010 ERROR [NiFi Web Server-43017]
> o.a.nifi.web.api.ApplicationResource Exception detail:
> org.apache.nifi.processor.exception.ProcessException:
> java.lang.InterruptedException
>         at
> org.apache.nifi.remote.StandardPublicPort.receiveFlowFiles(StandardPublicPort.java:588)
>         at
> org.apache.nifi.web.api.DataTransferResource.receiveFlowFiles(DataTransferResource.java:277)
> ...
> 
> Then many more similar messages:
> 2021-07-08 13:13:14,015 ERROR [NiFi Web Server-47691]
> o.a.nifi.web.api.ApplicationResource Exception detail:
> org.apache.nifi.processor.exception.ProcessException:
> org.apache.nifi.processor.exception.ProcessException: Interrupted while
> waiting for site-to-site request to be serviced
>         at
> org.apache.nifi.remote.StandardPublicPort.receiveFlowFiles(StandardPublicPort.java:588)
>         at
> org.apache.nifi.web.api.DataTransferResource.receiveFlowFiles(DataTransferResource.java:277)
> ...
> 
> It's unclear what is causing the exception (possibly some network
> instability), but the only way we have been able to get data flowing again
> is to restart the NiFi node. Even more concerning is that when NiFi is
> restarted, there are many thousands of messages indicating:
> 
> 2021-07-08 15:29:12,097 INFO [main] o.a.n.c.repository.FileSystemRepository
> Found unknown file /cont_repo/content/336/1625700387433-161104 (1333153
> bytes) in File System Repository; removing file
> 
> I suspect the failed site-to-site transfer finished writing data (content)
> to disk, but was interrupted before creating a flowfile and committing the
> Process Session. If this is true, the repo could fill with data that will
> not get cleaned up until NiFi is restarted.
> 
> I'm looking for someone with detailed knowledge of the internals of
> site-to-site to comment on this issue - either the hard stop on receiving
> additional data via site-to-site, or the orphaned content.
> 
> NiFi Version: 1.12.1
> 
> Thanks,
> Mark
> 
