Hello,

I believe the problem reported here is similar to the one described in
https://issues.apache.org/jira/browse/NIFI-7114.

However, a few community members and I haven't been able to reproduce the
issue. Could anyone in a position to easily replicate the issue clarify:
- the exhaustive list of components (processors, controller services,
reporting tasks) running in the NiFi instance
- details of the NiFi setup: OS, Java version, NiFi version,
standalone/cluster installation, secured/unsecured installation
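
To save some back and forth, commands along these lines should capture most
of those details on a Linux host (a rough sketch only; it assumes a standard
tar.gz install with $NIFI_HOME pointing at the NiFi directory):

  head -n 2 /etc/os-release                # OS and release
  java -version 2>&1 | head -n 1           # exact JRE build
  ls $NIFI_HOME/lib/nifi-runtime-*.jar     # jar name carries the NiFi version
  grep -E 'nifi.cluster.is.node|nifi.remote.input.secure' \
      $NIFI_HOME/conf/nifi.properties      # clustered / secured or not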

The only thing that seems common across the occurrences is the Java
version: 8u242. However, I have not been able to reproduce the issue with
this Java version. If someone who can replicate the issue could try
downgrading the Java version and let the community know whether that
changes anything, that would be great.
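
Similarly, for anyone able to reproduce it, the lsof output Joe asked for
further down in the quoted thread can be grabbed with something like the
following (a minimal sketch, assuming a Linux host running a single NiFi
JVM; the pgrep pattern is just an example and may need adjusting):

  # find the main NiFi JVM (not the bootstrap process)
  NIFI_PID=$(pgrep -f org.apache.nifi.NiFi | head -n 1)

  # full listing to attach to a JIRA, plus quick counts
  lsof -p "$NIFI_PID" > /tmp/nifi-lsof.txt
  wc -l /tmp/nifi-lsof.txt                 # total descriptors held
  grep -c CLOSE_WAIT /tmp/nifi-lsof.txt    # sockets stuck in CLOSE_WAIT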

Thanks,
Pierre


On Thu, Feb 6, 2020 at 5:41 PM Mike Thomsen <[email protected]> wrote:

> My setup was very similar, but didn't have the site to site reporting.
>
> On Thu, Feb 6, 2020 at 5:13 PM Joe Witt <[email protected]> wrote:
>
> > yeah will investigate
> >
> > thanks
> >
> > On Thu, Feb 6, 2020 at 4:49 PM Ryan Hendrickson <
> > [email protected]> wrote:
> >
> > > Joe,
> > >    We're running:
> > >
> > >    - OpenJDK Java 1.8.0_242
> > >    - NiFi 1.11.0
> > >    - CentOS Linux 7.7.1908
> > >
> > >
> > >    We're seeing this across a dozen NiFis with the same setup.  To
> > > reproduce the issue: GenerateFlowFile (100GB across a couple million
> > > files) -> Site to Site -> receive the data -> MergeContent.  We had no
> > > issues with this stack:
> > >
> > >    - OpenJDK Java 1.8.0_232
> > >    - NiFi 1.9.2
> > >    - CentOS Linux 7.7.1908
> > >
> > >    Can your team set up a similar stack and test?
> > >
> > > Ryan
> > >
> > > On Thu, Feb 6, 2020 at 4:15 PM Joe Witt <[email protected]> wrote:
> > >
> > > > received a direct reply - Elli cannot share.
> > > >
> > > > I think unless someone else is able to replicate the behavior,
> > > > there isn't much more we can tackle on this.
> > > >
> > > > Thanks
> > > >
> > > > On Thu, Feb 6, 2020 at 4:10 PM Joe Witt <[email protected]> wrote:
> > > >
> > > > > Yes Elli, it is possible.  Can we please get those lsof outputs
> > > > > in a JIRA, as well as more details about the configuration?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Thu, Feb 6, 2020 at 2:44 PM Andy LoPresto
> > > > > <[email protected]> wrote:
> > > > >
> > > > >> I have no input on the specific issue you’re encountering, but a
> > > > >> pattern we have seen to reduce the overhead of multiple remote
> > > > >> input ports being required is to use a “central” remote input port
> > > > >> and immediately follow it with a RouteOnAttribute to distribute
> > > > >> specific flowfiles to the appropriate downstream flow / process
> > > > >> group. Whatever sends data to this port can use an UpdateAttribute
> > > > >> to add some “tracking/routing” attribute on the flowfiles before
> > > > >> being sent. Inserting Merge/Split will likely affect your timing
> > > > >> due to waiting for bins to fill, depending on your volume. S2S is
> > > > >> pretty good at transmitting data on-demand with low overhead on one
> > > > >> port; it’s when you have many remote input ports that there is
> > > > >> substantial overhead.
> > > > >>
> > > > >>
> > > > >> Andy LoPresto
> > > > >> [email protected]
> > > > >> [email protected]
> > > > >> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> > > > >>
> > > > >> > On Feb 6, 2020, at 2:34 PM, Elli Schwarz
> > > > >> > <[email protected]> wrote:
> > > > >> >
> > > > >> > We ran that command - it appears to be the site-to-sites that
> > > > >> > are causing the issue. We had a lot of remote process groups that
> > > > >> > weren't even being used (no data was being sent to that part of
> > > > >> > the dataflow), yet when running the lsof command they each had a
> > > > >> > large number of open files - almost 2k! - showing CLOSE_WAIT.
> > > > >> > Again, there were no flowfiles being sent to them, so could it be
> > > > >> > some kind of bug where keeping a remote process group open
> > > > >> > somehow opens files and doesn't close them? (BTW, the reason we
> > > > >> > had to upgrade from 1.9.2 to 1.11.0 was that we had upgraded our
> > > > >> > Java version and that caused an IllegalBlockingModeException - is
> > > > >> > it possible that whatever fixed that problem is now causing an
> > > > >> > issue with open files?)
> > > > >> >
> > > > >> > We have now disabled all of the unused remote process groups. We
> > > > >> > still have several remote process groups that we are using, so if
> > > > >> > this is the issue it might be difficult to avoid, but at least we
> > > > >> > decreased the number of remote process groups we have. Another
> > > > >> > approach we are trying is a MergeContent before we send to the
> > > > >> > NiFi having the most issues, so that fewer flowfiles are sent at
> > > > >> > once over site-to-site, and then a split after they are received.
> > > > >> > Thank you!
> > > > >> >
> > > > >> >    On Thursday, February 6, 2020, 2:19:48 PM EST, Mike Thomsen
> > > > >> >    <[email protected]> wrote:
> > > > >> >
> > > > >> > Can you share a description of your flows in terms of average
> > > > >> > flowfile size, queue size, data velocity, etc.?
> > > > >> > Thanks,
> > > > >> > Mike
> > > > >> >
> > > > >> > On Thu, Feb 6, 2020 at 1:59 PM Elli Schwarz
> > > > >> > <[email protected]> wrote:
> > > > >> >
> > > > >> >  We seem to be experiencing the same problems. We recently
> > > > >> > upgraded several of our NiFis from 1.9.2 to 1.11.0, and now many
> > > > >> > of them are failing with "too many open files". Nothing else
> > > > >> > changed other than the upgrade, and our data volume is the same
> > > > >> > as before. The only solution we've been able to come up with is
> > > > >> > to run a script to check for this condition and restart NiFi.
> > > > >> > Any other ideas?
> > > > >> > Thank you!
> > > > >> >
> > > > >> >     On Sunday, February 2, 2020, 9:11:34 AM EST, Mike Thomsen
> > > > >> >     <[email protected]> wrote:
> > > > >> >
> > > > >> >  Without further details, this is what I did to see if it was
> > > > >> > something other than the usual issue of not having enough file
> > > > >> > handles available - something like a legitimate case of someone
> > > > >> > forgetting to close file objects in the code itself.
> > > > >> >
> > > > >> > 1. Set up an 8-core/32GB VM on AWS w/ Amazon AMI.
> > > > >> > 2. Pushed 1.11.1RC1
> > > > >> > 3. Pushed the RAM settings to 6/12GB
> > > > >> > 4. Disabled flowfile archiving because I only allocated 8GB of
> > > > >> > storage.
> > > > >> > 5. Set up a flow that used 2 GenerateFlowFile instances to
> > > > >> > generate massive amounts of garbage data using all available
> > > > >> > cores. (All queues were set up to hold 250k flowfiles.)
> > > > >> > 6. Kicked it off and let it run for probably about 20 minutes.
> > > > >> >
> > > > >> > No apparent problem with closing and releasing resources here.
> > > > >> >
> > > > >> > On Sat, Feb 1, 2020 at 8:00 AM Joe Witt <[email protected]>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> these are usually very easy to find.
> > > > >> >>
> > > > >> >> run lsof -p <pid> and share the results
> > > > >> >>
> > > > >> >>
> > > > >> >> thanks
> > > > >> >>
> > > > >> >> On Sat, Feb 1, 2020 at 7:56 AM Mike Thomsen
> > > > >> >> <[email protected]> wrote:
> > > > >> >>
> > > > >> >>>
> > > > >> >>> https://stackoverflow.com/questions/59991035/nifi-1-11-opening-more-than-50k-files/60017064#60017064
> > > > >> >>>
> > > > >> >>> No idea if this is valid or not. I asked for clarification to
> > > > >> >>> see if there might be a specific processor or something that is
> > > > >> >>> triggering this.
> > > > >> >>>
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >>
> > > >
> > >
> >
>
