Re: Potential side-effect of connector code to JM/TM

Yingjie Cao Wed, 18 Dec 2019 00:55:56 -0800

I'd like to do that.

Best,
Yingjie


On Wed, Dec 18, 2019 at 4:48 PM, Till Rohrmann <[email protected]> wrote:

> I think we should add this checklist to the coding guidelines and continue
> extending it there. Do you want to update the coding guidelines accordingly,
> Yingjie?
>
> Cheers,
> Till
>
> On Wed, Dec 18, 2019 at 8:21 AM Yingjie Cao <[email protected]>
> wrote:
>
> > Hi Till & Biao,
> >
> > Thanks for the reply.
> >
> > I agree that supplying some stress or stability tests can really help.
> > Besides the JVM resource leak mentioned above, there may be other types
> > of resource leaks, such as slot or network buffer leaks. In addition,
> > other tests can also help: triggering failover in various ways, stressing
> > the system with high-parallelism, heavy-load jobs, and running jobs or
> > triggering failover over and over again. I think stress and stability
> > testing is a big topic, and resource leak checking can be a good start.
> >
> > As a start for resource leak checking, we may need to collect a checklist,
> > which can also help to troubleshoot resource leak problems manually. From
> > my previous experience, I can think of the following items:
> > 1. The File#deleteOnExit hook leaks the file path string. The Flink REST
> > server once suffered from this problem; it has since been fixed.
> > 2. Thread leaks. OrcInputFormat suffers from this.
> > 3. ApplicationShutdownHooks referencing user classes.
> > 4. ClassLoader#parallelLockMap may leak because of too many generated
> > classes. Flink also suffers from this problem; the issue is reported in
> > FLINK-15024 and still needs to be resolved.
> > 5. Some other static fields (like caches implemented as maps) of classes
> > loaded by the system class loader can also potentially leak resources.
> >
> > Any additions to this checklist are welcome. Even with this checklist,
> > the check may not be trivial to do; dumping and analysing the heap may be
> > one option. I will investigate that further.
> >
> > Best,
> > Yingjie
> >
> > On Tue, Dec 17, 2019 at 9:02 PM, Biao Liu <[email protected]> wrote:
> >
> > > Hi Yingjie,
> > >
> > > Thanks for tracking down this impressive bug and starting this
> > > discussion.
> > >
> > > I'm afraid there is no silver bullet for isolation from third-party
> > > libraries. However, I agree that resource checking utilities might help.
> > > It seems that you and Till have already raised some feasible ideas.
> > > Resource leaks seem to be quite common. It would be great if someone
> > > could share some experience. I will keep an eye on this discussion.
> > >
> > > Thanks,
> > > Biao /'bɪ.aʊ/
> > >
> > >
> > >
> > > On Tue, 17 Dec 2019 at 20:27, Till Rohrmann <[email protected]> wrote:
> > >
> > > > Hi Yingjie,
> > > >
> > > > thanks for reporting this issue and starting this discussion. When we
> > > > are dealing with third-party libraries, I believe there is always the
> > > > risk that one overlooks closing resources. Ideally, we make such
> > > > mistakes as hard as possible from Flink's side, but realistically they
> > > > are hard to avoid completely. Hence, I believe it would be beneficial
> > > > to have some tooling (e.g. stress tests) which could help to surface
> > > > these kinds of problems. Maybe one could automate it so that a dev only
> > > > needs to provide a user jar, which is then executed several times while
> > > > the cluster is checked for anomalies.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Dec 17, 2019 at 8:43 AM Yingjie Cao <[email protected]> wrote:
> > > >
> > > > > Hi community,
> > > > >
> > > > > After running the TPC-DS test suite for several days on a session
> > > > > cluster, we found a resource leak in OrcInputFormat, which was
> > > > > reported in FLINK-15239. The problem comes from a third-party
> > > > > dependency which creates a new internal thread (pool) and never
> > > > > releases it. As a result, the user class loader referenced by these
> > > > > threads will never be garbage collected, nor will the other classes
> > > > > loaded by that class loader. This leads to continual growth of the
> > > > > metaspace size for the JM (AM), whose metaspace size is currently not
> > > > > limited. For the TM, whose metaspace size is limited, it will
> > > > > eventually result in a metaspace OOM. I am not sure whether any other
> > > > > connectors/input formats have a similar problem.
> > > > >
> > > > > In general, it is hard for Flink to restrict the behavior of
> > > > > third-party dependencies, especially the dependencies of connectors.
> > > > > However, it would be better if we could supply some mechanism like
> > > > > stronger isolation, or some test facilities to find potential
> > > > > problems. For example, we could run jobs on a cluster and
> > > > > automatically check things like whether the user class loader can be
> > > > > garbage collected, whether there is a thread leak, whether any
> > > > > shutdown hooks have been registered, and so on.
> > > > >
> > > > > What do you think? Or should we treat it as a problem?
> > > > >
> > > > > Best,
> > > > > Yingjie
> > > > >
> > > >
> > >
> >
>
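
For illustration, two of the automated checks proposed in the original post
(thread leaks, and whether the user class loader can be garbage collected)
could be sketched with plain JDK APIs roughly as below. This is only a
minimal sketch and not Flink code: the class LeakChecks, its method names and
the "leaky-pool-thread" stand-in for a third-party library are all made up
for this example. System.gc() is only a hint to the JVM, so the class loader
check is best effort.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Flink.
final class LeakChecks {

    /** Snapshot of all threads that are currently alive in this JVM. */
    static Set<Thread> liveThreads() {
        return new HashSet<>(Thread.getAllStackTraces().keySet());
    }

    /** Threads that are alive now but were not in the earlier snapshot. */
    static Set<Thread> newThreadsSince(Set<Thread> before) {
        Set<Thread> current = liveThreads();
        current.removeAll(before);
        return current;
    }

    /** Best-effort check: requests GC repeatedly until the referent is gone. */
    static boolean isCollected(WeakReference<?> ref) throws InterruptedException {
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc(); // only a hint, the JVM may ignore it
            Thread.sleep(20);
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws Exception {
        Set<Thread> before = liveThreads();

        // Stand-in for the user class loader of a job.
        URLClassLoader userLoader = new URLClassLoader(new URL[0], null);
        WeakReference<ClassLoader> loaderRef = new WeakReference<>(userLoader);

        // Stand-in for a library thread that is started and never stopped.
        // Its context class loader keeps the user class loader reachable.
        Thread pool = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
                // exit the thread when interrupted
            }
        }, "leaky-pool-thread");
        pool.setDaemon(true);
        pool.setContextClassLoader(userLoader);
        pool.start();

        userLoader = null; // the "job" has finished and dropped its reference
        System.out.println("new threads: " + newThreadsSince(before));
        System.out.println("collected while thread is alive: " + isCollected(loaderRef));

        // Once the thread is stopped, the class loader should become collectable.
        pool.interrupt();
        pool.join();
        pool = null;
        System.out.println("collected after thread stopped: " + isCollected(loaderRef));
    }
}
```

A test harness could take the thread snapshot before submitting a job and
compare after the job finishes. Unrelated JVM threads may also start in the
meantime, so in practice the result would probably need filtering, for
example by thread name or context class loader.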
