Please use this link: https://gist.github.com/imamitjain/5ab84c2d9eaf06615ad912506a08f7e2
On Thu, Apr 19, 2018 at 10:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Amit:
The execution plan attachment didn't come through.

Please consider using a third-party website for storing the plan.

FYI

On Thu, Apr 19, 2018 at 10:04 AM, Amit Jain <aj201...@gmail.com> wrote:

@Ufuk Please find the execution plan in the attachment.

@Nico The job is not making progress at all. This issue happens randomly. A few of our jobs work with only a few MB of data, and they still get stuck even though each TM has 22 GB with 2 slots per TM.

I've started using 1.5 and am facing a few issues, which I'm communicating with the community these days. However, this issue seems to be solved there :-) Do you guys have a timeline for the 1.5 release?

--
Thanks,
Amit

On Fri, Apr 6, 2018 at 10:40 PM, Ufuk Celebi <u...@apache.org> wrote:

Hey Amit!

Thanks for posting this here. I don't think it's an issue of the buffer pool per se. Instead, I think there are two potential causes here:

1. The generated flow doesn't use blocking intermediate results for a branching-joining flow.
=> We can check this if you run and post the output of `StreamExecutionEnvironment#getExecutionPlan()` here. Can you please post the result?

2. The blocking intermediate result is used, but there is an issue with its implementation.
=> Depending on the output of 1, we can investigate this option.

As Fabian mentioned, running this with a newer version of Flink will be very helpful. If the problem still persists, that will also make it more likely that the issue gets fixed faster. ;-)

– Ufuk

On Fri, Apr 6, 2018 at 5:43 AM, Nico Kruber <n...@data-artisans.com> wrote:

I'm not aware of any changes regarding the blocking buffer pools, though.

Is it really stuck, or just making progress slowly? (You can check the number of records sent/received in the Web UI.)

Anyway, this may also simply mean that the task is back-pressured, depending on how the operators are wired together. In that case, all available buffers of that ResultPartition have been used (records were serialized into them) and are now waiting on Netty to send them, or on a SpillingSubpartition to spill data to disk.

Please also check for warnings or errors in the affected TaskManager's log files.

If you can reproduce the problem, could you try reducing your program to a minimal working example and provide it here for further debugging?

Thanks
Nico

On 04/04/18 23:00, Fabian Hueske wrote:

Hi Amit,

The network stack has been redesigned for the upcoming Flink 1.5 release. The issue might have been fixed by that.

There's already a first release candidate for Flink 1.5.0 available [1]. It would be great if you had the chance to check whether the bug is still present.

Best, Fabian

[1] https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E

2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

I searched for 0x00000005e28fe218 in the two files you attached to FLINK-2685 but didn't find any hit.

Was this the same instance as the attachment to FLINK-2685?

Thanks

On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:

+u...@flink.apache.org

On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:

Hi,

We are hitting a TaskManager deadlock on a NetworkBufferPool bug in Flink 1.3.2. We have a set of ETL merge jobs for a number of tables, and they get stuck with the above issue randomly every day.

I'm attaching the thread dumps of the JobManager and of one of the TaskManagers (T1) running a stuck job. We also observed that sometimes a new job scheduled on T1 progresses even while another job is stuck there.

"CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501 daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait() [0x00007f9ebf102000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
	- locked <0x00000005e28fe218> (a java.util.ArrayDeque)
	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
	at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
	- locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
	at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
	at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
	at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
	at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
	at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
	at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
	at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
	at java.lang.Thread.run(Thread.java:748)

--
Thanks,
Amit

--
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A, 10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
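The blocking pattern behind the TIMED_WAITING frame in the thread dump above can be sketched as follows. This is a simplified, self-contained illustration of how a bounded buffer pool makes a producer wait on the deque's monitor until a buffer is recycled; `SimpleBufferPool` and its method names are hypothetical stand-ins, not Flink's actual `LocalBufferPool` implementation.

```java
import java.util.ArrayDeque;

// Illustrative bounded buffer pool (NOT Flink's LocalBufferPool): a fixed
// number of buffers is handed out to producers and returned by consumers.
public class SimpleBufferPool {
    private final ArrayDeque<byte[]> availableBuffers = new ArrayDeque<>();

    public SimpleBufferPool(int numBuffers, int bufferSize) {
        for (int i = 0; i < numBuffers; i++) {
            availableBuffers.add(new byte[bufferSize]);
        }
    }

    // Blocks (Object.wait on the deque's monitor) until a buffer is available.
    // This mirrors the "in Object.wait() ... locked <...> (a java.util.ArrayDeque)"
    // frames in the thread dump: the producer sleeps until recycle() notifies it.
    public byte[] requestBufferBlocking() throws InterruptedException {
        synchronized (availableBuffers) {
            while (availableBuffers.isEmpty()) {
                // timed wait so the state shows up as TIMED_WAITING and the
                // emptiness check is re-evaluated periodically
                availableBuffers.wait(2000);
            }
            return availableBuffers.poll();
        }
    }

    // Called when a consumer has finished with a buffer; wakes blocked producers.
    public void recycle(byte[] buffer) {
        synchronized (availableBuffers) {
            availableBuffers.add(buffer);
            availableBuffers.notifyAll();
        }
    }
}
```

If every buffer has been handed out and nothing ever calls `recycle()` — for example, because the downstream side (Netty sender or spilling subpartition) is itself stuck — the producer loops in `wait()` indefinitely, which is exactly the state captured in the dump above.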