I forgot to add the checkpoint details after it completed. This is for that long-running checkpoint with id 95632.
> On 2 Aug 2019, at 11:18, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>
> Hi Congxian,
> I was able to fetch the logs of the task manager (attached) and the screenshots of the latest long checkpoint. I will get the logs of the job manager for the next long-running checkpoint, and I will also try to get a jstack during the long-running checkpoint.
>
> Note: Since the Subtasks tab does not show the subtask numbers, and the Details tab of the checkpoint shows the subtask numbers but not the task manager hosts, it is difficult to match them. We are assuming they have the same order, so seeing that the 3rd subtask is failing, I take the 3rd line in the Subtasks tab, which leads to the task manager host flink-taskmanager-84ccd5bddf-2cbxn. ***It would be a great feature if you also included the subtask IDs in the Subtasks view.***
>
> Note: timestamps in the task manager log are in UTC and I am currently in the UTC+3 zone, so 10:30 in the screenshot matches 7:30 in the log.
>
>
> Kind regards,
> Bekir
>
> <task_manager.log>
>
> <PastedGraphic-4.png>
> <PastedGraphic-3.png>
> <PastedGraphic-2.png>
>
>
>> On 2 Aug 2019, at 07:23, Congxian Qiu <qcx978132...@gmail.com> wrote:
>>
>> Hi Bekir,
>> Let me first summarize the problem here (please correct me if I'm wrong):
>> 1. The same program running on 1.6 never encountered such problems.
>> 2. Some checkpoints take far too long (15+ min), while other, normal checkpoints complete in less than 1 min.
>> 3. Some bad checkpoints have a large sync time; the async time seems ok.
>> 4. For some bad checkpoints, the e2e duration is much bigger than (sync_time + async_time).
>> First, to answer the last point: the e2e duration is ack_time - trigger_time, so it is always bigger than (sync_time + async_time), but here we have a big gap, and that may be the problem.
>> Putting all the information together, the problem may be that some tasks start their checkpoint too late and the sync part of the checkpoint takes too long. Could you please share some more information, such as:
>> - A screenshot of the summary for one bad checkpoint (let's call it A here).
>> - The detailed information of checkpoint A (including all the problematic subtasks).
>> - The jobmanager.log, plus the taskmanager.log for the problematic task and for a healthy task.
>> - A screenshot of the subtasks for the problematic task (including the `Bytes received`, `Records received`, `Bytes sent` and `Records sent` columns). The goal is to compare the problematic and the good parallel instances; please also indicate whether there is data skew among the parallel instances.
>> - Some jstacks of the problematic parallel instance, to check whether the task is too busy to handle the barrier (a flame graph or anything similar is always welcome here).
>>
>> Best,
>> Congxian
>>
>>
>> On Thu, 1 Aug 2019 at 20:26, Congxian Qiu <qcx978132...@gmail.com> wrote:
>> Hi Bekir,
>>
>> I'll first comb through all the information here and try to find out the reason with you; I may need you to share some more information :)
>>
>> Best,
>> Congxian
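To make Congxian's point about the end-to-end duration concrete: the e2e time is measured from the JobManager's trigger to the subtask's acknowledgement, so any delay before the barrier even reaches the subtask shows up as a gap on top of sync + async. Below is a minimal sketch of that relationship; the class and variable names are mine (not Flink API) and the numbers are invented for illustration, not taken from the attached logs.

public class CheckpointTimingModel {
    public static void main(String[] args) {
        // Hypothetical numbers for one slow subtask (illustrative only):
        long triggerTime = 0L;              // checkpoint triggered by the JobManager (ms, relative)
        long barrierSeen = 14 * 60_000L;    // barrier reaches the subtask 14 minutes later
        long syncMillis  = 2_000L;          // synchronous snapshot part
        long asyncMillis = 40_000L;         // asynchronous upload part
        long ackTime     = barrierSeen + syncMillis + asyncMillis;

        long endToEnd = ackTime - triggerTime;                 // what the web UI reports as end-to-end duration
        long gap      = endToEnd - (syncMillis + asyncMillis); // time spent before the snapshot even starts
        System.out.printf("end-to-end = %d ms, sync+async = %d ms, unexplained gap = %d ms%n",
                endToEnd, syncMillis + asyncMillis, gap);
    }
}

In this model, a 15-minute e2e duration with a small sync + async time simply means the barrier arrived at the subtask very late, which is why Congxian asks for jstacks to see whether the task was too busy to process it.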
>>
>> On Thu, 1 Aug 2019 at 17:00, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>> Hi Fabian,
>> Thanks for sharing this with us, but we're already on version 1.8.1.
>>
>> What I don't understand is which mechanism in Flink occasionally adds 15 minutes to the checkpoint duration. Can you maybe give us some hints on where to look? Is there a default timeout of 15 minutes defined somewhere in Flink? I couldn't find one.
>>
>> In our pipeline, most of the checkpoints complete in less than a minute, and some of them complete in 15 minutes plus a little extra (under a minute).
>> There is definitely something which adds 15 minutes. This is happening in one or more subtasks during checkpointing.
>>
>> Please see the screenshot below:
>>
>>
>> Regards,
>> Bekir
>>
>>
>>
>>> On 23 Jul 2019, at 16:37, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>> Hi Bekir,
>>>
>>> Another user reported checkpointing issues with Flink 1.8.0 [1].
>>> These seem to be resolved with Flink 1.8.1.
>>>
>>> Hope this helps,
>>> Fabian
>>>
>>> [1] https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E
>>>
>>> On Wed, 17 Jul 2019 at 09:16, Congxian Qiu <qcx978132...@gmail.com> wrote:
>>>
>>>> Hi Bekir,
>>>>
>>>> First of all, I think something is wrong there: the state sizes are almost the same, but the durations differ so much.
>>>>
>>>> A checkpoint with the RocksDB state backend dumps sst files, then copies the needed sst files (if you enable incremental checkpoints, the sst files already on the remote storage are not uploaded again), then completes the checkpoint. Can you check the network bandwidth usage during a checkpoint?
>>>>
>>>> Best,
>>>> Congxian
>>>>
>>>>
>>>> On Tue, 16 Jul 2019 at 22:45, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>>>>
>>>>> Hi all,
>>>>> We have a Flink job with user state, checkpointing to the RocksDB backend which is externally stored in AWS S3.
>>>>> After migrating our cluster from 1.6 to 1.8, we occasionally see that some slots do not acknowledge the checkpoints quickly enough. As an example: all slots acknowledge within 30-50 seconds except one, which acknowledges in 15 minutes. Checkpoint sizes are similar to each other, around 200-400 MB.
>>>>>
>>>>> We did not experience this weird behaviour in Flink 1.6. We have a 5-minute checkpoint interval, and this happens sometimes once an hour, sometimes more often, but not on every checkpoint request. Please see the screenshot below.
>>>>>
>>>>> Another point: for the faulty slots, the duration is consistently 15 minutes and some seconds; we couldn't find out where this 15-minute response time comes from. And each time it is a different task manager, not always the same one.
>>>>>
>>>>> Are you aware of any other users having similar issues with the new version, and is there a suggested bug fix or solution?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>> Bekir Oguz
>>>>>
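For reference, the setup Bekir describes (RocksDB state backend stored in AWS S3, a 5-minute checkpoint interval) would be configured roughly as in the sketch below, assuming the Flink 1.8 DataStream API and the flink-statebackend-rocksdb dependency; the S3 path and the min-pause value are placeholders, not taken from the thread. Note also that Flink's checkpoint timeout defaults to 10 minutes, so there is no built-in 15-minute timeout that would explain the observed durations.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 5-minute checkpoint interval, as described in the thread.
        env.enableCheckpointing(5 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB state backend with checkpoints on S3; the second argument enables
        // incremental checkpoints. The bucket path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true));

        CheckpointConfig conf = env.getCheckpointConfig();
        // Default checkpoint timeout is 10 minutes; shown here only to make it explicit.
        conf.setCheckpointTimeout(10 * 60 * 1000L);
        conf.setMinPauseBetweenCheckpoints(60 * 1000L); // illustrative value, not from the thread

        // ... job topology with keyed user state would be defined here ...
        // env.execute("checkpoint-setup-sketch");
    }
}

With incremental checkpoints enabled (the `true` constructor argument), only newly created sst files are uploaded on each checkpoint, which is the upload behaviour Congxian refers to; a slow upload would show up as async time rather than as the unexplained gap between trigger and acknowledgement.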