Hi, I would recommend having a look at the headers of the original email [1]. In those headers, there will be a value for 'Return-Path' with an email address. If you send an email to that address, it will unsubscribe you.
Best regards,
Martijn

[1] https://support.google.com/mail/answer/29436?hl=en#zippy=%2Cgmail

On Mon, 26 Jul 2021 at 12:10, R Bhaaagi <[email protected]> wrote:
> Hi Jing,
>
> I already sent emails to the mentioned email ids... the reply I got was that my email id is not available...
> Not sure why I am still receiving these emails.
>
> On Mon, 26 Jul 2021 at 9:42 AM, JING ZHANG <[email protected]> wrote:
> > Hi Bhagi,
> > To unsubscribe from emails about Flink JIRA activities, send an email to [email protected]
> >
> > To unsubscribe from the Flink dev mail list, send an email to [email protected]
> >
> > To unsubscribe from the Flink user mail list, send an email to [email protected]
> >
> > For more information, please go to [1].
> >
> > [1] https://flink.apache.org/community.html
> >
> > Best,
> > JING ZHANG
> >
> > R Bhaaagi <[email protected]> wrote on Friday, 23 July 2021 at 5:45 PM:
> > > Hi Team,
> > >
> > > Please unsubscribe me from all these emails....
> > >
> > > On Fri, 23 Jul 2021 at 2:19 PM, LINZ, Arnaud <[email protected]> wrote:
> > > > Hello,
> > > >
> > > > It's hard to say what caused the timeout to trigger – I agree with you that it should not have stopped the heartbeat thread, but it did. The easy fix was to increase it until we no longer see our app self-killed. The task was using a CPU-intensive computation (with a few threads created at some points… somehow breaking the "slot number" contract).
> > > > For the RAM cache, I believe that the heartbeat may also time out because of a busy network.
> > > >
> > > > Cheers,
> > > > Arnaud
> > > >
> > > > From: Till Rohrmann <[email protected]>
> > > > Sent: Thursday, 22 July 2021 11:33
> > > > To: LINZ, Arnaud <[email protected]>
> > > > Cc: Gen Luo <[email protected]>; Yang Wang <[email protected]>; dev <[email protected]>; user <[email protected]>
> > > > Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
> > > >
> > > > Thanks for your inputs Gen and Arnaud.
> > > >
> > > > I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however, not so sure whether we can give a hard threshold like 5000 tasks, for example, because as Arnaud said it strongly depends on the workload. Maybe we can explain it based on symptoms a user might experience and what to do then.
> > > >
> > > > Concerning your workloads, Arnaud, I'd be interested to learn a bit more. The user code runs in its own thread. This means that its operation won't block the main thread/heartbeat. The only thing that can happen is that the user code starves the heartbeat in terms of CPU cycles or causes a lot of GC pauses. If you are observing the former problem, then we might think about changing the priorities of the respective threads. This should then improve Flink's stability for these workloads and a shorter heartbeat timeout should be possible.
> > > >
> > > > Also for the RAM-cached repositories, what exactly is causing the heartbeat to time out? Is it because you have a lot of GC or because the heartbeat thread does not get enough CPU cycles?
> > > >
> > > > Cheers,
> > > > Till
> > > > On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <[email protected]> wrote:
> > > > Hello,
> > > >
> > > > From a user perspective: we have some (rare) use cases where we use "coarse grain" datasets, with big beans and tasks that do lengthy operations (such as ML training). In these cases we had to increase the timeout to huge values (heartbeat.timeout: 500000) so that our app is not killed. I'm aware this is not the way Flink was meant to be used, but it's a convenient way to distribute our workload on datanodes without having to use another concurrency framework (such as M/R) that would require the recoding of sources and sinks.
> > > >
> > > > In some other (most common) cases, our tasks do some R/W accesses to RAM-cached repositories backed by a key-value storage such as Kudu (or HBase). While most of those calls are very fast, sometimes when the system is under heavy load they may block for more than a few seconds, and having our app killed because of a short timeout is not an option.
> > > >
> > > > That's why I'm not in favor of very short timeouts… because in my experience it really depends on what user code does in the tasks. (I understand that normally, as user code is not a JVM-blocking activity such as a GC, it should have no impact on heartbeats, but from experience it really does.)
> > > >
> > > > Cheers,
> > > > Arnaud
> > > >
> > > > From: Gen Luo <[email protected]>
> > > > Sent: Thursday, 22 July 2021 05:46
> > > > To: Till Rohrmann <[email protected]>
> > > > Cc: Yang Wang <[email protected]>; dev <[email protected]>; user <[email protected]>
> > > > Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
> > > >
> > > > Hi,
> > > > Thanks for driving this @Till Rohrmann. I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure whether 15s and 3s would be enough either.
> > > >
> > > > IMO, except for the standalone cluster, where the heartbeat mechanism in Flink is totally relied on, reducing the heartbeat can also help the JM find out faster about TaskExecutors in abnormal conditions that cannot respond to heartbeat requests, e.g., continuous Full GC, where the TaskExecutor process is alive but the deployment system may not notice the problem. Since there are cases that can benefit from this change, I think it could be done if it won't break the experience in other scenarios.
> > > >
> > > > If we can address what blocks the main threads from processing heartbeats, or what enlarges the GC costs, we can try to get rid of them to have a more predictable heartbeat response time, or give some advice to users whose jobs may encounter these issues. For example, as far as I know the JM of a large-scale job will be busier and may not be able to process heartbeats in time; then we could advise users working with jobs larger than 5000 tasks to enlarge their heartbeat interval to 10s and timeout to 50s. The numbers are written casually.
> > > >
> > > > As for the issue in FLINK-23216, I think it should be fixed and may not be a main concern for this case.
> > > >
> > > > On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <[email protected]> wrote:
> > > > Thanks for sharing these insights.
> > > >
> > > > I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
> > > >
> > > > Given the GC pauses, would you then be ok with decreasing the heartbeat timeout to 20 seconds? This should give enough time to do the GC and then still send/receive a heartbeat request.
> > > >
> > > > I also wanted to add that we are about to get rid of one big cause of blocking I/O operations from the main thread. With FLINK-22483 [2] we will get rid of filesystem accesses to retrieve completed checkpoints. This leaves us with one additional file system access from the main thread, which is the one completing a pending checkpoint. I think it should be possible to get rid of this access because, as Stephan said, it only writes information to disk that has already been written before. Maybe solving these two issues could ease concerns about long pauses of unresponsiveness of Flink.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-23216
> > > > [2] https://issues.apache.org/jira/browse/FLINK-22483
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <[email protected]> wrote:
> > > > Thanks @Till Rohrmann for starting this discussion.
> > > >
> > > > Firstly, I try to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of a lost TaskManager faster. However, it seems that only the standalone cluster could benefit from this. For Yarn and native Kubernetes deployments, the Flink ResourceManager should get the TaskManager lost event in a very short time:
> > > >
> > > > * About 8 seconds for Yarn: 3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM
> > > > * Less than 1 second for Kubernetes: the Flink RM has a watch on all the TaskManager pods
> > > >
> > > > Secondly, I am not very confident about decreasing the timeout to 15s. I quickly checked the TaskManager GC logs of our internal Flink workloads over the past week and found more than 100 Full GC pauses of about 10 seconds, but none longer than 15s. We are using CMS GC for the old generation.
> > > >
> > > > Best,
> > > > Yang
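A minimal sketch of the kind of GC-log check Yang describes above, assuming classic CMS logs written with -XX:+PrintGCDetails, where each Full GC record carries a "real=... secs" wall-clock timing on the same line; the log path and the 10-second threshold are placeholders, not taken from the thread:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Counts Full GC pauses above a threshold in a CMS-style GC log (illustrative only). */
public class FullGcPauseScan {

    // Matches the wall-clock time of a GC event, e.g. "[Times: user=... sys=... real=12.34 secs]".
    private static final Pattern REAL_TIME = Pattern.compile("real=([0-9]+\\.[0-9]+) secs");

    public static void main(String[] args) throws IOException {
        String logFile = args.length > 0 ? args[0] : "taskmanager-gc.log"; // placeholder path
        double thresholdSeconds = 10.0;                                    // arbitrary threshold

        try (Stream<String> lines = Files.lines(Paths.get(logFile))) {
            long longPauses = lines
                    .filter(line -> line.contains("Full GC"))
                    .map(REAL_TIME::matcher)
                    .filter(Matcher::find)
                    .mapToDouble(m -> Double.parseDouble(m.group(1)))
                    .filter(pause -> pause >= thresholdSeconds)
                    .count();

            System.out.printf("Full GC pauses >= %.0fs: %d%n", thresholdSeconds, longPauses);
        }
    }
}

With a different collector or unified JVM logging (Java 9+), the pattern would need to be adapted to that log format.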
> > > > Till Rohrmann <[email protected]> wrote on Saturday, 17 July 2021 at 1:05 AM:
> > > > Hi everyone,
> > > >
> > > > Since Flink 1.5 we have had the same heartbeat timeout and interval default values, defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of Flink's components. Since then, there have been quite some advancements wrt the JVM's GCs and we also got rid of a lot of blocking calls that were executed in the main thread. Moreover, a long heartbeat.timeout causes long recovery times in case of a TaskManager loss because the system can only properly recover after the dead TaskManager has been removed from the scheduler. Hence, I wanted to propose to change the timeout and interval to:
> > > >
> > > > heartbeat.timeout: 15s
> > > > heartbeat.interval: 3s
> > > >
> > > > Since there is no perfect solution that fits all use cases, I would really like to hear from you what you think about it and how you configure these heartbeat options. Based on your experience we might actually come up with better default values that allow us to be resilient but also to detect failed components fast. FLIP-185 can be found here [1].
> > > >
> > > > [1] https://cwiki.apache.org/confluence/x/GAoBCw
> > > >
> > > > Cheers,
> > > > Till
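For readers who want to experiment with the values discussed in this thread, below is a minimal sketch of overriding the heartbeat options on a local environment. It assumes the heartbeat.interval and heartbeat.timeout keys take milliseconds, as in the Flink 1.x versions under discussion; in a real deployment the same keys would normally be set in flink-conf.yaml instead, and the job is just a placeholder:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HeartbeatConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Proposed FLIP-185 defaults (values in milliseconds), equivalent to
        // the flink-conf.yaml entries:
        //   heartbeat.interval: 3000
        //   heartbeat.timeout: 15000
        conf.setLong("heartbeat.interval", 3_000L);
        conf.setLong("heartbeat.timeout", 15_000L);

        // A workload like the one Arnaud describes would instead raise the timeout, e.g.:
        // conf.setLong("heartbeat.timeout", 500_000L);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);

        // Placeholder pipeline so the configuration can be tried end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("heartbeat-config-sketch");
    }
}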
> > > > ________________________________
> > > >
> > > > The integrity of this message cannot be guaranteed on the Internet. The company that sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized use or dissemination is prohibited. If you are not the intended recipient of this message, then please delete it and notify the sender.
> > >
> > > --
> > > Thanks & Regards, Bhagi
>
> --
> Thanks & Regards, Bhagi
