Hi, I would recommend having a look at the headers of the original email [1]. In those headers, there will be a value for 'Return-Path' with an email address. If you send an email to that address, it will unsubscribe you.
Best regards,
Martijn

[1] https://support.google.com/mail/answer/29436?hl=en#zippy=%2Cgmail

On Mon, 26 Jul 2021 at 12:10, R Bhaaagi <[email protected]> wrote:
> Hi Jing,
>
> I already sent emails to the mentioned email ids... the reply I got was that my email id is not available...
> Not sure why I am still receiving these emails.
>
> On Mon, 26 Jul 2021 at 9:42 AM, JING ZHANG <[email protected]> wrote:
> > Hi Bhagi,
> > To unsubscribe from emails about Flink JIRA activities, send an email to [email protected]
> >
> > To unsubscribe from the Flink dev mail list, send an email to [email protected]
> >
> > To unsubscribe from the Flink user mail list, send an email to [email protected]
> >
> > For more information, please go to [1].
> >
> > [1] https://flink.apache.org/community.html
> >
> > Best,
> > JING ZHANG
> >
> > R Bhaaagi <[email protected]> wrote on Friday, 23 July 2021 at 5:45 PM:
> > > Hi Team,
> > >
> > > Please unsubscribe me from all these emails....
> > >
> > > On Fri, 23 Jul 2021 at 2:19 PM, LINZ, Arnaud <[email protected]> wrote:
> > > > Hello,
> > > >
> > > > It's hard to say what caused the timeout to trigger – I agree with you that it should not have stopped the heartbeat thread, but it did. The easy fix was to increase it until we no longer see our app self-killed. The task was using a CPU-intensive computation (with a few threads created at some points… somehow breaking the "slot number" contract).
> > > > For the RAM cache, I believe that the heartbeat may also time out because of a busy network.
> > > >
> > > > Cheers,
> > > > Arnaud
> > > >
> > > > From: Till Rohrmann <[email protected]>
> > > > Sent: Thursday, 22 July 2021 11:33
> > > > To: LINZ, Arnaud <[email protected]>
> > > > Cc: Gen Luo <[email protected]>; Yang Wang <[email protected]>; dev <[email protected]>; user <[email protected]>
> > > > Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
> > > >
> > > > Thanks for your inputs Gen and Arnaud.
> > > >
> > > > I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however, not so sure whether we can give a hard threshold like 5000 tasks, for example, because as Arnaud said it strongly depends on the workload. Maybe we can explain it based on symptoms a user might experience and what to do then.
> > > >
> > > > Concerning your workloads, Arnaud, I'd be interested to learn a bit more. The user code runs in its own thread. This means that its operation won't block the main thread/heartbeat. The only thing that can happen is that the user code starves the heartbeat in terms of CPU cycles or causes a lot of GC pauses. If you are observing the former problem, then we might think about changing the priorities of the respective threads. This should then improve Flink's stability for these workloads and a shorter heartbeat timeout should be possible.
> > > >
> > > > Also for the RAM-cached repositories, what exactly is causing the heartbeat to time out? Is it because you have a lot of GC or because the heartbeat thread does not get enough CPU cycles?
> > > >
> > > > Cheers,
> > > > Till
> > > > On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <[email protected]> wrote:
> > > > Hello,
> > > >
> > > > From a user perspective: we have some (rare) use cases where we use "coarse grain" datasets, with big beans and tasks that do lengthy operations (such as ML training). In these cases we had to increase the timeout to huge values (heartbeat.timeout: 500000) so that our app is not killed. I'm aware this is not the way Flink was meant to be used, but it's a convenient way to distribute our workload on datanodes without having to use another concurrency framework (such as M/R) that would require the recoding of sources and sinks.
> > > >
> > > > In some other (most common) cases, our tasks do some R/W accesses to RAM-cached repositories backed by a key-value storage such as Kudu (or HBase). While most of those calls are very fast, sometimes when the system is under heavy load they may block for more than a few seconds, and having our app killed because of a short timeout is not an option.
> > > >
> > > > That's why I'm not in favor of very short timeouts… because in my experience it really depends on what user code does in the tasks. (I understand that normally, as user code is not a JVM-blocking activity such as a GC, it should have no impact on heartbeats, but from experience it really does.)
> > > >
> > > > Cheers,
> > > > Arnaud
> > > >
> > > > From: Gen Luo <[email protected]>
> > > > Sent: Thursday, 22 July 2021 05:46
> > > > To: Till Rohrmann <[email protected]>
> > > > Cc: Yang Wang <[email protected]>; dev <[email protected]>; user <[email protected]>
> > > > Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
> > > >
> > > > Hi,
> > > > Thanks for driving this @Till Rohrmann. I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure whether 15s and 3s would be enough either.
> > > >
> > > > IMO, except for the standalone cluster, where the heartbeat mechanism in Flink is totally relied on, reducing the heartbeat can also help the JM find out faster about TaskExecutors in abnormal conditions that cannot respond to heartbeat requests, e.g., continuous Full GC, where the TaskExecutor process is alive but the deployment system may not notice the problem. Since there are cases that can benefit from this change, I think it could be done if it won't break the experience in other scenarios.
> > > >
> > > > If we can address what blocks the main threads from processing heartbeats, or what enlarges the GC costs, we can try to get rid of them to have a more predictable heartbeat response time, or give some advice to users whose jobs may encounter these issues. For example, as far as I know the JM of a large-scale job will be busier and may not be able to process heartbeats in time; then we could advise users working with jobs larger than 5000 tasks to enlarge their heartbeat interval to 10s and timeout to 50s. The numbers are written casually.
> > > >
> > > > As for the issue in FLINK-23216, I think it should be fixed and may not be a main concern for this case.
> > > >
> > > > On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <[email protected]> wrote:
> > > > Thanks for sharing these insights.
> > > >
> > > > I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
> > > >
> > > > Given the GC pauses, would you then be ok with decreasing the heartbeat timeout to 20 seconds? This should give enough time to do the GC and then still send/receive a heartbeat request.
> > > >
> > > > I also wanted to add that we are about to get rid of one big cause of blocking I/O operations from the main thread. With FLINK-22483 [2] we will get rid of filesystem accesses to retrieve completed checkpoints. This leaves us with one additional file system access from the main thread, which is the one completing a pending checkpoint. I think it should be possible to get rid of this access because, as Stephan said, it only writes information to disk that has already been written before. Maybe solving these two issues could ease concerns about long pauses of unresponsiveness of Flink.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-23216
> > > > [2] https://issues.apache.org/jira/browse/FLINK-22483
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <[email protected]> wrote:
> > > > Thanks @Till Rohrmann for starting this discussion.
> > > >
> > > > Firstly, I try to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of a lost TaskManager faster. However, it seems that only the standalone cluster could benefit from this. For Yarn and native Kubernetes deployments, the Flink ResourceManager should get the TaskManager lost event in a very short time:
> > > >
> > > > * About 8 seconds for Yarn: 3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM
> > > > * Less than 1 second for Kubernetes: the Flink RM has a watch on all the TaskManager pods
> > > >
> > > > Secondly, I am not very confident about decreasing the timeout to 15s. I quickly checked the TaskManager GC logs of our internal Flink workloads over the past week and found more than 100 Full GC pauses of about 10 seconds, but none longer than 15s. We are using CMS GC for the old generation.
> > > >
> > > > Best,
> > > > Yang
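A minimal sketch of the kind of GC-log check Yang describes above, assuming classic CMS logs written with -XX:+PrintGCDetails, where each Full GC record carries a "real=... secs" wall-clock timing on the same line; the log path and the 10-second threshold are placeholders, not taken from the thread:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Counts Full GC pauses above a threshold in a CMS-style GC log (illustrative only). */
public class FullGcPauseScan {

    // Matches the wall-clock time of a GC event, e.g. "[Times: user=... sys=... real=12.34 secs]".
    private static final Pattern REAL_TIME = Pattern.compile("real=([0-9]+\\.[0-9]+) secs");

    public static void main(String[] args) throws IOException {
        String logFile = args.length > 0 ? args[0] : "taskmanager-gc.log"; // placeholder path
        double thresholdSeconds = 10.0;                                    // arbitrary threshold

        try (Stream<String> lines = Files.lines(Paths.get(logFile))) {
            long longPauses = lines
                    .filter(line -> line.contains("Full GC"))
                    .map(REAL_TIME::matcher)
                    .filter(Matcher::find)
                    .mapToDouble(m -> Double.parseDouble(m.group(1)))
                    .filter(pause -> pause >= thresholdSeconds)
                    .count();

            System.out.printf("Full GC pauses >= %.0fs: %d%n", thresholdSeconds, longPauses);
        }
    }
}

With a different collector or unified JVM logging (Java 9+), the pattern would need to be adapted to that log format.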
> > > > Till Rohrmann <[email protected]> wrote on Saturday, 17 July 2021 at 1:05 AM:
> > > > Hi everyone,
> > > >
> > > > Since Flink 1.5 we have had the same heartbeat timeout and interval default values, defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of Flink's components. Since then, there have been quite some advancements wrt the JVM's GCs and we also got rid of a lot of blocking calls that were executed in the main thread. Moreover, a long heartbeat.timeout causes long recovery times in case of a TaskManager loss because the system can only properly recover after the dead TaskManager has been removed from the scheduler. Hence, I wanted to propose to change the timeout and interval to:
> > > >
> > > > heartbeat.timeout: 15s
> > > > heartbeat.interval: 3s
> > > >
> > > > Since there is no perfect solution that fits all use cases, I would really like to hear from you what you think about it and how you configure these heartbeat options. Based on your experience we might actually come up with better default values that allow us to be resilient but also to detect failed components fast. FLIP-185 can be found here [1].
> > > >
> > > > [1] https://cwiki.apache.org/confluence/x/GAoBCw
> > > >
> > > > Cheers,
> > > > Till
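For readers who want to experiment with the values discussed in this thread, below is a minimal sketch of overriding the heartbeat options on a local environment. It assumes the heartbeat.interval and heartbeat.timeout keys take milliseconds, as in the Flink 1.x versions under discussion; in a real deployment the same keys would normally be set in flink-conf.yaml instead, and the job is just a placeholder:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HeartbeatConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Proposed FLIP-185 defaults (values in milliseconds), equivalent to
        // the flink-conf.yaml entries:
        //   heartbeat.interval: 3000
        //   heartbeat.timeout: 15000
        conf.setLong("heartbeat.interval", 3_000L);
        conf.setLong("heartbeat.timeout", 15_000L);

        // A workload like the one Arnaud describes would instead raise the timeout, e.g.:
        // conf.setLong("heartbeat.timeout", 500_000L);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);

        // Placeholder pipeline so the configuration can be tried end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("heartbeat-config-sketch");
    }
}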
> > > > ________________________________
> > > >
> > > > The integrity of this message cannot be guaranteed on the Internet. The company that sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized use or dissemination is prohibited. If you are not the intended recipient of this message, then please delete it and notify the sender.
> > >
> > > --
> > > Thanks & Regards, Bhagi
>
> --
> Thanks & Regards, Bhagi
