Re: [DISCUSS] FLIP-118: Improve Flink’s ID system

Yangze Guo Tue, 31 Mar 2020 02:50:17 -0700

Hi, Zhu,

Thanks for the feedback.


> make JobVertexID a composition of JobID and a topology index
I think it is a good idea. However, it seems the JobVertexID is
derived from hashcode which used to identify them across submission.
I'm not familiar with that component though. I prefer to keep this
idea out the scope of this FLIP if no one could help us to figure it
out.

> How about we still keep IntermediateDataSetID independent from JobVertexID,
> but just print the producing relationships in logs? I think keeping
> IntermediateDataSetID independent may be better considering the cross job
> result usages in interactive query cases.
I think you are right. I'll keep IntermediateDataSetID independent
from JobVertexID.

> The new IDs will become larger with this rework.
Yes, I also have the same concern. Benchmark is necessary, I'll try to
provide one during the implementation phase.


Best,
Yangze Guo

On Tue, Mar 31, 2020 at 4:55 PM Zhu Zhu <reed...@gmail.com> wrote:
>
> Thanks for proposing this improvement Yangze. Big +1 for the overall
> proposal. It can help a lot in troubleshooting.
>
> Here are a few questions for it:
> 1. Shall we make JobVertexID a composition of JobID and a topology index?
> This would help in the session cluster case, so that we can identify which
> tasks are from which jobs along with the rework of ExecutionAttemptID.
>
> 2. You mentioned that "Add the producer info to the string literal of
> IntermediateDataSetID". Do you mean to make IntermediateDataSetID a
> composition of JobVertexID and a consumer index?
> How about we still keep IntermediateDataSetID independent from JobVertexID,
> but just print the producing relationships in logs? I think keeping
> IntermediateDataSetID independent may be better considering the cross job
> result usages in interactive query cases.
>
> 3. The new IDs will become larger with this rework. The
> TaskDeploymentDescriptor can become much larger since it is mainly composed
> of a variety DIs. I'm not sure how much it would be but there can be more
> memory and CPU cost for it, and results in more frequent GCs, message size
> exceeding akka frame limits, and a longer blocked time of main thread.
> This should not be a problem in most cases but might be a problem for large
> scale jobs. Shall we have an benchmark for it?
>
> Thanks,
> Zhu Zhu
>
> Yangze Guo <karma...@gmail.com> 于2020年3月31日周二 下午2:19写道：
>
> > Thank you all for the feedback! Sorry for the belated reply.
> >
> > @Till
> > I'm +1 for your two ideas and I'd like to move these two out of the
> > scope of this FLIP since the pipelined region scheduling is an ongoing
> > work now.
> > I also agree that we should not make the InstanceID in
> > TaskExecutorConnection being composed of the ResourceID plus a
> > monotonically increasing value. Thanks a lot for your explanation.
> >
> > @Konstantin @Yang
> > Regarding the PodName of TaskExecutor on K8s, I second Yang's
> > suggestion. It makes sense to me to let user export RESOURCE_ID and
> > make TM respect it. User needs to guarantee there is no collision for
> > different TM.
> >
> > Best,
> > Yangze Guo
> >
> >
> > On Tue, Mar 31, 2020 at 12:25 AM Steven Wu <stevenz...@gmail.com> wrote:
> > >
> > > +1 on allowing user defined resourceId for taskmanager
> > >
> > > On Sun, Mar 29, 2020 at 7:24 PM Yang Wang <danrtsey...@gmail.com> wrote:
> > >
> > > > Hi Konstantin,
> > > >
> > > > I think it is a good idea. Currently, our users also report a similar
> > issue
> > > > with
> > > > resourceId of standalone cluster. When we start a standalone cluster
> > now,
> > > > the `TaskManagerRunner` always generates a uuid for the resourceId. It
> > will
> > > > be used to register to the jobmanager and not convenient to match with
> > the
> > > > real
> > > > taskmanager, especially in container environment.
> > > >
> > > > I think a probably solution is we could support the user defined
> > > > resourceId.
> > > > We could get it from the environment. For standalone on K8s, we could
> > set
> > > > the "RESOURCE_ID" env to the pod name so that it is easier to match the
> > > > taskmanager with K8s pod.
> > > >
> > > > Moreover, i am afraid we could not set the pod name to the resourceId.
> > I
> > > > think
> > > > you could set the "deployment.meta.name". Since the pod name is
> > generated
> > > > by
> > > > K8s in the pattern {deployment.meta.nane}-{rc.uuid}-{uuid}. On the
> > > > contrary, we
> > > > will set the resourceId to the pod name.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Konstantin Knauf <konstan...@ververica.com> 于2020年3月29日周日 下午8:06写道：
> > > >
> > > > > Hi Yangze, Hi Till,
> > > > >
> > > > > thanks you for working on this topic. I believe it will make
> > debugging
> > > > > large Apache Flink deployments much more feasible.
> > > > >
> > > > > I was wondering whether it would make sense to allow the user to
> > specify
> > > > > the Resource ID in standalone setups?  For example, many users still
> > > > > implicitly use standalone clusters on Kubernetes (the native support
> > is
> > > > > still experimental) and in these cases it would be interesting to
> > also
> > > > set
> > > > > the PodName as the ResourceID. What do you think?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Kosntantin
> > > > >
> > > > > On Thu, Mar 26, 2020 at 6:49 PM Till Rohrmann <trohrm...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi Yangze,
> > > > > >
> > > > > > thanks for creating this FLIP. I think it is a very good
> > improvement
> > > > > > helping our users and ourselves understanding better what's going
> > on in
> > > > > > Flink.
> > > > > >
> > > > > > Creating the ResourceIDs with host information/pod name is a good
> > idea.
> > > > > >
> > > > > > Also deriving ExecutionGraph IDs from their superset ID is a good
> > idea.
> > > > > >
> > > > > > The InstanceID is used for fencing purposes. I would not make it a
> > > > > > composition of the ResourceID + a monotonically increasing number.
> > The
> > > > > > problem is that in case of a RM failure the InstanceIDs would start
> > > > from
> > > > > 0
> > > > > > again and this could lead to collisions.
> > > > > >
> > > > > > Logging more information on how the different runtime IDs are
> > > > correlated
> > > > > is
> > > > > > also a good idea.
> > > > > >
> > > > > > Two other ideas for simplifying the ids are the following:
> > > > > >
> > > > > > * The SlotRequestID was introduced because the SlotPool was a
> > separate
> > > > > > RpcEndpoint a while ago. With this no longer being the case I
> > think we
> > > > > > could remove the SlotRequestID and replace it with the
> > AllocationID.
> > > > > > * Instead of creating new SlotRequestIDs for multi task slots one
> > could
> > > > > > derive them from the SlotRequestID used for requesting the
> > underlying
> > > > > > AllocatedSlot.
> > > > > >
> > > > > > Given that the slot sharing logic will most likely be reworked
> > with the
> > > > > > pipelined region scheduling, we might be able to resolve these two
> > > > points
> > > > > > as part of the pipelined region scheduling effort.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Thu, Mar 26, 2020 at 10:51 AM Yangze Guo <karma...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > We would like to start a discussion thread on "FLIP-118: Improve
> > > > > > > Flink’s ID system"[1].
> > > > > > >
> > > > > > > This FLIP mainly discusses the following issues, target to
> > enhance
> > > > the
> > > > > > > readability of IDs in log and help user to debug in case of
> > failures:
> > > > > > >
> > > > > > > - Enhance the readability of the string literals of IDs. Most of
> > them
> > > > > > > are hashcodes, e.g. ExecutionAttemptID, which do not provide much
> > > > > > > meaningful information and are hard to recognize and compare for
> > > > > > > users.
> > > > > > > - Log the ID’s lineage information to make debugging more
> > convenient.
> > > > > > > Currently, the log fails to always show the lineage information
> > > > > > > between IDs. Finding out relationships between entities
> > identified by
> > > > > > > given IDs is a common demand, e.g., slot of which AllocationID is
> > > > > > > assigned to satisfy slot request of with SlotRequestID. Absence
> > of
> > > > > > > such lineage information, it’s impossible to track the end to end
> > > > > > > lifecycle of an Execution or a Task now, which makes debugging
> > > > > > > difficult.
> > > > > > >
> > > > > > > Key changes proposed in the FLIP are as follows:
> > > > > > >
> > > > > > > - Add location information to distributed components
> > > > > > > - Add topology information to graph components
> > > > > > > - Log the ID’s lineage information
> > > > > > > - Expose the identifier of distributing component to user
> > > > > > >
> > > > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > forward
> > > > > > to
> > > > > > > your feedbacks.
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148643521
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Konstantin Knauf | Head of Product
> > > > >
> > > > > +49 160 91394525
> > > > >
> > > > >
> > > > > Follow us @VervericaData Ververica <https://www.ververica.com/>
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> > > > > Conference
> > > > >
> > > > > Stream Processing | Event Driven | Real Time
> > > > >
> > > > > --
> > > > >
> > > > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> > > > >
> > > > > --
> > > > > Ververica GmbH
> > > > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > > > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
> > Ji
> > > > > (Tony) Cheng
> > > > >
> > > >
> >

Re: [DISCUSS] FLIP-118: Improve Flink’s ID system

Reply via email to