Hi Rick,

Good idea. If we make this a strategy, users can choose how the
application_id is parsed.

Thanks,
Wenjun

On Fri, Sep 23, 2022 at 2:34 PM Rick Cheng <[email protected]> wrote:
>
> Hi, Aaron Wang
>
> I agree that the current way of getting the YARN application id from the
> log is not elegant.
> Just for discussion, here is another way to get the YARN application id:
>
> We can put a unique tag on each task submitted from DS to YARN. E.g., for
> Spark tasks, we can add the configuration "--conf
> spark.yarn.tags=some_unique_tag".
> After the task is submitted, DS can query the corresponding YARN
> application id (or other info) through this unique tag.
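
A minimal sketch of the tag lookup described above, assuming DS queries the
ResourceManager REST API and filters /ws/v1/cluster/apps with the
applicationTags parameter. The host name, the tag value ds-task-42, and the
sample JSON are made up for illustration. The sketch only builds the URL and
parses a response body; it does not contact a cluster.

```python
import json
from urllib.parse import urlencode


def build_apps_url(rm_base_url: str, tag: str) -> str:
    """Build the ResourceManager REST query for applications carrying `tag`."""
    query = urlencode({"applicationTags": tag})
    return f"{rm_base_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


def extract_app_ids(response_body: str) -> list[str]:
    """Pull application ids out of a /ws/v1/cluster/apps JSON response."""
    payload = json.loads(response_body)
    # Tolerate an empty result, where "apps" may be null or missing.
    apps = (payload.get("apps") or {}).get("app") or []
    return [app["id"] for app in apps]


# Hypothetical tag that DS would attach via --conf spark.yarn.tags=ds-task-42
url = build_apps_url("http://rm-host:8088/", "ds-task-42")

# Hand-written sample shaped like an RM response; not real cluster output.
sample = (
    '{"apps": {"app": [{"id": "application_1663912345678_0007", '
    '"state": "RUNNING"}]}}'
)
app_ids = extract_app_ids(sample)
```

One design point to settle with this approach is how the tag is generated, since
it has to be unique per task instance for the lookup to be unambiguous.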
>
> What do you think? Any comments or discussions are welcome.
>
>
> On Fri, Sep 23, 2022 at 12:15 PM 王维饶 <[email protected]> wrote:
>
> > Hi, DolphinScheduler Community
> >
> > I'm a student in SOC-2022, responsible for optimizing the way the yarn
> > applicationId is collected. The old way, which parses the applicationId
> > from the log file, causes problems in production environments [1]. It
> > also has other potential problems, such as wasting CPU resources and
> > fetching the wrong applicationId because of uncontrollable log output in
> > tasks. I've already created an issue about it [2].
> >
> > My main idea is to intercept yarn's submitApplication function with AOP
> > and fetch the appId from the application context. I've already verified
> > it for most types of yarn jobs (MapReduce, Hive, Spark, Flink, etc.) and
> > modified the relevant code.
> >
> > To be specific, every yarn job calls submitApplication to create a new
> > application. The applicationId can be written to {user.dir}/appInfo.log
> > and fetched directly by getAppIdsFromAppInfoFile, rather than parsed from
> > the log file in getAppIdsFromLogFile as in the old way. This is an
> > efficient way to fetch the applicationId, and it avoids the potential
> > problems mentioned earlier.
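
To make the difference concrete, here is a rough sketch of the two lookups.
It is written in Python only for brevity; the function names mirror
getAppIdsFromLogFile and getAppIdsFromAppInfoFile from the proposal, but the
bodies, the regex, and the one-id-per-line layout of appInfo.log are
assumptions, not the actual DS code.

```python
import re
import tempfile
from pathlib import Path

APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def get_app_ids_from_log(log_text: str) -> list[str]:
    """Old way: scan the whole task log and keep every distinct match."""
    return list(dict.fromkeys(APP_ID_PATTERN.findall(log_text)))


def get_app_ids_from_app_info_file(path: Path) -> list[str]:
    """New way: read the ids the interceptor wrote (assumed one per line)."""
    if not path.exists():
        return []
    return [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]


# A task log where user output happens to mention an unrelated application.
task_log = (
    "Submitted application application_1663912345678_0007\n"
    "user code: retrying after application_1600000000000_0001 failed\n"
)

with tempfile.TemporaryDirectory() as tmp:
    app_info = Path(tmp) / "appInfo.log"
    app_info.write_text("application_1663912345678_0007\n")
    from_log = get_app_ids_from_log(task_log)          # picks up both ids
    from_file = get_app_ids_from_app_info_file(app_info)  # only the real one
```

The log scan returns the unrelated id as well, which is the "confused
applicationId" case, while the file read costs one small read and returns
only what was actually submitted.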
> >
> > However, this solution still has some open questions, so we held a
> > community meeting at 19:00, September 22 (GMT-8), organized by
> > GabryWu. The following is the meeting summary.
> >
> > 🍊* Issue1:* Evaluate & review the idea and design
> >
> > *> Conclusion1:* It seems reasonable and more efficient than the old way
> > and, in particular, is not intrusive to task code. Eric suggests adding a
> > configuration option to choose how applicationId is fetched (old or new
> > way) for stability.
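
One possible shape for such a switch, sketched in Python for brevity. The
configuration key appId.collect and its values "log" and "aop" are
hypothetical names chosen for this example, and the two collectors are
placeholders rather than real DS code.

```python
from typing import Callable, Dict, List

# Each collector takes the task's working directory and returns application ids.
Collector = Callable[[str], List[str]]


def collect_from_log(work_dir: str) -> List[str]:
    # Placeholder for the old regex-based log parsing.
    return []


def collect_from_app_info_file(work_dir: str) -> List[str]:
    # Placeholder for reading {user.dir}/appInfo.log.
    return []


COLLECTORS: Dict[str, Collector] = {
    "log": collect_from_log,
    "aop": collect_from_app_info_file,
}


def pick_collector(config: Dict[str, str]) -> Collector:
    """Return the collector named by the hypothetical 'appId.collect' key."""
    mode = config.get("appId.collect", "log")  # default to the old way
    try:
        return COLLECTORS[mode]
    except KeyError:
        raise ValueError(f"unknown appId.collect value: {mode!r}") from None
```

Defaulting to the old way keeps existing deployments unchanged until users
opt in, and a lookup table like this leaves room for further strategies.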
> >
> >
> > 🍓* Issue2: *Whether to create a new module in the source code for the
> > AOP code?
> >
> > *> Conclusion2:* It's better to do so. Otherwise, the AOP code will
> > behave like a black box, and it will stop working if users replace
> > submitApplication in secondary development.
> >
> >
> > 🍈* Issue3:* Some environment variable configuration needs to be added
> > to dolphinscheduler_env.sh, like:
> >
> > export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"
> >
> > However, I don't think it's elegant to hard-code the dependency version,
> > which may cause operational problems. One possible solution is to
> > package known-dependencies.txt in the binary package and parse the
> > version from it.
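
A rough sketch of that parsing step, in Python for brevity. It assumes
known-dependencies.txt lists one jar file name per line; the excerpt and the
install path /opt/dolphinscheduler below are made up for the example.

```python
import re

WEAVER_JAR = re.compile(r"^aspectjweaver-(\d+(?:\.\d+)*)\.jar$")


def find_weaver_version(known_dependencies: str) -> str:
    """Return the aspectjweaver version listed in a known-dependencies.txt body."""
    for line in known_dependencies.splitlines():
        match = WEAVER_JAR.match(line.strip())
        if match:
            return match.group(1)
    raise LookupError("aspectjweaver is not listed")


def hadoop_client_opts(ds_home: str, version: str) -> str:
    """Build the -javaagent option from the thread for a given version."""
    return f"-javaagent:{ds_home}/tools/libs/aspectjweaver-{version}.jar"


# Made-up excerpt; the real file would ship inside the binary package.
listing = "asm-9.1.jar\naspectjweaver-1.9.7.jar\naspectjrt-1.9.7.jar\n"
version = find_weaver_version(listing)
opts = hadoop_client_opts("/opt/dolphinscheduler", version)
```

With something like this, dolphinscheduler_env.sh would not need to change
when the aspectjweaver dependency is upgraded.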
> >
> > *> Conclusion3: *Ruan thinks it's not a big deal. The version will not
> > change often and will not add much operational cost.
> >
> >
> > *> Other conclusions: *We should state in the user docs that AOP-related
> > environment variables must not be overridden while DS is running (either
> > manually or through configuration in the DS UI).
> >
> >
> > Really looking forward to other suggestions or more discussion about
> > these questions.
> >
> > Thanks!
> >
> > ———————————————————————————————————————
> >
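> > Hello everyone in the DolphinScheduler community,
> >
> > I am a participant in SOC-2022 (开源之夏2022), responsible for optimizing
> > how the yarn applicationId is collected. The original approach obtains it
> > by regex matching against the log, which causes many problems in
> > production environments [1]. It can also lead to excessive CPU usage, and
> > custom output in user code can match an ambiguous applicationId. I have
> > already opened an issue for this problem [2].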
> >
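> > The main idea is to intercept the submitApplication method with AOP. I
> > have verified that the yarn-scheduled computing tasks currently supported
> > by DS can all be intercepted by configuring environment variables, and I
> > have made preliminary changes to the corresponding code.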
> >
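> > In detail, every yarn job calls submitApplication to request a new
> > application, followed by resource allocation and job scheduling. The
> > applicationId that AOP intercepts from this method can be written to
> > {user.dir}/appInfo.log and fetched directly with the
> > getAppIdsFromAppInfoFile method, instead of being parsed from the log in
> > the getAppIdsFromLogFile method as before. The new method is a more
> > efficient way to get the applicationId and avoids the potential problems
> > mentioned above.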
> >
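> > However, this solution still has several open questions to discuss, so
> > on 9.22 at 19:00 (GMT-8) we held a community meeting organized by my
> > mentor GabryWu. The attendees were myself, GabryWu, Ruanwenjun, and Eric.
> > The following is a brief record of the meeting: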
> >
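> > 🍊 Issue 1: Evaluate the feasibility of the idea and design
> >
> > > Conclusion 1: This solution looks more efficient than the original, and
> > it does not intrude on job code at all. For a smooth transition, Eric
> > suggests providing a configuration option so users can choose which
> > applicationId collection method to use (new or old).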
> >
> >
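> > 🍓 Issue 2: Should a new module be created in the source code for the
> > AOP code?
> >
> > > Conclusion 2: It is better to do so. Otherwise the AOP code would be a
> > black box to users, and AOP would stop working if users customize the
> > yarn code and modify submitApplication.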
> >
> >
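> > 🍈 Issue 3: This solution requires adding environment variables to
> > dolphinscheduler_env.sh, for example:
> >
> > export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"
> >
> > However, I do not think hard-coding the dependency version is ideal,
> > since it may cause operational problems. One solution is to add
> > dependencies.txt to the binary package and parse the version number from
> > it.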
> >
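> > > Conclusion 3: Ruan thinks this is not a big problem. The dependency
> > version will not change easily, so it will not cause much operational
> > trouble.
> >
> >
> > > Other conclusions: The user documentation should state that users must
> > not override AOP-related environment variables while DS is running
> > (either by manual modification or by changing the environment
> > configuration in the DS UI).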
> >
> >
> > I very much look forward to more suggestions and discussion about this
> > solution.
> >
> > Thank you all!
> >
> >
> > *Related issue:*
> >
> > [1] https://github.com/apache/dolphinscheduler/issues/11214
> >
> > [2] https://github.com/apache/dolphinscheduler/issues/11262
> >
> >
> > *Meeting playback:*
> >
> > [1] Google Drive link:
> > https://drive.google.com/file/d/1JGShE4aNl3wJEF7jX0OuQD_anEaEotE3/view?usp=sharing
> >
> > [2] Baidu Netdisk link: https://pan.baidu.com/s/1h5fmtEsOk86G9JBPGUcPDg
> > code: dhy3
> >
> >
> > _____________________
> >
> > Best Wishes
> >
> > Radeity (Aaron Wang)
> >
> > _____________________
> >
