Hi Rick,

Good idea. If we make this a strategy, users can choose how the application_id is obtained.

Thanks,
Wenjun

On Fri, Sep 23, 2022 at 2:34 PM Rick Cheng wrote:

> Hi, Aaron Wang
>
> I agree that the current way of getting the yarn application id from the
> log is not elegant.
> Just for discussion, there is another way to get the yarn application id:
>
> We can put a unique tag on each task submitted from DS to yarn. E.g., for
> spark tasks, we can add the configuration
> "--conf spark.yarn.tags=some_unique_tag".
> After the task is submitted, DS can query the corresponding yarn
> application id (or other info) through this unique tag.
>
> What do you think? Any comments or discussions are welcome.
>
> 王维饶 wrote on Fri, Sep 23, 2022 at 12:15:
>
> > Hi, DolphinScheduler Community
> >
> > I'm a student in SOC-2022, responsible for optimizing the way the yarn
> > applicationId is collected. The old way, which parses the applicationId
> > from the log file, causes problems in production environments [1]. It
> > also has other potential problems, such as wasting CPU resources and
> > fetching the wrong applicationId because of uncontrollable log output in
> > a task. I've already created an issue about it [2].
> >
> > My main idea is to intercept yarn's submitApplication function with AOP
> > and fetch the appId from the application context. I've already verified
> > it for most types of yarn job (MapReduce, Hive, Spark, Flink, etc.) and
> > modified the related code.
> >
> > To be specific, every yarn job calls submitApplication to create a new
> > application. The applicationId can be written to {user.dir}/appInfo.log
> > and fetched directly by getAppIdsFromAppInfoFile, rather than parsed from
> > the log file by getAppIdsFromLogFile as in the old way. This is an
> > efficient way to fetch the applicationId and avoids the potential
> > problems mentioned earlier.
> >
> > However, this solution still has some open questions to discuss, so we
> > held a community meeting at 19:00, September 22 (GMT-8), organized by
> > GabryWu.
> > The following is the meeting summary.
> >
> > 🍊 *Issue1:* Evaluate & review the idea and design
> >
> > *> Conclusion1:* It seems reasonable and more efficient than the old way
> > and, in particular, it is not intrusive to task code. Eric suggests
> > adding a configuration option to choose how the applicationId is fetched
> > (old or new way) for stability.
> >
> > 🍓 *Issue2:* Whether to create a new module in the source code for the
> > AOP code?
> >
> > *> Conclusion2:* It's better to do so. Otherwise, the AOP code will behave
> > like a black box and will become invalid if a user replaces
> > submitApplication in secondary development.
> >
> > 🍈 *Issue3:* Some environment variable configurations need to be added to
> > dolphinscheduler_env.sh, like:
> >
> > export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"
> >
> > However, I don't think it's elegant to hard-code the dependency version,
> > which could cause operational problems. One possible solution is to
> > package known-dependencies.txt in the binary package and parse the
> > version from it.
> >
> > *> Conclusion3:* Ruan thinks it's not a big deal. The version will not
> > change often and will not add much operational cost.
> >
> > *> Other conclusions:* We should state in the user docs that users must
> > not override AOP-related environment variables while DS is running
> > (manually or by configuring them in the DS UI).
> >
> > I'm really looking forward to other suggestions or more discussion about
> > these questions.
> >
> > Thanks!
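> >
> > _____________________
> >
> > Hello everyone in the DolphinScheduler community,
> >
> > I am a participant in SOC-2022 (开源之夏2022), responsible for optimizing
> > how the yarn applicationId is collected. The original approach parses it
> > from the log with regular-expression matching, which causes many problems
> > in production environments [1]. It also leads to high CPU usage and to
> > ambiguous applicationIds being matched because of custom output in user
> > code. I have already opened an issue for this problem [2].
> >
> > The main idea is to intercept the submitApplication method with AOP. I
> > have verified that the yarn-scheduled compute tasks currently supported
> > by DS can all be intercepted by configuring environment variables, and I
> > have made initial changes to the corresponding code.
> >
> > In detail, every yarn job requests a new application through
> > submitApplication, which is followed by resource allocation and job
> > scheduling. The applicationId that AOP intercepts from this method can be
> > written to {user.dir}/appInfo.log and read directly by
> > getAppIdsFromAppInfoFile, without parsing the log in getAppIdsFromLogFile
> > as before. The new method is a more efficient way to get the
> > applicationId and avoids the potential problems mentioned above.
> >
> > However, this solution still has several open questions to discuss, so we
> > held a community meeting at 19:00 on 9.22 (GMT-8), organized by mentor
> > GabryWu. The attendees were me, GabryWu, Ruanwenjun, and Eric. The
> > following is a brief record of the meeting:
> >
> > 🍊 Issue 1: Evaluate the feasibility of the idea and design
> >
> > > Conclusion 1: The approach looks more efficient than the original one,
> > and it does not intrude on job code at all. Eric suggests that, for a
> > smooth transition, a configuration option should let users choose which
> > way of getting the applicationId to use (new or old).
> >
> > 🍓 Issue 2: Should a new module be created in the source code for the AOP
> > code?
> >
> > > Conclusion 2: It is best to do so. Otherwise the AOP code will be a
> > black box to users, and if a user has customized the yarn code and
> > modified submitApplication, AOP will stop working.
> >
> > 🍈 Issue 3: The approach requires adding environment variables to
> > dolphinscheduler_env.sh, for example:
> >
> > export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"
> >
> > However, I don't think hard-coding the dependency version is ideal, as it
> > may cause operational problems. One solution is to add dependencies.txt
> > to the binary package and parse the version from that file.
> >
> > > Conclusion 3: Ruan thinks this is not a big problem. The dependency
> > version will not change easily, so it will not cause much operational
> > trouble.
> >
> > > Other conclusions:
> >
> > The user documentation should state that users must not override
> > AOP-related environment variables while DS is running (by editing them
> > manually or by changing the environment configuration in the DS UI).
> >
> > I am very much looking forward to more suggestions and discussion about
> > this proposal.
> >
> > Thanks, everyone!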
> >
> > *Related issues:*
> >
> > [1] https://github.com/apache/dolphinscheduler/issues/11214
> >
> > [2] https://github.com/apache/dolphinscheduler/issues/11262
> >
> > *Meeting playback:*
> >
> > [1] Google Drive
> > link: https://drive.google.com/file/d/1JGShE4aNl3wJEF7jX0OuQD_anEaEotE3/view?usp=sharing
> >
> > [2] Baidu Netdisk
> > link: https://pan.baidu.com/s/1h5fmtEsOk86G9JBPGUcPDg
> > code: dhy3
> >
> > _____________________
> > Best Wishes
> > Radeity (Aaron Wang)
> > _____________________
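
To make the "old or new way" configuration from the meeting summary concrete, here is a minimal sketch in Python (DolphinScheduler itself is Java, so this only illustrates the idea). The strategy names "log" and "aop", the function names, and the one-id-per-line layout of appInfo.log are assumptions made for illustration; the thread does not specify them.

```python
import re
from pathlib import Path

# YARN application ids have the form application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def app_ids_from_log_file(log_path):
    """Old way: scan every log line with a regex.

    Anything in the log that looks like an id is returned, including ids
    printed by user code, which is the ambiguity described in the thread.
    """
    found = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            for app_id in APP_ID_PATTERN.findall(line):
                if app_id not in found:
                    found.append(app_id)
    return found


def app_ids_from_app_info_file(app_info_path):
    """New way: read the ids written by the AOP hook.

    Assumption: the file holds one application id per line.
    """
    path = Path(app_info_path)
    if not path.exists():
        return []
    stripped = (line.strip() for line in path.read_text(encoding="utf-8").splitlines())
    return [line for line in stripped if line]


# Hypothetical strategy names; a real configuration key may look different.
STRATEGIES = {
    "log": app_ids_from_log_file,
    "aop": app_ids_from_app_info_file,
}


def collect_app_ids(strategy, path):
    """Dispatch to the configured strategy, rejecting unknown names."""
    try:
        fetch = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown applicationId strategy: {strategy!r}") from None
    return fetch(path)
```

Keeping both functions behind one dispatch point means the default can stay on the old behaviour until the AOP path has proven stable, which is the motivation Eric gave for the option.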

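
Rick's tag-based alternative could look roughly like the sketch below. It assumes the task was submitted with a unique tag (for Spark, via spark.yarn.tags) and that the YARN ResourceManager web service is reachable; the Cluster Applications endpoint accepts an applicationTags query parameter. The ResourceManager address, the tag value, and the function names are hypothetical, and the parsing assumes the JSON shape {"apps": {"app": [{"id": ...}]}} returned by that endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_by_tag_url(rm_address, tag):
    """Build the Cluster Applications query URL filtered by one tag."""
    query = urlencode({"applicationTags": tag})
    return f"{rm_address.rstrip('/')}/ws/v1/cluster/apps?{query}"


def app_ids_from_response(payload):
    """Extract application ids from the decoded JSON response.

    "apps" may be null when nothing matches, so both levels are guarded.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [app["id"] for app in apps]


def find_app_ids_by_tag(rm_address, tag, timeout=10):
    """Query the ResourceManager and return the ids carrying the tag."""
    with urlopen(apps_by_tag_url(rm_address, tag), timeout=timeout) as response:
        return app_ids_from_response(json.load(response))
```

A call such as find_app_ids_by_tag("http://rm.example:8088", "ds_task_42") would then replace log parsing, at the cost of an extra request to the ResourceManager and the need to generate a tag that is unique per task instance.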