reele opened a new issue, #15129: URL: https://github.com/apache/dolphinscheduler/issues/15129
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

Two kinds of datetime are currently involved during execution:

1. Schedule time
2. Execution time (or start time)

There is a logical problem, however: the `dependent` node uses `processInstance.scheduleTime` as the date baseline and matches it against `startTime`.

In version `3.2.0`, the matching logic of the `dependent` node is:

1. Find the `ProcessInstance` of the depended-on task by calling `findLastProcessInterval`. The search order is:
   - a. Query `ProcessInstance` via `queryLastSchedulerProcessInterval`, with the condition `schedule_time >= #{startTime} and schedule_time <= #{endTime}`.
   - b. Query `ProcessInstance` via `queryLastManualProcessInterval`, with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116
2. Match the depended-on task by calling `findValidTaskListByProcessId`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166

In the `dev` version, the matching logic of the `dependent` node is:

1. Find the `TaskInstance` by calling `queryLastTaskInstanceIntervalByTaskCode`, with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255

In my scenario (banking, data warehouse), the "business date" or "data date" is especially important. For data processing, what a dependency needs to know is "which day's data has been processed", not "on which day the processing ran". The consumers of the data include regulatory reporting, reports, and many management systems, all of which are highly sensitive to the business date of the data. For this reason, professional scheduling tools in the banking industry work mainly from the business date of the data rather than the execution time of the task.

Furthermore, a bank has many related business systems: besides the direct business systems, there are management systems, third-party data sources, and third-party hosted systems. Data as of the end of last month may well not be produced until the 2nd or 3rd of the following month; for example, posting and adjusting entries in the financial system can lag by several days. This situation is very common and routine. The data warehouse I am responsible for currently has more than 40 upstream systems or data sources, at least 5 of which may be delayed, and more than a hundred data processing tasks are postponed when an upstream system is late.

So in an environment that is very sensitive to the business date and where delays can occur, versions `3.2.0` and `dev` have the following problems:

1. In version `3.2.0`, suppose the task `T1` that the `dependent` node (today) depends on has not yet run today. If I then run a backfill (complement data) of `T1` for a date three days ago, this triggers step `1.b` of the `3.2.0` logic above, and the dependency is unexpectedly detected as satisfied.
2. In the `dev` version, suppose the upstream data delivery is delayed by one day and the depended-on task `T1` only completes on the following day. The `dependent` node and all of its downstream nodes will then never succeed, because no new task will ever have a `startTime` that falls on the previous day.

### What you expected to happen

First, is it reasonable to match a task's `startTime` against the workflow's `scheduleTime`?

Second, if both requirements genuinely exist, could an option be added to specify whether the dependent node checks the schedule time or the actual start time?

### How to reproduce

Please refer to the description above.

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
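
The two failure scenarios described in this issue can be illustrated with a small simulation. This is only a simplified Python model of the query conditions quoted above, not DolphinScheduler's actual Java code: the names `Run`, `match_3_2_0`, `match_dev`, and `match_by_schedule_time` are hypothetical, the dates are made up, and the backfill run is assumed to carry the backfilled date as its schedule time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for one execution of the depended-on task T1."""
    name: str
    schedule_time: datetime  # business/data date the run is for (assumption)
    start_time: datetime     # wall-clock time the run actually started


def _latest(runs: List[Run]) -> Optional[Run]:
    # Return the most recently started run, or None if nothing matched.
    return max(runs, key=lambda r: r.start_time, default=None)


def match_3_2_0(runs: List[Run], start: datetime, end: datetime) -> Optional[Run]:
    # Step 1.a: schedule_time inside the interval.
    scheduled = [r for r in runs if start <= r.schedule_time <= end]
    if scheduled:
        return _latest(scheduled)
    # Step 1.b: fall back to start_time inside the interval.
    return _latest([r for r in runs if start <= r.start_time <= end])


def match_dev(runs: List[Run], start: datetime, end: datetime) -> Optional[Run]:
    # dev: only start_time inside the interval is considered.
    return _latest([r for r in runs if start <= r.start_time <= end])


def match_by_schedule_time(runs: List[Run], start: datetime, end: datetime) -> Optional[Run]:
    # The behaviour asked for here: match on the schedule (business) date only.
    return _latest([r for r in runs if start <= r.schedule_time <= end])


# The dependent node checks the interval "2023-11-10".
day_start = datetime(2023, 11, 10, 0, 0, 0)
day_end = datetime(2023, 11, 10, 23, 59, 59)

# Scenario 1: a backfill for three days ago that happens to run today.
backfill = Run("T1 backfill", datetime(2023, 11, 7), datetime(2023, 11, 10, 9, 0))
# Scenario 2: the run for 2023-11-10 that only starts on the following day.
late = Run("T1 late", datetime(2023, 11, 10, 1, 0), datetime(2023, 11, 11, 8, 0))

false_success = match_3_2_0([backfill], day_start, day_end)
never_matches = match_dev([late], day_start, day_end)

print("3.2.0 matched:", false_success)  # the backfill run is matched via step 1.b
print("dev matched:", never_matches)    # nothing is matched, so the node waits forever
print("by schedule time:", match_by_schedule_time([backfill, late], day_start, day_end))
```

In this model, matching on the schedule time alone rejects the backfill run and accepts the late run, which is the behaviour a business-date-driven dependency would need.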
