github-actions[bot] commented on issue #15129: URL: https://github.com/apache/dolphinscheduler/issues/15129#issuecomment-1797849272
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

Currently there are two types of datetimes in execution:

1. Scheduling time
2. Execution time (or start time)

But there is currently a logical problem: the `dependent` node uses `processInstance.scheduleTime` as the date base to match `startTime`.

In version 3.2.0, the matching logic of the 'dependent' node is:

1. Find the `ProcessInstance` of the dependent task through `findLastProcessInterval`.
   - The search order is:
     - a. Query `ProcessInstance` through `queryLastSchedulerProcessInterval`, the condition is `schedule_time >= #{startTime} and schedule_time <= #{endTime}`
     - b. Query `ProcessInstance` through `queryLastManualProcessInterval`, the condition is `start_time >= #{startTime} and start_time <= #{endTime}`

   https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116
2. Match dependent tasks through `findValidTaskListByProcessId`.

   https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166

In the `dev` version, the matching logic of the 'dependent' node is:

1. Find `TaskInstance` through `queryLastTaskInstanceIntervalByTaskCode`, the condition is `start_time >= #{startTime} and start_time <= #{endTime}`

   https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255

In the scenario I am in (bank, data warehouse), the 'business date' or 'data date' is particularly important, because for data processing the key point of a dependency is 'which day's data has been processed', not 'on which day the data processing was completed'. The users of the data include regulatory submissions, reports, and many management systems. They are very sensitive to the business time the data belongs to, so professional scheduling tools in the banking industry generally consider the business date of the data rather than the execution time of the task.

Furthermore, there are many bank-related business systems. In addition to direct business systems, there are also management systems, third-party data sources, and third-party custody systems. The data as of the end of last month will most likely not be available until the 2nd or 3rd of the following month. For example, the financial system will delay payments and adjustments for several days. This situation is very common and routine. Currently, the data warehouse I am responsible for has more than 40 upstream systems or data sources, and at least 5 of them may be delayed. There are hundreds of data processing tasks that will be postponed due to upstream delays.

Therefore, in an environment that is very sensitive to business dates and has possible delays, the `3.2.0` and `dev` versions have the following problems:

1. For the `3.2.0` version, assume that the task `T1` that the 'dependent' node (today) depends on has not been executed on the same day, and I then run a `complement` operation on `T1` for a date 3 days ago. This triggers step `1.b` of the `3.2.0` logic above, causing the dependency to be unexpectedly detected as successful.
2. For the `dev` version, assume that the upstream data delivery is delayed by 1 day, and the dependent task `T1` is not completed until the next day. The 'dependent' node, including its downstream nodes, will never succeed, because there will never be a new task whose `startTime` falls on the previous day.

---

The dependency logic of the dependent node in the current implementation has two types of datetime:

1. Schedule time
2. Execution time (or start time)

However, there is a logical issue with the `dependent` node using `processInstance.scheduleTime` as the date base to match `startTime`.

In version `3.2.0`, the matching logic of the 'dependent' node is:

1. Find the dependent task's `ProcessInstance` by calling `findLastProcessInterval`.
   - The search order is:
     - a. Query `ProcessInstance` by calling `queryLastSchedulerProcessInterval` with the condition `schedule_time >= #{startTime} and schedule_time <= #{endTime}`.
     - b. Query `ProcessInstance` by calling `queryLastManualProcessInterval` with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116
2. Match the dependent tasks by calling `findValidTaskListByProcessId`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166

In the `dev` version, the matching logic of the 'dependent' node is:

1. Find `TaskInstance` by calling `queryLastTaskInstanceIntervalByTaskCode` with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.

   Reference link: https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255

In my scenario (banking industry, data warehouse), the 'business date' or 'data date' is particularly important because for data processing, the focus of a dependency lies in 'which day's data has been processed', rather than 'when was the data processed'. The users of the data include regulatory reporting, reports, and many management systems, which are highly sensitive to the business time of the data. Therefore, professional scheduling tools in the banking industry mainly consider the business date of the data, rather than the execution time of tasks.

Furthermore, there are many banking-related business systems, including direct business systems, management systems, third-party data sources, and third-party hosting systems. The data as of the end of last month may not be available until the 2nd or 3rd day of the following month, for example because financial system reconciliation is delayed for several days. This situation is very common. Currently, I am responsible for more than 40 upstream systems or data sources, and there may be delays in at least 5 of them. As a result, there are hundreds of data processing tasks that may be delayed due to upstream delays.

Therefore, in an environment where the business date is highly sensitive and there may be delays, there are problems with versions `3.2.0` and `dev`:

1. For version `3.2.0`, assuming that the task `T1` that the 'dependent' node (today) depends on has not been executed on that day, if I perform a `complement` operation on it for a date three days ago, this will trigger step `1.b` of the above logic in version `3.2.0`, causing the dependency to be unexpectedly detected as successful.
2. For the `dev` version, assuming that the upstream data is delayed by one day and the dependent task `T1` is completed on the next day, the 'dependent' node and its downstream nodes will never succeed, because there will never be a new task whose `startTime` falls on the previous day.

### What you expected to happen

First, is it reasonable to match the `scheduleTime` of the workflow to the `startTime` of the task?

Secondly, if there are indeed different needs, is it possible to add an option to specify whether the dependent node detects the `scheduling time` or the `actual start time`?

---

Firstly, is it reasonable to match the `startTime` of a task with the `scheduleTime`?

Secondly, if there are indeed different requirements, can an option be added to specify whether the dependent node detects the `scheduleTime` or the actual `startTime`?

### How to reproduce

Please refer to the above

---

Please refer to the above content.

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
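The two failure cases described under "What happened" can be reproduced with a small in-memory model. The following is only a Python sketch, not DolphinScheduler code: the `Run` class, the helper names, and the dates are all hypothetical, and it models only the interval conditions quoted in the issue (the real mapper queries may apply further filters and a different "last" ordering).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a process/task instance row."""
    schedule_time: Optional[datetime]  # business date the run was scheduled for
    start_time: datetime               # wall-clock time the run actually started


def latest_in_interval(runs: List[Run], field: str,
                       start: datetime, end: datetime) -> Optional[Run]:
    """Return the most recently started run whose `field` lies in [start, end]."""
    hits = [r for r in runs
            if getattr(r, field) is not None
            and start <= getattr(r, field) <= end]
    # Assumption: "last" means the greatest start_time among the matches.
    return max(hits, key=lambda r: r.start_time) if hits else None


def match_3_2_0(runs: List[Run], start: datetime, end: datetime) -> Optional[Run]:
    """Steps 1.a then 1.b as quoted: schedule_time first, then start_time."""
    found = latest_in_interval(runs, "schedule_time", start, end)
    if found is None:
        found = latest_in_interval(runs, "start_time", start, end)
    return found


def match_dev(runs: List[Run], start: datetime, end: datetime) -> Optional[Run]:
    """The dev condition as quoted: start_time only."""
    return latest_in_interval(runs, "start_time", start, end)


def day(y: int, m: int, d: int):
    """Closed interval covering one calendar day."""
    return datetime(y, m, d), datetime(y, m, d, 23, 59, 59)


# Case 1 (3.2.0): a complement for a date 3 days ago is started today.
complement_run = Run(schedule_time=datetime(2023, 11, 4),
                     start_time=datetime(2023, 11, 7, 10, 0))
print(match_3_2_0([complement_run], *day(2023, 11, 7)))  # matched via the 1.b fallback

# Case 2 (dev): the run for business date 11-06 only starts on 11-07.
delayed_run = Run(schedule_time=datetime(2023, 11, 6),
                  start_time=datetime(2023, 11, 7, 9, 0))
print(match_dev([delayed_run], *day(2023, 11, 6)))  # no match on start_time
print(latest_in_interval([delayed_run], "schedule_time", *day(2023, 11, 6)))
```

In this model, matching on `schedule_time` alone rejects the complement run in case 1 and accepts the delayed run in case 2, which is the behaviour the requested option would make selectable.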
