github-actions[bot] commented on issue #15129:
URL: 
https://github.com/apache/dolphinscheduler/issues/15129#issuecomment-1797849272

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   Currently there are two types of datetimes in execution:
   1. Scheduling time
   2. Execution time (or start time)
   
   But there is currently a logical problem, that is, the `dependent` node uses 
`processInstance.scheduleTime` as the date base to match `startTime`.
   
   In version 3.2.0, the matching logic of 'dependent' nodes is:
   
   
   1. Find the `ProcessInstance` of the dependent task through 
`findLastProcessInterval`
      - The search order is:
        - a. Query `ProcessInstance` through 
`queryLastSchedulerProcessInterval`, the condition is `schedule_time >= 
#{startTime} and schedule_time <= #{endTime}`
        - b. Query `ProcessInstance` through `queryLastManualProcessInterval`, 
the condition is `start_time >= #{startTime} and start_time <= #{endTime}`
   
   
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116
   
   2. Match dependent tasks through `findValidTaskListByProcessId`
   
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166
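
To make the `3.2.0` lookup order concrete, here is a minimal in-memory sketch of 
steps a and b as described above. It is not the DolphinScheduler code: the 
`Instance` class, `findLast`, and the sample dates are hypothetical, only the 
two range conditions quoted above are modelled (any other filters in the real 
queries are omitted), and the "a first, then b" fallback is simply one reading 
of the search order given in this issue.

```java
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LookupOrderSketch {

    /** Hypothetical stand-in for ProcessInstance; only the two timestamps matter here. */
    static final class Instance {
        final String label;
        final LocalDateTime scheduleTime; // date the run was scheduled for
        final LocalDateTime startTime;    // wall-clock time the run actually started

        Instance(String label, LocalDateTime scheduleTime, LocalDateTime startTime) {
            this.label = label;
            this.scheduleTime = scheduleTime;
            this.startTime = startTime;
        }
    }

    /** Inclusive range check, mirroring `x >= #{startTime} and x <= #{endTime}`. */
    static boolean inWindow(LocalDateTime t, LocalDateTime start, LocalDateTime end) {
        return t != null && !t.isBefore(start) && !t.isAfter(end);
    }

    /** Step a (match on schedule time), then step b (match on start time) as a fallback. */
    static Optional<Instance> findLast(List<Instance> all, LocalDateTime start, LocalDateTime end) {
        Optional<Instance> bySchedule = all.stream()
                .filter(i -> inWindow(i.scheduleTime, start, end))
                .max(Comparator.comparing((Instance i) -> i.scheduleTime));
        if (bySchedule.isPresent()) {
            return bySchedule;
        }
        return all.stream()
                .filter(i -> inWindow(i.startTime, start, end))
                .max(Comparator.comparing((Instance i) -> i.startTime));
    }

    public static void main(String[] args) {
        // Hypothetical window: the whole of 2023-11-07.
        LocalDateTime dayStart = LocalDateTime.of(2023, 11, 7, 0, 0);
        LocalDateTime dayEnd = LocalDateTime.of(2023, 11, 7, 23, 59, 59);
        Instance scheduled = new Instance("scheduled",
                LocalDateTime.of(2023, 11, 7, 1, 0), LocalDateTime.of(2023, 11, 7, 1, 0, 5));
        System.out.println(findLast(List.of(scheduled), dayStart, dayEnd)
                .map(i -> i.label).orElse("none"));
    }
}
```

Note that step b looks only at `startTime`, so any run that started inside the 
window can be picked up, whatever date it was scheduled for.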
   
   
   
   In the `dev` version, the matching logic of the 'dependent' node is:
   1. Find TaskInstance through `queryLastTaskInstanceIntervalByTaskCode`, the 
condition is `start_time >= #{startTime} and start_time <= #{endTime}`
   
https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255
   
   In my scenario (bank, data warehouse), the 'business date' or 'data date' 
is particularly important, because for data processing the key point of a 
dependency is 'which day's data has been processed', not 'on which day the 
processing was completed'. The consumers of the data include regulatory 
submissions, reports, and many management systems, all of which are very 
sensitive to the business date of the data. Professional scheduling tools in 
the banking industry therefore mainly consider the business date of the data, 
rather than the execution time of the task.
   
   Furthermore, there are many bank-related business systems. In addition to 
direct business systems, there are also management systems, third-party data 
sources, and third-party custody systems. Data as of the end of a month will 
most likely only be available on the 2nd or 3rd of the following month; for 
example, reconciliation and adjustments in the financial system take several 
days. This situation is very common and routine. Currently, the data warehouse 
I am responsible for has more than 40 upstream systems or data sources, and at 
least 5 of them may be delayed. Hundreds of data processing tasks may be 
postponed due to upstream delays.
   
   Therefore, in an environment that is very sensitive to business dates and 
has possible delays, there are the following problems for the `3.2.0` and `dev` 
versions:
   1. For the `3.2.0` version, assume that the task `T1` that the 'dependent' 
node (today) depends on has not been executed today. If I then run a 
`complement` operation on `T1` for a date 3 days ago, that run starts today, so 
it matches step `1.b` of the `3.2.0` logic above and the dependency is 
unexpectedly reported as satisfied.
   2. For the `dev` version, assume that the upstream data delivery is delayed 
by 1 day, so the dependent task `T1` is not completed until the next day. The 
'dependent' node and its downstream nodes will never succeed, because no new 
task will ever have a `startTime` that falls on the previous day.
   
   -------------
   
   The dependency logic of the dependent node in the current implementation has 
two types of datetime:
   1. Schedule time
   2. Execution time (or start time)
   
   However, there is a logical issue with the `dependent` node using 
`processInstance.scheduleTime` as the date base to match `startTime`.
   
   In version `3.2.0`, the matching logic of the 'dependent' node is:
   
   1. Find the dependent task's `ProcessInstance` by calling 
`findLastProcessInterval`.
      - The search order is:
        - a. Query `ProcessInstance` by calling 
`queryLastSchedulerProcessInterval` with the condition `schedule_time >= 
#{startTime} and schedule_time <= #{endTime}`.
        - b. Query `ProcessInstance` by calling 
`queryLastManualProcessInterval` with the condition `start_time >= #{startTime} 
and start_time <= #{endTime}`.
   
      Reference link: 
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116
   
   2. Match the dependent tasks by calling `findValidTaskListByProcessId`.
   Reference link: 
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166
   
   In the `dev` version, the matching logic of the 'dependent' node is:
   1. Find TaskInstance by calling `queryLastTaskInstanceIntervalByTaskCode` 
with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.
   Reference link: 
https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255
   
   In my scenario (banking industry, data warehouse), the 'business date' or 
'data date' is particularly important because for data processing, the focus of 
dependency lies in 'which day's data has been processed', rather than 'when was 
the data processed'. The users of data include regulatory reporting, reports, 
and many management systems, which are highly sensitive to the business time of 
the data. Therefore, professional scheduling tools in the banking industry 
mainly consider the business date of the data, rather than the execution time 
of tasks.
   
   Furthermore, there are many banking-related business systems, including 
direct business systems, management systems, third-party data sources, and 
third-party custody systems. The data as of the end of a month may not be 
available until the 2nd or 3rd day of the following month, for example because 
financial system reconciliation takes several days. This situation is very 
common. Currently, the data warehouse I am responsible for has more than 40 
upstream systems or data sources, and at least 5 of them may be delayed. As a 
result, hundreds of data processing tasks may be delayed due to upstream delays.
   
   Therefore, in an environment where the business date is highly sensitive and 
there may be delays, there are problems with versions `3.2.0` and `dev`:
   1. For version `3.2.0`, assuming that the task `T1` that the 'dependent' 
node (today) depends on has not been executed today, if I perform a 
`complement` operation on `T1` for a date three days ago, the run starts today 
and therefore matches step `1.b` of the `3.2.0` logic above, so the dependency 
is unexpectedly detected as successful.
   2. For the `dev` version, assuming that the upstream data is delayed by one 
day and the dependent task `T1` is only completed on the second day, the 
'dependent' node and its downstream nodes will never succeed, because no new 
task will ever have a `startTime` that falls on the previous day.
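
The two cases can be checked with a small simulation. This is only a sketch 
under simplifying assumptions: the dependency window is a single calendar day, 
the class and method names are hypothetical, the dates are made up, and the 
predicates encode the conditions as described in this issue rather than the 
real queries.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

public class DependentWindowScenarios {

    /** True when the timestamp falls on the given calendar day. */
    static boolean inDay(LocalDateTime t, LocalDate day) {
        return t != null && t.toLocalDate().equals(day);
    }

    /** 3.2.0 as described: match on schedule time, otherwise on start time. */
    static boolean satisfied320(LocalDateTime scheduleTime, LocalDateTime startTime, LocalDate day) {
        return inDay(scheduleTime, day) || inDay(startTime, day);
    }

    /** dev as described: match on the task's start time only. */
    static boolean satisfiedDev(LocalDateTime startTime, LocalDate day) {
        return inDay(startTime, day);
    }

    /** Business-date matching: only the schedule time is considered. */
    static boolean satisfiedByBusinessDate(LocalDateTime scheduleTime, LocalDate day) {
        return inDay(scheduleTime, day);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2023, 11, 7);

        // Case 1: complement run of T1 for three days ago, started today.
        LocalDateTime complementSchedule = LocalDateTime.of(2023, 11, 4, 0, 0);
        LocalDateTime complementStart = LocalDateTime.of(2023, 11, 7, 10, 0);
        System.out.println("case 1, 3.2.0 logic:   "
                + satisfied320(complementSchedule, complementStart, today));
        System.out.println("case 1, business date: "
                + satisfiedByBusinessDate(complementSchedule, today));

        // Case 2: T1 for business date 2023-11-06 only starts on 2023-11-07.
        LocalDate businessDay = LocalDate.of(2023, 11, 6);
        LocalDateTime delayedSchedule = LocalDateTime.of(2023, 11, 6, 0, 0);
        LocalDateTime delayedStart = LocalDateTime.of(2023, 11, 7, 9, 0);
        System.out.println("case 2, dev logic:     "
                + satisfiedDev(delayedStart, businessDay));
        System.out.println("case 2, business date: "
                + satisfiedByBusinessDate(delayedSchedule, businessDay));
    }
}
```

In this model the `3.2.0` predicate accepts the complement run in case 1 and 
the `dev` predicate rejects the delayed run in case 2, while matching on the 
schedule time gives the result the business date would suggest in both cases.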
   
   ### What you expected to happen
   
   First, is it reasonable to match the `scheduleTime` of the workflow to the 
`startTime` of the task?
   
   Secondly, if there are indeed different needs, is it possible to add an 
option to specify whether the dependent node detects the `scheduling time` or 
the `actual start time`?
   
   ---
   
   Firstly, is it reasonable to match the `startTime` of a task with the 
`scheduleTime`?
   
   Secondly, if there are indeed different requirements, can an option be added 
to specify whether the dependent node detects the `scheduleTime` or the actual 
`startTime`?
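
As a rough illustration of what such an option could look like, the sketch 
below selects the timestamp to compare based on a per-node setting. Everything 
in it is hypothetical: `DateBasis`, `timestampFor`, and `matches` are names 
invented for this example and do not exist in DolphinScheduler.

```java
import java.time.LocalDateTime;

public class DependentDateBasisSketch {

    /** Hypothetical per-node option: which timestamp the dependent node checks. */
    enum DateBasis { SCHEDULE_TIME, START_TIME }

    /** Picks the timestamp of the upstream run that the chosen basis refers to. */
    static LocalDateTime timestampFor(DateBasis basis, LocalDateTime scheduleTime,
                                      LocalDateTime startTime) {
        switch (basis) {
            case SCHEDULE_TIME:
                return scheduleTime;
            case START_TIME:
                return startTime;
            default:
                throw new IllegalArgumentException("Unknown basis: " + basis);
        }
    }

    /** Inclusive window check against the timestamp selected by the basis. */
    static boolean matches(DateBasis basis, LocalDateTime scheduleTime, LocalDateTime startTime,
                           LocalDateTime windowStart, LocalDateTime windowEnd) {
        LocalDateTime t = timestampFor(basis, scheduleTime, startTime);
        return t != null && !t.isBefore(windowStart) && !t.isAfter(windowEnd);
    }

    public static void main(String[] args) {
        // Hypothetical run: scheduled for 2023-11-06, actually started on 2023-11-07.
        LocalDateTime scheduleTime = LocalDateTime.of(2023, 11, 6, 0, 0);
        LocalDateTime startTime = LocalDateTime.of(2023, 11, 7, 9, 0);
        LocalDateTime windowStart = LocalDateTime.of(2023, 11, 6, 0, 0);
        LocalDateTime windowEnd = LocalDateTime.of(2023, 11, 6, 23, 59, 59);
        for (DateBasis basis : DateBasis.values()) {
            System.out.println(basis + ": "
                    + matches(basis, scheduleTime, startTime, windowStart, windowEnd));
        }
    }
}
```

With a setting like this, users who care about the business date could choose 
`SCHEDULE_TIME`, while users who rely on the current behaviour could keep 
`START_TIME`.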
   
   ### How to reproduce
   
   Please refer to the above
   
   ---
   
   Please refer to the above content.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
