Re: [I] [Bug] [Dependent] The date rules of the dependent node are ambiguous. [dolphinscheduler]

via GitHub Mon, 06 Nov 2023 21:29:37 -0800


github-actions[bot] commented on issue #15129:
URL: 
https://github.com/apache/dolphinscheduler/issues/15129#issuecomment-1797849272

### Search before asking

- [X] I had searched in the
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and
found no similar issues.

### What happened

Currently there are two types of datetimes in execution:
1. Scheduling time
2. Execution time (or start time)

But there is currently a logical problem, that is, the `dependent` node uses
`processInstance.scheduleTime` as the date base to match `startTime`.

In version 3.2.0, the matching logic of 'dependent' nodes is:

1. Find the `ProcessInstance` of the dependent task through
   `findLastProcessInterval`
   - The search order is:
     - a. Query `ProcessInstance` through `queryLastSchedulerProcessInterval`,
       the condition is `schedule_time >= #{startTime} and schedule_time <= #{endTime}`
     - b. Query `ProcessInstance` through `queryLastManualProcessInterval`, the
       condition is `start_time >= #{startTime} and start_time <= #{endTime}`

https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116

2. Match dependent tasks through `findValidTaskListByProcessId`

In the `dev` version, the matching logic of the 'dependent' node is:
1. Find TaskInstance through `queryLastTaskInstanceIntervalByTaskCode`, the
condition is `start_time >= #{startTime} and start_time <= #{endTime}`

https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255

In my scenario (banking, data warehouse), the 'business date' or 'data
date' is particularly important, because for data processing the key point of
a dependency is 'which day's data has been processed', not 'on which day the
processing was completed'. The consumers of the data include regulatory
submissions, reports, and many management systems, all of which are very
sensitive to the business date the data belongs to. Professional scheduling
tools in the banking industry therefore generally work from the business date
of the data rather than the execution time of the task.

Furthermore, there are many bank-related business systems. In addition to
direct business systems, there are also management systems, third-party data
sources, and third-party custody systems. Month-end data often only becomes
available on the 2nd or 3rd of the following month; for example, the
financial system's reconciliation and adjustments lag by several days. This
situation is very common and routine. Currently, the data warehouse I am
responsible for has more than 40 upstream systems or data sources, and at least
5 of them may be delayed. Hundreds of data processing tasks can be postponed
because of these upstream delays.

Therefore, in an environment that is very sensitive to business dates and
has possible delays, there are the following problems for the `3.2.0` and `dev`
versions:
1. For the `3.2.0` version, assume that the task `T1` that the 'dependent'
node (today) depends on has not been executed today. If I then run a
`complement` operation on `T1` for the date 3 days ago, this triggers step
`1.b` of the `3.2.0` logic above, and the dependency is unexpectedly detected
as successful.
2. For the `dev` version, assume that the upstream data delivery is
delayed by 1 day, so the dependent task `T1` is not completed until the next
day. The 'dependent' node and its downstream nodes will never succeed,
because there will never be a new task whose `startTime` falls on the
previous day.

-------------

The dependency logic of the dependent node in the current implementation has
two types of datetime:
1. Schedule time
2. Execution time (or start time)

However, there is a logical issue with the `dependent` node using
`processInstance.scheduleTime` as the date base to match `startTime`.

In version `3.2.0`, the matching logic of the 'dependent' node is:

1. Find the dependent task's `ProcessInstance` by calling
   `findLastProcessInterval`.
   - The search order is:
     - a. Query `ProcessInstance` by calling
       `queryLastSchedulerProcessInterval` with the condition
       `schedule_time >= #{startTime} and schedule_time <= #{endTime}`.
     - b. Query `ProcessInstance` by calling
       `queryLastManualProcessInterval` with the condition
       `start_time >= #{startTime} and start_time <= #{endTime}`.

Reference link:
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L115-L116

2. Match the dependent tasks by calling `findValidTaskListByProcessId`.
Reference link:
https://github.com/apache/dolphinscheduler/blob/e648d6d2adede44c11711c313b5f27d4474961c5/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L158-L166
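
The search order in step 1 can be sketched roughly as follows. This is a simplified Python model, not the project's Java code: the `ProcessInstance` dataclass, its fields, and the two helper functions are hypothetical stand-ins for the queries named above, and both the "try 1.a, then fall back to 1.b" rule and the "latest by time" ordering are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ProcessInstance:
    # Hypothetical, simplified stand-in for the real entity.
    name: str
    schedule_time: Optional[datetime]  # scheduling (business) time, None if unset
    start_time: datetime               # wall-clock time the run actually started


def query_last_by_schedule_time(instances: List[ProcessInstance],
                                start: datetime, end: datetime):
    """Models 1.a: schedule_time >= start and schedule_time <= end."""
    hits = [p for p in instances
            if p.schedule_time is not None and start <= p.schedule_time <= end]
    return max(hits, key=lambda p: p.schedule_time, default=None)


def query_last_by_start_time(instances: List[ProcessInstance],
                             start: datetime, end: datetime):
    """Models 1.b: start_time >= start and start_time <= end."""
    hits = [p for p in instances if start <= p.start_time <= end]
    return max(hits, key=lambda p: p.start_time, default=None)


def find_last_process_interval(instances: List[ProcessInstance],
                               start: datetime, end: datetime):
    # Assumed combination rule: use the 1.a result if any, else fall back to 1.b.
    found = query_last_by_schedule_time(instances, start, end)
    if found is None:
        found = query_last_by_start_time(instances, start, end)
    return found
```

The point of the sketch is that 1.a and 1.b compare the same `[startTime, endTime]` window against two different columns, so a run can satisfy the lookup on its start time even when its schedule time lies outside the window.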

In the `dev` version, the matching logic of the 'dependent' node is:
1. Find TaskInstance by calling `queryLastTaskInstanceIntervalByTaskCode`
with the condition `start_time >= #{startTime} and start_time <= #{endTime}`.
Reference link:
https://github.com/apache/dolphinscheduler/blob/d675d32771f89a0ad09470e247469b504c6666fe/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/utils/DependentExecute.java#L254-L255

In my scenario (banking industry, data warehouse), the 'business date' or
'data date' is particularly important because for data processing, the focus of
dependency lies in 'which day's data has been processed', rather than 'when was
the data processed'. The users of data include regulatory reporting, reports,
and many management systems, which are highly sensitive to the business time of
the data. Therefore, professional scheduling tools in the banking industry
mainly consider the business date of the data, rather than the execution time
of tasks.

Furthermore, there are many banking-related business systems, including
direct business systems, management systems, third-party data sources, and
third-party hosting systems. The data as of the end of last month may not be
available until the 2nd or 3rd day of the next month, for example because
financial system reconciliation is delayed for several days. This situation is
very common. Currently, the data warehouse I am responsible for has more than
40 upstream systems or data sources, and at least 5 of them may be delayed. As
a result, hundreds of data processing tasks may be delayed by upstream delays.

Therefore, in an environment where the business date is highly sensitive and
there may be delays, there are problems with versions `3.2.0` and `dev`:
1. For version `3.2.0`, assume that the task `T1` that the 'dependent' node
(today) depends on has not been executed today. If I perform a `complement`
operation on `T1` for a date three days ago, the complement run starts today,
so it triggers step `1.b` of the `3.2.0` logic above and the dependency is
unexpectedly detected as successful.
2. For the `dev` version, assume that the upstream data is delayed by one
day, so the dependent task `T1` is only completed on the second day. The
'dependent' node and its downstream nodes will never succeed, because there
will never be a new task whose `startTime` falls on the previous day.
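
Both problems can be reproduced with a few lines of date arithmetic. The sketch below is only an illustration in Python: the dates, the dictionary layout, and the `in_window` helper are made up, and the `3.2.0` behaviour is reduced to "match on schedule time, else on start time" as described above.

```python
from datetime import datetime

# The 'dependent' node runs for business date 2023-11-06 and checks "today".
day_start = datetime(2023, 11, 6, 0, 0, 0)
day_end = datetime(2023, 11, 6, 23, 59, 59)


def in_window(t: datetime) -> bool:
    """True if t lies inside the dependency window [day_start, day_end]."""
    return day_start <= t <= day_end


# Problem 1 (3.2.0): a complement run of T1 for 2023-11-03, started today.
complement_run = {
    "schedule_time": datetime(2023, 11, 3),
    "start_time": datetime(2023, 11, 6, 10, 0),
}
passes_320 = (in_window(complement_run["schedule_time"])
              or in_window(complement_run["start_time"]))

# Problem 2 (dev): T1 for business date 2023-11-06 only starts a day late.
late_run = {
    "schedule_time": datetime(2023, 11, 6),
    "start_time": datetime(2023, 11, 7, 9, 0),
}
passes_dev = in_window(late_run["start_time"])
would_pass_on_schedule_time = in_window(late_run["schedule_time"])

print("3.2.0 accepts the complement run for the wrong date:", passes_320)
print("dev accepts the late run:", passes_dev)
print("late run matched on schedule time instead:", would_pass_on_schedule_time)
```

In this model the complement run is accepted only because of its start time, while the late run is rejected even though its schedule time is exactly the date the 'dependent' node is waiting for.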

### What you expected to happen

First, is it reasonable to match the `scheduleTime` of the workflow to the
`startTime` of the task?

Secondly, if there are indeed different needs, is it possible to add an
option to specify whether the dependent node detects the `scheduling time` or
the `actual start time`?

---

Firstly, is it reasonable to match the `startTime` of a task with the
`scheduleTime`?

Secondly, if there are indeed different requirements, can an option be added
to specify whether the dependent node detects the `scheduleTime` or the actual
`startTime`?
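
As a rough idea of what such an option could look like, here is a minimal Python sketch. Nothing in it exists in DolphinScheduler: `DateBasis`, `Run`, and `dependency_satisfied` are hypothetical names used only to show a per-node switch between the two date bases.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List


class DateBasis(Enum):
    # Hypothetical option on the 'dependent' node.
    SCHEDULE_TIME = "schedule_time"  # match on the business / scheduling date
    START_TIME = "start_time"        # match on the actual start time


@dataclass
class Run:
    # Hypothetical, simplified record of one upstream run.
    schedule_time: datetime
    start_time: datetime


def dependency_satisfied(runs: List[Run], start: datetime, end: datetime,
                         basis: DateBasis) -> bool:
    """True if any run falls inside [start, end] on the chosen date basis."""
    return any(start <= getattr(run, basis.value) <= end for run in runs)


# Usage: a run for business date 2023-11-06 that started one day late.
runs = [Run(schedule_time=datetime(2023, 11, 6),
            start_time=datetime(2023, 11, 7, 9, 0))]
window = (datetime(2023, 11, 6), datetime(2023, 11, 6, 23, 59, 59))
by_schedule = dependency_satisfied(runs, *window, DateBasis.SCHEDULE_TIME)
by_start = dependency_satisfied(runs, *window, DateBasis.START_TIME)
```

With a switch like this, business-date-sensitive workflows could choose `SCHEDULE_TIME`, while workflows that care about when a task actually ran could keep `START_TIME`.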

### How to reproduce

Please refer to the above

---

Please refer to the above content.

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of
Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
