liutang123 opened a new issue #4854:
URL: https://github.com/apache/incubator-doris/issues/4854
Problem
--
Currently, the FE schedules fragments from root to leaf. However, a delay in
scheduling one fragment sometimes delays the scheduling of the fragments after
it, and the problem becomes more serious when there are slow nodes.
Some logs are as follows:
Fragment 1's instance log (instance started at **12:12:42.493897**):
```
I1026 12:12:42.493897 160919 internal_service.cpp:150] exec plan fragment,
fragment_instance_id=e048deb04c074e70-aaf0a5afc4095fa2,
coord=TNetworkAddress(hostname=10.49.135.23, port=9020), backend=6
EXCHANGE_NODE (id=10):(Active: 4s398ms, % non-child: 73.44%)
- FirstBatchArrivalWaitTime: 4s398ms
```
Fragment 2's instance log (instance started at **12:12:45.642977**):
```
I1026 12:12:45.642977 226269 internal_service.cpp:150] exec plan fragment,
fragment_instance_id=e048deb04c074e70-aaf0a5afc4095f66,
coord=TNetworkAddress(hostname=10.49.135.23, port=9020), backend=146
EXCHANGE_NODE (id=8):(Active: 2s790ms, % non-child: 46.59%)
- FirstBatchArrivalWaitTime: 2s790ms
```
Fragment 4's instance log (instance started at **12:12:46.598142**):
```
I1026 12:12:46.598142 153965 internal_service.cpp:150] exec plan fragment,
fragment_instance_id=e048deb04c074e70-aaf0a5afc4095edb,
coord=TNetworkAddress(hostname=10.49.135.23, port=9020), backend=209
EXCHANGE_NODE (id=6):(Active: 1s762ms, % non-child: 29.43%)
- DataArrivalWaitTime: 1s762ms
```
Fragment 5's instance log (instance started at **12:12:47.194581**):
```
I1026 12:12:47.194581 8638 internal_service.cpp:150] exec plan fragment,
fragment_instance_id=e048deb04c074e70-aaf0a5afc4095eb9,
coord=TNetworkAddress(hostname=10.49.135.23, port=9020), backend=375
Instance e048deb04c074e70-aaf0a5afc4095eb9
(host=TNetworkAddress(hostname:10.16.173.46, port:9060)):(Active: 1s156ms, %
non-child: 0.04%)
```
We can see that the leaf fragment's instance runs for only about 1 second,
but because of slow instances in the front fragments, its start-up is delayed
by about 5 seconds.
Proposal
--
**Option 1**
Schedule fragments by DAG dependence.
If a query's fragments form a tree like the following:
```
           Fragment 0
          /          \
   Fragment 1    Fragment 2
   /        \
Fragment 3   Fragment 4
```
We first schedule Fragment 0, then Fragment 1 and Fragment 2, and finally
Fragment 3 and Fragment 4.
**advantage**: This option does not require an additional RPC.
**disadvantage**: This only mitigates the problem; it is not a fundamental
solution to the scheduling problem.
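A minimal sketch of this dispatch order (the real scheduling lives in the FE's Java coordinator; the `Fragment` struct and `send_exec_rpc` below are hypothetical stand-ins used only to illustrate the breadth-first traversal):
```
#include <cstdio>
#include <queue>
#include <vector>

// Hypothetical fragment node; "children" are the fragments that feed
// data into this one (one level closer to the leaves).
struct Fragment {
    int id;
    std::vector<Fragment*> children;
};

// Stand-in for the asynchronous exec_plan_fragment RPC.
void send_exec_rpc(const Fragment* f) {
    std::printf("exec fragment %d\n", f->id);
}

// Breadth-first dispatch: a fragment is sent only after the fragment
// that consumes its output has been sent, and siblings on the same
// level are dispatched back to back.
void schedule_by_dag(Fragment* root) {
    std::queue<Fragment*> q;
    q.push(root);
    while (!q.empty()) {
        Fragment* f = q.front();
        q.pop();
        send_exec_rpc(f);  // fire-and-forget; do not wait for the reply
        for (Fragment* c : f->children) q.push(c);
    }
}

int main() {
    Fragment f3{3, {}}, f4{4, {}};
    Fragment f1{1, {&f3, &f4}}, f2{2, {}};
    Fragment f0{0, {&f1, &f2}};
    schedule_by_dag(&f0);  // dispatches fragments in order 0 1 2 3 4
}
```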
**Option 2**
Add a `prepare_fragment` RPC.
Starting a fragment consists of two steps:
1. Prepare: create and register `DataStreamRecvr` in `DataStreamMgr`.
2. Start: submit `FragmentMgr::exec_actual` to `FragmentMgr::_thread_pool`.
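A self-contained sketch of this two-phase start-up (the `StreamMgr` and `Receiver` names and signatures below are illustrative stand-ins for `DataStreamMgr` and `DataStreamRecvr`, not the real Doris API):
```
#include <cstdio>
#include <map>
#include <memory>
#include <string>

// Stand-in for DataStreamRecvr: would buffer batches from upstream senders.
struct Receiver {
    std::string instance_id;
};

// Stand-in for DataStreamMgr.
class StreamMgr {
public:
    // Phase 1, driven by the new prepare_fragment RPC: register the
    // receiver so upstream instances can deliver data immediately,
    // even though this fragment has not started executing yet.
    void create_recvr(const std::string& instance_id) {
        _recvrs[instance_id] = std::make_unique<Receiver>(Receiver{instance_id});
        std::printf("prepared recvr for %s\n", instance_id.c_str());
    }

    // Phase 2, driven by the existing exec RPC: the executing fragment
    // looks its receiver up and starts consuming.
    Receiver* find_recvr(const std::string& instance_id) {
        auto it = _recvrs.find(instance_id);
        return it == _recvrs.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::unique_ptr<Receiver>> _recvrs;
};

int main() {
    StreamMgr mgr;
    mgr.create_recvr("instance-1");  // prepare: receiver exists up front
    // start: only now would exec_actual be submitted to the thread pool
    if (mgr.find_recvr("instance-1") != nullptr) {
        std::printf("start: submit execution to the thread pool\n");
    }
}
```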
**advantage**: Both the prepare and start phases are scheduled concurrently.
**disadvantage**:
- Adds one more RPC.
- Fragment scheduling may still be affected by slow nodes:
```
           Fragment 0
          /          \
   Fragment 1    Fragment 2
   /        \
Fragment 3   Fragment 4
```
Fragment 1 should be able to start executing without being affected by
fragment 2, but in this solution fragment 1 cannot execute until all fragment
instances are prepared.
**Option 3**
Add a `prepare_fragment` RPC as in **option 2**, but start fragment instances
by DAG dependence.
```
           Fragment 0
          /          \
   Fragment 1    Fragment 2
   /        \
Fragment 3   Fragment 4
```
Step 1: prepare all fragment instances concurrently.
Step 2: start a fragment when its front fragments are prepared. For example,
we start fragment 1 when all of fragment 0's instances are prepared.
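A self-contained sketch of the bookkeeping the coordinator could use for step 2 (all names and instance counts below are hypothetical):
```
#include <cstdio>
#include <map>
#include <vector>

// Illustrative coordinator-side state. A fragment's "front" is the
// fragment that consumes its output (its parent in the plan tree).
struct Coordinator {
    // fragment id -> ids of the fragments that feed it (its children)
    std::map<int, std::vector<int>> children;
    // fragment id -> prepare acks still outstanding for its front fragment
    std::map<int, int> pending_front_instances;

    void start_fragment(int id) { std::printf("start fragment %d\n", id); }

    // Called once per prepare_fragment ack (step 1 sent them all out
    // concurrently). A child starts as soon as every instance of its
    // front fragment is prepared, independent of other subtrees.
    void on_instance_prepared(int fragment_id) {
        for (int child : children[fragment_id]) {
            if (--pending_front_instances[child] == 0) start_fragment(child);
        }
    }
};

int main() {
    Coordinator c;
    c.children = {{0, {1, 2}}, {1, {3, 4}}};
    // Say fragment 0 has 2 instances and fragment 1 has 3 instances.
    c.pending_front_instances = {{1, 2}, {2, 2}, {3, 3}, {4, 3}};

    c.start_fragment(0);        // the root has no front: start immediately
    c.on_instance_prepared(0);  // 1st instance of fragment 0 prepared
    c.on_instance_prepared(0);  // 2nd prepared -> fragments 1 and 2 start
}
```
With this, fragment 1 starts as soon as fragment 0 is fully prepared, regardless of how slowly fragment 2's subtree comes up.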