imbajin commented on PR #336:
URL: 
https://github.com/apache/incubator-hugegraph-computer/pull/336#issuecomment-3393058130

   # Code Review: Priority/Elder/Depends Based Scheduling
   
   ## Summary
   This PR 
introduces a 
comprehensive 
scheduler redesign 
with 
priority-based 
scheduling, 
task 
dependencies, 
and cron 
support. 
The architecture 
is 
well-designed,
 but there 
are **critical
  concurrency 
issues** 
that must 
be addressed 
before merging.
   
   **Verdict: 
REQUEST 
CHANGES** ⚠️
   
   ---
   
   ## Architecture Changes
   
   ### Major Refactoring ✅
   The PR 
completely 
redesigns the 
scheduler from 
a simple 
queue-based 
system to 
a sophisticated 
multi-component 
architecture:
   
   **Old 
Architecture:**
   - Simple 
`SpaceQueue` 
(FIFO)
   - Basic 
`Broker` 
for worker 
assignment
   - 
Single-threaded 
dispatch loop
   
   **New 
Architecture:**
   - 
`SchedulerResourceManager`
 - Worker 
pool management
   - 
`SchedulerTaskManager`
 - Task 
lifecycle, 
dependencies, 
priorities
   - 
`SchedulerAlgorithmManager`
 - Scheduling 
algorithm logic
   - 
`SchedulerCronManager`
 - Recurring 
task support
   - 
Event-driven 
scheduling with 
ticker-based 
dispatch
   
   ### New Features:
   1. ✅ 
Priority-based 
scheduling with 
integer priorities
   2. ✅ 
Task dependency 
management 
(`preorders`)
   3. ✅ 
Cron expressions 
for recurring 
tasks
   4. ✅ 
Exclusive/concurrent
 execution modes
   5. ✅ 
Configurable soft 
scheduling
   
   ---
   
   ## Critical Issues 🚨
   
   ### 1. TOCTTOU Race Condition (scheduler_bl.go:421-423)
   **Severity: 
CRITICAL**
   
   ```go
   // TODO: Is here need a lock? TOCTTOU
   if taskInfo.State != structure.TaskStateWaiting {
       logrus.Errorf("task state is not in 'Waiting' state, taskID: 
%v", taskInfo)
       return
   }
   ```
   
   The TODO 
correctly 
identifies a 
Time-Of-Check-Time-Of-Use
 vulnerability. 
State check 
and operations 
are not 
atomic.
   
   **Fix:**
   ```go
   defer taskInfo.Unlock(taskInfo.Lock())
   if taskInfo.State != structure.TaskStateWaiting {
       // ...
   }
   ```
   
   ### 2. Resource Leak: Channel Not Closed (scheduler_bl.go:62)
   **Severity: 
HIGH**
   
   The 
`startChan` 
is never 
closed, 
causing potential 
goroutine leaks 
during 
shutdown.
   
   **Fix:** 
Add cleanup:
   ```go
   func (s *ScheduleBl) Shutdown() {
       close(s.startChan)
   }
   ```
   
   ### 3. Dependency State Validation Missing 
(scheduler_bl.go:234-244)
   **Severity: 
HIGH**
   
   No check 
if dependency 
tasks are 
completed. 
Tasks could 
depend on 
failed/canceled 
tasks.
   
   **Fix:**
   ```go
   if depTask.State != structure.TaskStateComplete {
       return false, fmt.Errorf("dependency task %d is not complete 
(state: %s)", depTaskID, depTask.State)
   }
   ```
   
   ### 4. Potential Deadlock Pattern (scheduler_bl.go:97-99)
   **Severity: 
HIGH**
   
   Multiple functions 
acquire locks 
then call 
`TryScheduleNextTasks`,
 which 
conditionally 
locks. 
This pattern 
is 
deadlock-prone.
   
   **Fix:** 
Use consistent 
locking 
strategy, 
consider 
`sync.RWMutex` 
for 
read-heavy 
operations.
   
   ---
   
   ## High Priority Issues ⚠️
   
   ### 5. Nil Pointer Risk (task_bl.go:87-92)
   If 
`Scheduler.cronManager`
 is nil 
(initialization 
order 
issue), 
this will 
panic:
   ```go
   if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err 
!= nil {
   ```
   Add nil 
checks.
   
   ### 6. API Breaking Change (scheduler_bl.go:502)
   `CloseCurrent` 
signature changed 
from `(taskId 
int32)` to 
`(taskId int32, removeWorkerName 
...string)`. 
Document this 
breaking change 
and verify 
all call 
sites are 
updated.
   
   ### 7. Commented Out Critical Code (scheduler_bl.go:229)
   ```go
   //defer s.Unlock(s.Lock())
   ```
   Critical locking 
code commented 
without explanation 
- this 
is dangerous.
   
   ### 8. Incomplete Implementation (scheduler_bl.go:159)
   ```go
   // TODO: NEED TO JUDGE IF THE TASK CAN CONCURRENTLY RUNNING
   // NOT only by user setting, but also by scheduler setting
   ```
   Important concurrent 
execution logic 
deferred.
   
   ---
   
   ## Medium Priority Issues
   
   ### 9. Missing Input Validation (task_bl.go:71-79)
   No upper 
bound on 
priority 
values. 
Large priorities 
could cause 
unexpected 
behavior.
   
   ### 10. Silent Error Handling (scheduler_bl.go:441-443)
   ```go
   if err := s.taskManager.AddTaskStartSequence(taskInfo.ID); err != 
nil {
       logrus.Errorf("failed to add task '%d' to start sequence: %v", 
taskInfo.ID, err)
   }
   ```
   Errors logged 
but not 
handled.
   
   ---
   
   ## CI/CD Status
   
   ### Failed Checks ❌
   1. 
**check-license-header**
 - Missing 
Apache license 
headers in 
new files
   2. 
**dependency-review**
 - New 
dependencies need 
review
   
   ### Passed Checks ✅
   - CodeQL 
(Go 
& Java)
   - Dependency 
license check
   
   ### Test Coverage ⚠️
   - PR 
mentions tests 
in 
`/test` 
folder but 
not visible 
in diff
   - Need 
unit tests 
for:
     - 
Priority scheduling 
algorithm
     - 
Dependency resolution
     - 
Cron task 
scheduling
     - 
Concurrent execution
   
   ---
   
   ## Recommendations
   
   ### Must Fix Before Merge:
   1. Fix 
TOCTTOU race 
condition with 
atomic state 
checking
   2. Add 
resource cleanup 
(close 
channels on 
shutdown)
   3. Validate 
dependency task 
states
   4. Fix 
failing CI 
checks 
(license 
headers, 
dependency 
review)
   5. Review 
and fix 
all locking 
patterns
   6. Add 
nil pointer 
checks for 
managers
   
   ### Should Fix:
   7. Complete 
TODO 
implementations 
(concurrent 
execution 
logic)
   8. Document 
API breaking 
changes
   9. Add 
comprehensive test 
coverage
   10. Add 
input validation 
(priority 
bounds)
   11. Remove 
commented code 
with explanation
   
   ### Documentation Needed:
   12. Migration 
guide for 
scheduler API 
changes
   13. Configuration 
documentation 
(`start_chan_size`,
 
`ticker_interval`,
 
`soft_schedule`)
   14. Cron 
expression format 
specification
   
   ---
   
   ## Positive Aspects 👍
   
   1. ✅ 
Well-architected 
component separation
   2. ✅ 
Good configuration 
externalization
   3. ✅ 
Comprehensive 
feature set
   4. ✅ 
Improved 
observability with 
detailed logging
   5. ✅ 
Added Docker 
support documentation
   
   ---
   
   ## Risk Assessment
   
   **Risk 
Level:** 
MEDIUM-HIGH 
due to 
concurrency issues 
in core 
scheduling logic
   
   **Estimated 
Time to 
Fix:** 
2-3 
days for 
critical 
fixes, 
1 week 
for comprehensive 
improvements
   
   The architecture 
and features 
are 
excellent, 
but the 
critical 
concurrency issues 
must be 
resolved before 
this can 
be safely 
merged to 
production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to