dataroaring commented on PR #55666: URL: https://github.com/apache/doris/pull/55666#issuecomment-3284047747
## Comprehensive Code Review - PR #55666 ### **Summary** This PR enhances error reporting for load operations by providing immediate feedback about the first data quality error, eliminating the need for users to manually access error URLs. The implementation spans both backend (C++) and frontend (Java) components with comprehensive coverage across all load types. ### **Key Strengths** โ - **Significant UX Improvement**: Users get immediate error details instead of accessing separate URLs - **Comprehensive Implementation**: Covers Stream Load, Broker Load, Insert Load, and Group Commit operations - **Backward Compatibility**: Preserves existing `error_url` functionality - **Consistent Architecture**: Clean error message propagation across the codebase ### **Critical Issues** ๐จ #### 1. **Thread Safety (CRITICAL)** **File**: `be/src/runtime/runtime_state.cpp:375` ```cpp _first_error_msg = error_msg() + ". Src line: " + line(); ``` **Issue**: Direct assignment without synchronization. Multiple threads can access `_first_error_msg` concurrently. **Fix**: Add mutex protection similar to the `_error_url_lock` pattern used elsewhere. #### 2. **Memory Management (CRITICAL)** **Issue**: No bounds checking on error message length could cause memory issues in high-throughput scenarios. **Fix**: Implement message size limits (suggest 1KB maximum) with intelligent truncation. #### 3. **Protocol Definitions Incomplete (MAJOR)** **Files**: `gensrc/proto/internal_service.proto`, `gensrc/thrift/FrontendService.thrift` **Issue**: References to `first_error_msg` fields but actual field definitions appear missing. **Fix**: Complete the protocol schema definitions for proper serialization. ### **Major Issues** โ ๏ธ #### 4. **Race Conditions** **File**: `be/src/pipeline/pipeline_fragment_context.cpp:1864-1878` **Issue**: Task state iteration without proper locking could cause race conditions. **Fix**: Add synchronization for task state access. #### 5. **Input Validation Missing** **Issue**: Error messages exposed to users lack sanitization and could reveal sensitive information. **Fix**: Implement input sanitization before displaying messages to users. ### **Minor Issues** ๐ #### 6. **Inconsistent Null Handling** **File**: `fe/fe-core/src/main/java/org/apache/doris/load/EtlStatus.java` **Issue**: Getter/setter don't handle null cases consistently. **Fix**: Add consistent null safety checks using `Strings.isNullOrEmpty()`. #### 7. **Arbitrary Truncation** **File**: `fe/fe-core/src/main/java/org/apache/doris/load/loadv2/LoadJob.java:805` ```java jobInfo.add(StringUtils.abbreviate(loadingStatus.getFirstErrorMsg(), 256)); ``` **Issue**: 256-character limit may cut important error information. **Fix**: Use intelligent truncation that preserves key error details or define a constant for the limit. ### **Missing Test Coverage** ๐งช - Concurrent error message handling scenarios - Error message truncation behavior - Protocol buffer serialization/deserialization - Performance impact testing for high-throughput loads - Edge cases in error propagation paths ### **Security Considerations** ๐ - Error messages might expose sensitive data structure information - Large error messages could impact system performance (DoS potential) - User data in error messages should be sanitized ### **Recommendations by Priority** **๐ด High Priority (Must Fix Before Merge):** 1. Implement proper thread synchronization for all `first_error_msg` access 2. Add message size limits with intelligent truncation 3. Complete protocol buffer and thrift schema definitions 4. Add input sanitization for user-facing error messages **๐ก Medium Priority (Should Fix):** 1. Add comprehensive unit and integration tests 2. Standardize error message formatting across components 3. Add configuration options for error message behavior 4. Implement performance benchmarking for error handling paths **๐ข Low Priority (Nice to Have):** 1. Improve inline documentation and code comments 2. Add metrics and monitoring for error message handling 3. Consider internationalization for error messages ### **Overall Recommendation: REQUEST CHANGES** While this PR addresses a legitimate user need with a well-architected solution, several critical production safety issues must be resolved: 1. **Thread Safety**: Lack of synchronization poses significant risk in production 2. **Memory Management**: Unbounded string operations could impact system stability 3. **Incomplete Implementation**: Missing protocol definitions suggest incomplete implementation The core concept and approach are excellent. Once these critical issues are addressed, this will be a valuable enhancement to the Doris load operation experience. **Files Requiring Immediate Attention:** - `be/src/runtime/runtime_state.cpp` - Thread safety fixes - `be/src/runtime/runtime_state.h` - Add synchronization documentation - `gensrc/proto/internal_service.proto` - Complete field definitions - `gensrc/thrift/FrontendService.thrift` - Complete field definitions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
