DanielCarter-stack commented on PR #10396:
URL: https://github.com/apache/seatunnel/pull/10396#issuecomment-3798624378

   <!-- code-pr-reviewer -->
   <!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10396", "part": 1, 
"total": 1} -->
   ### Issue 1: Missing recursion depth limit may lead to stack overflow
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtil.java:convertDataType`
 method
   
   **Related Context**:
   - Parent class/interface: `SeaTunnelDataType` (type system base class)
   - Caller: `ParquetReadStrategy.java` (via `convertDataType` method)
   
   **Problem Description**:
   When recursively converting nested types, no depth limit is set. Although 
nesting depth is typically small in real business scenarios, maliciously 
crafted schemas or extremely complex real data could lead to stack overflow.
   
   **Potential Risks**:
   - Malicious users construct ultra-deep nested schemas (e.g., 
`ARRAY<ARRAY<...<STRING>...>>` depth 1000+)
   - May cause JVM stack overflow, crashing the Job submission phase
   - Although the risk is low (requires constructing extreme scenarios), it is 
a potential DoS attack surface
   
   **Impact Scope**:
   - Direct impact: `SeaTunnelDataTypeConvertorUtil#convertDataType`
   - Indirect impact: All connectors using this utility class (Parquet, ORC, 
JSON, etc.)
   - Impact area: Core framework
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   private static final int MAX_NESTING_DEPTH = 100;
   private static int currentDepth = 0;
   
   public static SeaTunnelDataType<?> convertDataType(LogicalType type) {
       if (currentDepth > MAX_NESTING_DEPTH) {
           throw new SeaTunnelDataTypeException(
               "Nesting depth exceeds maximum allowed limit: " + 
MAX_NESTING_DEPTH);
       }
       
       if (type instanceof ArrayType) {
           currentDepth++;
           SeaTunnelDataType<?> elementType = convertDataType(((ArrayType) 
type).getElementType());
           currentDepth--;
           return SeaTunnelDataType.array(elementType);
       }
       // Same logic for MAP type
   }
   ```
   
   **Rationale**: Adding a depth limit prevents stack overflow in extreme 
cases, has low implementation cost, and does not affect normal usage scenarios.
   
   ---
   
   ### Issue 2: Missing clear error messages when type conversion fails
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtil.java:convertDataType`
 method
   
   **Problem Description**:
   When encountering unsupported types, the current code may return `null` or 
throw ambiguous exceptions. Users cannot quickly identify which field and what 
type caused the failure.
   
   **Potential Risks**:
   - Difficult debugging: Users only see "type conversion failed" without 
knowing the specific cause
   - Affects user experience and increases support costs
   
   **Impact Scope**:
   - Direct impact: All users using schema configuration
   - Impact area: Single Connector / Multiple Connectors
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   public static SeaTunnelDataType<?> convertDataType(LogicalType type) {
       try {
           if (type instanceof ArrayType) {
               // ... recursive logic
           } else if (type instanceof MapType) {
               // ... recursive logic
           }
           // ... other types
       } catch (Exception e) {
           throw new SeaTunnelDataTypeException(
               String.format("Failed to convert type '%s' (SQL: %s)", 
                   type.getClass().getSimpleName(), 
                   type.asSummaryString()),
               e);
       }
   }
   ```
   
   **Rationale**: Provide detailed error context to help users quickly identify 
problems.
   
   ---
   
   ### Issue 3: Test cases do not cover deep nesting boundary conditions
   
   **Location**: 
`seatunnel-api/src/test/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtilTest.java`
   
   **Problem Description**:
   Although tests cover nested ARRAY and MAP, they do not test extremely deep 
nesting (e.g., depth 50+) or depth limit logic (if the fix for Issue 1 is 
added).
   
   **Potential Risks**:
   - Cannot verify the correctness of depth limit logic
   - Cannot discover potential performance issues or stack overflow risks
   
   **Impact Scope**:
   - Direct impact: Test coverage
   - Indirect impact: Code robustness
   - Impact area: Core framework
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   @Test
   public void testDeepNestingDepth() {
       // Construct nested type with depth 50
       LogicalType deepType = createNestedArrayType(50);
       SeaTunnelDataType<?> result = 
SeaTunnelDataTypeConvertorUtil.convertDataType(deepType);
       assertNotNull(result);
   }
   
   @Test
   public void testExceedMaxNestingDepth() {
       // Construct nested type with depth 101
       LogicalType tooDeepType = createNestedArrayType(101);
       assertThrows(SeaTunnelDataTypeException.class, 
           () -> SeaTunnelDataTypeConvertorUtil.convertDataType(tooDeepType));
   }
   ```
   
   **Rationale**: Boundary testing is key to ensuring code robustness, 
especially for logic involving recursion.
   
   ---
   
   ### Issue 4: Missing JavaDoc documentation for new features
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtil.java`
 class level and method level
   
   **Problem Description**:
   The newly added nested type support feature is not documented in JavaDoc, so 
developers cannot understand this capability through API documentation.
   
   **Potential Risks**:
   - Reduces API discoverability
   - Other developers may reimplement the same functionality
   - Users are unaware that nested type configuration can be used
   
   **Impact Scope**:
   - Direct impact: API documentation users
   - Impact area: All developers using the SeaTunnel API
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   /**
    * Utility class for converting between different data type representations.
    * 
    * <p>This class supports conversion of nested types including:
    * <ul>
    *   <li>ARRAY types (e.g., ARRAY<STRING>, ARRAY<ARRAY<INT>>)</li>
    *   <li>MAP types (e.g., MAP<STRING, INT>, MAP<STRING, ARRAY<DOUBLE>>)</li>
    *   <li>Mixed nesting (e.g., ARRAY<MAP<STRING, INT>>)</li>
    * </ul>
    * 
    * <p>Example usage:
    * <pre>{@code
    * // Define nested array type
    * SeaTunnelDataType<?> nestedArray = SeaTunnelDataType.array(
    *     SeaTunnelDataType.map(SeaTunnelDataType.string(), 
SeaTunnelDataType.intType())
    * );
    * }</pre>
    * 
    * @since 2.3.0
    */
   public class SeaTunnelDataTypeConvertorUtil {
       // ...
   }
   ```
   
   **Rationale**: Good documentation is key to API usability, especially for 
public APIs.
   
   ---
   
   ### Issue 5: User documentation not updated
   
   **Location**: Documentation files (to be supplemented)
   
   **Problem Description**:
   The PR description mentions that users can define nested types, but the 
documentation does not explain the specific syntax format and examples.
   
   **Potential Risks**:
   - Users do not know how to use the new feature
   - Configuration may fail due to syntax errors
   
   **Impact Scope**:
   - Direct impact: Users using File Connector
   - Impact area: Single Connector / Multiple Connectors
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   Update `docs/en/connector-v2/source/FileSource.md`, add nested type 
configuration examples:
   ```hocon
   source {
     FileSource {
       path = "/path/to/parquet"
       format = "parquet"
       schema = {
         fields {
           simple_array = array("string")
           nested_array = array(array("int"))
           map_field = map("string", "int")
           complex_field = array(map("string", array("double")))
         }
       }
     }
   }
   ```
   
   **Rationale**: New features must be accompanied by documentation, otherwise 
users cannot discover and use them.
   
   ---
   
   ### Issue 6: Thread safety issue with currentDepth variable in recursive 
logic
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtil.java`
   
   **Problem Description**:
   If following the suggestion for Issue 1 to add a static variable 
`currentDepth` to track recursion depth, there will be thread safety issues in 
multi-threaded environments. When multiple Jobs are submitted simultaneously, 
the static variable will be shared.
   
   **Potential Risks**:
   - Incorrect depth counting, leading to misjudgment of depth limits
   - False positives or false negatives may occur in concurrent scenarios
   
   **Impact Scope**:
   - Direct impact: `SeaTunnelDataTypeConvertorUtil`
   - Indirect impact: All scenarios where Jobs are submitted concurrently
   - Impact area: Core framework
   
   **Severity**: MAJOR (if implementing the solution for Issue 1)
   
   **Improvement Suggestion**:
   ```java
   // Solution 1: Use ThreadLocal
   private static final ThreadLocal<Integer> currentDepth = 
ThreadLocal.withInitial(() -> 0);
   
   public static SeaTunnelDataType<?> convertDataType(LogicalType type) {
       int depth = currentDepth.get();
       if (depth > MAX_NESTING_DEPTH) {
           throw new SeaTunnelDataTypeException(...);
       }
       
       if (type instanceof ArrayType) {
           currentDepth.set(depth + 1);
           try {
               SeaTunnelDataType<?> elementType = convertDataType(((ArrayType) 
type).getElementType());
               return SeaTunnelDataType.array(elementType);
           } finally {
               currentDepth.set(depth);
           }
       }
   }
   
   // Solution 2: Use instance methods instead of static methods
   public class SeaTunnelDataTypeConverter {
       private int currentDepth = 0;
       
       public SeaTunnelDataType<?> convert(LogicalType type) {
           // Instance methods are naturally isolated
       }
   }
   ```
   
   **Rationale**: Static state is a high-risk design in concurrent scenarios 
and should be avoided or isolated using ThreadLocal.
   
   ---
   
   ### Issue 7: MAP type key type restrictions not validated
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/catalog/SeaTunnelDataTypeConvertorUtil.java:convertDataType`
 method section handling MapType
   
   **Problem Description**:
   SeaTunnel's MAP type typically requires that Key must be a basic type 
(STRING, INT, LONG, etc.), not a complex type (ARRAY, MAP, ROW). The current 
recursive conversion logic does not validate the legality of the Key type, 
which may result in generating invalid MAP types.
   
   **Potential Risks**:
   - Generated MAP type fails at runtime
   - Schema validation passes during Job submission phase but crashes during 
execution phase
   
   **Impact Scope**:
   - Direct impact: Users using nested MAP
   - Indirect impact: When downstream Connectors process MAP types
   - Impact area: Core framework / Multiple Connectors
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   if (type instanceof MapType) {
       LogicalType keyType = ((MapType) type).getKeyType();
       LogicalType valueType = ((MapType) type).getValueType();
       
       // Verify Key type
       if (keyType instanceof ArrayType || keyType instanceof MapType || 
keyType instanceof RowType) {
           throw new SeaTunnelDataTypeException(
               "MAP key type must be a primitive type, but got: " + 
keyType.asSummaryString());
       }
       
       return SeaTunnelDataType.map(
           convertDataType(keyType),
           convertDataType(valueType)
       );
   }
   ```
   
   **Rationale**: Defensive programming - capture illegal configurations during 
the type conversion phase rather than deferring to runtime.
   
   ---


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to