DanielCarter-stack commented on PR #10450:
URL: https://github.com/apache/seatunnel/pull/10450#issuecomment-3851088717

   ### Issue 1: file_split_size lacks input validation
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-s3/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/s3/source/S3FileSourceFactory.java`
   
   **Related Context**:
   - Define configuration option: `S3FileSourceFactory.java:42-55`
   - Use configuration option: `CustomFileSplitGenerator` (not modified in this 
PR)
   
   **Problem Description**:
   The `file_split_size` configuration option is not validated. Users may configure the following invalid values:
   - Negative numbers: lead to calculation/logic errors
   - Zero: may cause infinite loops or division-by-zero errors
   - Excessively large values: may cause a single split to exceed memory limits
   
   **Potential Risks**:
   - **Risk 1**: When configured as 0 or a negative number, it may cause 
`splitEnd` calculation errors, producing abnormal split ranges
   - **Risk 2**: When configured as an extremely large value (e.g., 
Long.MAX_VALUE), it may degrade to single-split reading, but users may 
mistakenly believe splitting is enabled
   - **Risk 3**: No clear error message is produced, making configuration issues difficult for users to troubleshoot
   
   **Impact Scope**:
   - **Direct Impact**: All S3File jobs using `enable_file_split=true`
   - **Affected Area**: Single Connector (S3File)
   
   **Severity**: MAJOR
   
   **Improvement Suggestions**:
   
   ```java
   // S3FileSourceFactory.java
   public static final ConfigOption<Long> FILE_SPLIT_SIZE = ConfigOption
       .key("file_split_size")
       .longType()
       .defaultValue(64 * 1024 * 1024L)
       .withDescription(
           "The file split size (in bytes) when file split is enabled. "
               + "Must be positive. Recommended values are between 1MB and 1GB. "
               + "Note: actual split size may be larger due to row delimiter alignment.");
   
   // Add validation in the prepare method
   @Override
   public void prepare(PrepareConfig config) throws Exception {
       // ... existing code ...
       
       Long fileSplitSize = options.get(FILE_SPLIT_SIZE);
       if (fileSplitSize != null && fileSplitSize <= 0) {
           throw new IllegalArgumentException(
               "file_split_size must be positive, but got: " + fileSplitSize);
       }
       
       // Optional: Add warning
       if (fileSplitSize != null && fileSplitSize < 1024 * 1024) {
           LOG.warn(
                   "file_split_size is less than 1MB, which may cause too many splits. "
                           + "Recommended value: at least 1MB.");
       }
   }
   ```
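
   A companion unit test could pin down the new behavior. This is a minimal, self-contained sketch; `validateFileSplitSize` is a hypothetical helper extracted from the `prepare` logic above:

   ```java
   import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
   import static org.junit.jupiter.api.Assertions.assertThrows;
   import static org.junit.jupiter.api.Assertions.assertTrue;

   import org.junit.jupiter.api.Test;

   public class FileSplitSizeValidationTest {

       @Test
       public void testFileSplitSizeMustBePositive() {
           // Zero and negative values are rejected with a clear message
           IllegalArgumentException e =
                   assertThrows(IllegalArgumentException.class, () -> validateFileSplitSize(0L));
           assertTrue(e.getMessage().contains("file_split_size must be positive"));
           assertThrows(IllegalArgumentException.class, () -> validateFileSplitSize(-1L));

           // The default value (64MB) passes validation
           assertDoesNotThrow(() -> validateFileSplitSize(64L * 1024 * 1024));
       }

       // Hypothetical helper mirroring the validation suggested above
       private static void validateFileSplitSize(Long fileSplitSize) {
           if (fileSplitSize != null && fileSplitSize <= 0) {
               throw new IllegalArgumentException(
                       "file_split_size must be positive, but got: " + fileSplitSize);
           }
       }
   }
   ```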
   
   **Rationale**: Adding configuration validation prevents runtime errors caused by invalid configurations, catches issues early, and gives users clear error messages.
   
   ---
   
   ### Issue 2: TextReadStrategy line separator hardcoding
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java`
   
   **Related Context**:
   - Parent class: `AbstractReadStrategy`
   - Child classes: `CsvReadStrategy`, `JsonReadStrategy`
   - Caller: `FileSourceReader`
   
   **Problem Description**:
   The `adjustSplitEndToNextDelimiter` method hardcodes `\n` as the line separator and does not support Windows-style `\r\n`. Although most files produced on modern systems (including those stored in S3) use Unix-style line endings, issues can arise in some scenarios (e.g., files uploaded directly from Windows).
   
   **Code Snippet**:
   ```java
   protected long adjustSplitEndToNextDelimiter(
           FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) {
       // ...
       int nextNewlinePos = findNextNewlinePosition(content, 0, content.length);
       // ...
   }
   
   private int findNextNewlinePosition(byte[] content, int start, int end) {
       for (int i = start; i < end; i++) {
           if (content[i] == '\n') { // Hardcoded \n
               return i;
           }
       }
       return -1;
   }
   ```
   
   **Potential Risks**:
   - **Risk 1**: If the file uses `\r\n` line breaks, a split may cut at `\r`, leaving the last line of the split with a trailing `\r` character
   - **Risk 2**: Downstream processing may produce parsing errors due to the trailing `\r` (e.g., the last field value carries an extra invisible character)
   - **Risk 3**: The behavior is inconsistent with most CSV parsers, which handle `\r\n` correctly
   
   **Impact Scope**:
   - **Direct Impact**: Using `enable_file_split=true` to read text/csv/json 
files
   - **Indirect Impact**: Downstream sinks may receive dirty data containing 
`\r`
   - **Affected Area**: Single Connector (S3File and other file connectors that 
may use this logic)
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   private int findNextNewlinePosition(byte[] content, int start, int end) {
       for (int i = start; i < end; i++) {
           if (content[i] == '\n') {
               // Unix style: found \n, return position after \n
               return i + 1;
           }
           if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
               // Windows style: found \r\n, return position after \n
               return i + 2;
           }
           if (content[i] == '\r') {
               // Old Mac style: lone \r, return position after \r.
               // Caveat: a \r as the last byte of the buffer may be the first half of
               // a \r\n pair crossing the buffer boundary; the caller should read one
               // byte past the boundary to disambiguate.
               return i + 1;
           }
       }
       return -1;
   }
   ```
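
   A quick unit check covering all three styles could look like the following sketch; it assumes `findNextNewlinePosition` is made package-visible (or extracted into a testable helper) and uses JUnit's `assertEquals`:

   ```java
   @Test
   public void testFindNextNewlinePositionHandlesAllLineEndings() {
       byte[] unix = "abc\ndef".getBytes(StandardCharsets.UTF_8);
       byte[] windows = "abc\r\ndef".getBytes(StandardCharsets.UTF_8);
       byte[] oldMac = "abc\rdef".getBytes(StandardCharsets.UTF_8);

       // Unix: '\n' at index 3, the scan returns the position after it
       assertEquals(4, findNextNewlinePosition(unix, 0, unix.length));
       // Windows: "\r\n" at indices 3-4, the scan returns the position after the pair
       assertEquals(5, findNextNewlinePosition(windows, 0, windows.length));
       // Old Mac: lone '\r' at index 3, the scan returns the position after it
       assertEquals(4, findNextNewlinePosition(oldMac, 0, oldMac.length));
   }
   ```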
   
   **Rationale**: Support mainstream line break formats (Unix `\n`, Windows 
`\r\n`, Old Mac `\r`) to improve system robustness. This aligns with the 
behavior of most text parsers and IDEs.
   
   ---
   
   ### Issue 3: CSV format split fallback logic not explicit enough
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java`
   
   **Related Context**:
   - Parent class: `TextReadStrategy`
   - Caller: `FileSourceReader`
   - Related tests: `ReadStrategySplitFallbackTest`
   
   **Problem Description**:
   In CSV's `prepareRead` method, `start >= end` is used to decide whether to skip the header. This logic implicitly encodes "fall back to non-split reading when the split has no valid range," but the intent is not explicit and future maintainers may misread it.
   
   **Code Snippet**:
   ```java
   public void prepareRead(...) {
       if (start >= end) {
           // fallback to non-splitting read
           skipHeader(in);
       }
       // ...
       adjustSplitEndToNextDelimiter(sourceSplit, end, rowDelimiter);
   }
   ```
   
   **Potential Risks**:
   - **Risk 1**: Reduced code readability; maintainers may not understand why the header is skipped when `start >= end`
   - **Risk 2**: If the way `prepareRead` is called changes in the future, this implicit assumption may be silently broken
   - **Risk 3**: No logging, so whether the fallback occurred cannot be traced
   
   **Impact Scope**:
   - **Direct Impact**: CSV file split behavior
   - **Affected Area**: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   public void prepareRead(...) {
       // Explicit check: if the split has no valid range, fall back to non-splitting read
       if (start >= end) {
           LOG.debug(
                   "Split {} has no valid range (start={}, end={}), "
                           + "falling back to non-splitting read with header skip",
                   sourceSplit.splitId(), start, end);
           skipHeader(in);
       }

       // ... rest of the code ...
   }
   ```
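
   A small regression test could lock the fallback in place. A sketch in the style of the existing tests; `createCsvFileWithHeader` and the `prepareRead` parameters are hypothetical placeholders:

   ```java
   @Test
   public void testCsvFallbackSkipsHeaderWhenSplitRangeIsEmpty() throws Exception {
       // Header + 10 data rows (helper name is hypothetical)
       Path csvFile = createCsvFileWithHeader(10);

       CsvReadStrategy readStrategy = new CsvReadStrategy();
       // start >= end: an empty range should trigger the non-splitting fallback
       FileSourceSplit emptyRangeSplit = new FileSourceSplit(0, csvFile, 100, 100, null);

       readStrategy.prepareRead(/* params */);

       // Verify: all 10 data rows are read and the header row is not emitted
       // ...
   }
   ```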
   
   **Rationale**: Add comments and logs to explicitly state fallback behavior, 
improving code readability and maintainability.
   
   ---
   
   ### Issue 4: Parquet split error handling improvements not uniformly applied to other formats
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/split/ParquetFileSplitStrategy.java`
   
   **Related Context**:
   - Related classes: `TextReadStrategy`, `CsvReadStrategy`
   - Comparison: Parquet has enhanced error handling, but Text/CSV does not
   
   **Problem Description**:
   When a Parquet split fails, the error message is enriched with the file path and root cause, but Text/CSV split failures receive no similar treatment. If a Text/CSV split fails (less likely, but possible), users will struggle to troubleshoot it.
   
   **Code Comparison**:
   
   **Parquet (Improved)**:
   ```java
   throw new IOException(
       String.format("Failed to get split for file: %s", filePath), e);
   ```
   
   **Text/CSV (Not Improved)**:
   ```java
   // No explicit error handling; relies on the framework's default exception propagation
   ```
   
   **Potential Risks**:
   - **Risk 1**: When Text/CSV split fails, error messages may not be detailed 
enough to locate the problem
   - **Risk 2**: Inconsistent error handling behavior between different formats 
increases user confusion
   - **Risk 3**: If `adjustSplitEndToNextDelimiter` throws an exception (e.g., 
array out of bounds), there is no context information
   
   **Impact Scope**:
   - **Direct Impact**: Text/CSV/JSON file split error diagnosis
   - **Affected Area**: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   // TextReadStrategy.java
   // Note: wrapping failures in a checked IOException means the method signature
   // (and its overrides/callers) must declare "throws IOException".
   protected long adjustSplitEndToNextDelimiter(
           FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) throws IOException {
       try {
           // ... existing logic ...
       } catch (Exception e) {
           throw new IOException(
                   String.format(
                           "Failed to adjust split end for file: %s, splitId: %s, splitEnd: %d",
                           sourceSplit.path(), sourceSplit.splitId(), splitEnd),
                   e);
       }
   }
   ```
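
   A matching unit test could assert the enriched context, mirroring the Parquet failure test suggested in Issue 5 below (a sketch; `splitOverMissingFile` is a hypothetical helper that points a split at a deleted file):

   ```java
   @Test
   public void testAdjustSplitEndFailureContainsFileContext() {
       TextReadStrategy readStrategy = new TextReadStrategy();
       // A split whose underlying file has been removed, to force an I/O failure
       FileSourceSplit split = splitOverMissingFile();

       IOException exception = assertThrows(IOException.class, () ->
               readStrategy.adjustSplitEndToNextDelimiter(
                       split, 1024L, "\n".getBytes(StandardCharsets.UTF_8)));

       assertTrue(exception.getMessage().contains("Failed to adjust split end for file:"));
   }
   ```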
   
   **Rationale**: Unify error handling style to improve error diagnostic 
capabilities for all formats.
   
   ---
   
   ### Issue 5: Missing unit tests for Parquet split functionality
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ReadStrategySplitFallbackTest.java`
   
   **Related Context**:
   - Existing tests: Only covers text and csv
   - Related classes: `ParquetFileSplitStrategy`, `ParquetReadStrategy`
   
   **Problem Description**:
   Unit test `ReadStrategySplitFallbackTest` only tests fallback behavior for 
text and csv, but does not test Parquet split behavior. Parquet's split logic 
differs from text/csv (based on RowGroup, no line alignment needed) and should 
be tested separately.
   
   **Potential Risks**:
   - **Risk 1**: After Parquet split logic is modified, there may be no test 
coverage, leading to regression issues
   - **Risk 2**: Unable to verify correctness of Parquet split (e.g., whether 
it correctly splits by RowGroup)
   - **Risk 3**: Unable to verify error handling improvements when Parquet 
split fails
   
   **Impact Scope**:
   - **Direct Impact**: Parquet file split functionality quality
   - **Affected Area**: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   // ReadStrategySplitFallbackTest.java
   @Test
   public void testParquetSplitWithValidRange() throws Exception {
       // Create a test Parquet file with multiple RowGroups
       Path testParquetFile = createTestParquetFile(3); // 3 RowGroups

       ParquetReadStrategy readStrategy = new ParquetReadStrategy();
       FileSourceSplit split = new FileSourceSplit(0, testParquetFile, 0, 1024, null);

       // Prepare read should succeed
       readStrategy.prepareRead(/* params */);

       // Verify that the split is handled correctly
       // ...
   }

   @Test
   public void testParquetSplitFailureMessage() throws Exception {
       // Test the enhanced error message
       Path invalidParquetFile = createInvalidParquetFile();

       ParquetFileSplitStrategy splitStrategy = new ParquetFileSplitStrategy();

       IOException exception = assertThrows(IOException.class, () -> {
           splitStrategy.getSplits(invalidParquetFile, /* params */);
       });

       assertTrue(exception.getMessage().contains("Failed to get split for file:"));
       assertTrue(exception.getMessage().contains(invalidParquetFile.toString()));
   }
   ```
   
   **Rationale**: Improve test coverage to ensure correctness of Parquet split 
and effectiveness of error handling.
   
   ---
   
   ### Issue 6: JSON format split support is unverified
   
   **Location**: The documentation mentions JSON split support, but there is no explicit verification in the code
   
   **Related Context**:
   - Documentation: `docs/en/connectors/source/S3File.md` mentions json format 
support
   - Related classes: `JsonReadStrategy` (not modified in this PR)
   - Inheritance: `JsonReadStrategy` may inherit from `TextReadStrategy` or 
other classes
   
   **Problem Description**:
   The PR description mentions split support for the json format, but the change list contains no modifications to `JsonReadStrategy.java`. The following needs to be confirmed:
   1. Does `JsonReadStrategy` inherit from `TextReadStrategy`? (If so, split is 
automatically supported)
   2. Does JSON split require special line alignment logic? (e.g., multi-line 
JSON objects)
   3. Has JSON split functionality been tested?
   
   **Potential Risks**:
   - **Risk 1**: If `JsonReadStrategy` does not inherit from `TextReadStrategy`, the JSON format does not actually support split, and the documentation misleads users
   - **Risk 2**: If JSON files contain multi-line JSON objects (i.e., pretty-printed JSON rather than JSON Lines), the current logic may not handle them correctly
   - **Risk 3**: There are no tests, so it cannot be confirmed that JSON split works properly
   
   **Impact Scope**:
   - **Direct Impact**: JSON file split functionality
   - **Affected Area**: Single Connector
   
   **Severity**: MAJOR (if JSON is actually unsupported) / MINOR (if JSON is already supported automatically)
   
   **Improvement Suggestions**:
   Further inspection of `JsonReadStrategy` implementation is needed:
   
   1. If `JsonReadStrategy` inherits from `TextReadStrategy`:
      - Add unit tests for the JSON format (see the sketch after this list)
      - Document that JSON split only supports the JSON Lines format (one JSON object per line)
   
   2. If `JsonReadStrategy` does not inherit from `TextReadStrategy`:
      - Implement similar split logic, or explicitly state JSON split is not 
supported
      - Update documentation to remove JSON split description
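
   For case 1, a minimal test could mirror the existing text/csv fallback tests. A sketch; `createTestJsonLinesFile` and the `prepareRead` parameters are hypothetical placeholders, as in the Issue 5 examples:

   ```java
   @Test
   public void testJsonLinesSplitAlignsToLineBoundary() throws Exception {
       // One JSON object per line, so line alignment equals record alignment
       Path testJsonFile = createTestJsonLinesFile(1000); // 1000 records

       JsonReadStrategy readStrategy = new JsonReadStrategy();
       // A split ending mid-file; its end should be adjusted to the next '\n'
       FileSourceSplit split = new FileSourceSplit(0, testJsonFile, 0, 4096, null);

       readStrategy.prepareRead(/* params */);

       // Verify: exactly 1000 records are read across all splits,
       // and no split yields a partially parsed JSON object
       // ...
   }
   ```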
   
   **Rationale**: Ensure documentation and implementation are consistent, 
avoiding user misunderstanding.
   
   ---
   
   ### Issue 7: E2E tests do not cover all scenarios
   
   **Location**: 
`seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-s3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/file/s3/S3FileWithFilterIT.java`
   
   **Related Context**:
   - Existing tests: `testS3FileTextEnableSplitToAssert` (tests CSV + header)
   - Missing scenarios: Parquet, JSON, text without header, split edge cases
   
   **Problem Description**:
   E2E tests only cover CSV + header scenarios, missing tests for the following:
   1. Parquet file split (need to verify RowGroup splitting)
   2. JSON file split
   3. Text file split without header
   4. Split edge cases (file size is exactly a multiple of `file_split_size`)
   5. Performance comparison with different `file_split_size` configurations
   
   **Potential Risks**:
   - **Risk 1**: Parquet and JSON split functionality not verified by E2E, may 
fail in real environments
   - **Risk 2**: Edge cases not tested, may cause issues under specific data 
distributions
   - **Risk 3**: Unable to evaluate actual performance improvement of split 
functionality
   
   **Impact Scope**:
   - **Direct Impact**: Test coverage and functionality reliability
   - **Affected Area**: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   // S3FileWithFilterIT.java
   
   @Test
   public void testS3FileParquetEnableSplit() throws Exception {
       // Test Parquet file with multiple RowGroups
       // Verify split behavior and data correctness
   }
   
   @Test
   public void testS3FileJsonEnableSplit() throws Exception {
       // Test JSON Lines file
       // Verify split behavior and data correctness
   }
   
   @Test
   public void testS3FileTextNoHeaderEnableSplit() throws Exception {
       // Test text file without header
       // Verify no header is skipped
   }
   
   @Test
   public void testS3FileSplitBoundaryCase() throws Exception {
       // Test file size exactly equals file_split_size
       // Verify no data duplication or loss at boundaries
   }
   ```
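
   For the boundary case, the test fixture could generate a payload whose byte length is an exact multiple of the configured split size. A sketch with an illustrative helper (row content and helper name are assumptions):

   ```java
   // Hypothetical helper: build an ASCII payload of exactly targetBytes bytes,
   // so that with file_split_size dividing targetBytes evenly, every split
   // boundary lands on an exact multiple and the alignment logic is exercised.
   private static String generateExactSizePayload(long targetBytes) {
       StringBuilder sb = new StringBuilder();
       String row = "0123456789,abcdefghij\n"; // 22 bytes per row (ASCII)
       while (sb.length() + row.length() <= targetBytes) {
           sb.append(row);
       }
       // Pad a final newline-terminated row so the total hits targetBytes exactly
       long remaining = targetBytes - sb.length();
       for (int i = 0; i < remaining - 1; i++) {
           sb.append('x');
       }
       if (remaining > 0) {
           sb.append('\n');
       }
       return sb.toString();
   }
   ```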
   
   **Rationale**: Improve test coverage to ensure all declared supported 
formats and scenarios are verified by E2E.
   
   ---
   
   ### Issue 8: Documentation lacks detailed explanation of split limitations
   
   **Location**: `docs/en/connectors/source/S3File.md` and 
`docs/zh/connectors/source/S3File.md`
   
   **Problem Description**:
   Documentation mentions that only "non-compressed formats" support split, but does not explain in detail:
   1. Which compression formats are not supported? (gz, zip, bz2, lz4, snappy, etc.)
   2. What happens if users enable split for compressed files? (Automatic fallback? An error?)
   3. Does the snappy compression used internally by Parquet support split? (It should, because Parquet split is logical, by RowGroup, rather than byte-range based)
   4. Does split affect data order? (It may, because of parallel reading)
   
   **Potential Risks**:
   - **Risk 1**: Users may enable split for compressed files expecting a performance improvement, but see no effect
   - **Risk 2**: Users do not understand why certain files cannot be split, causing confusion
   - **Risk 3**: Order-dependent scenarios (e.g., log files) may see their data order change when split is enabled
   
   **Impact Scope**:
   - **Direct Impact**: User understanding and configuration
   - **Affected Area**: Single Connector's user experience
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   Add detailed explanation in documentation:
   
   ```markdown
   ### File Split Limitations

   - **Supported formats**:
     - Text (plain text files)
     - CSV (with or without header)
     - JSON Lines (one JSON object per line)
     - Parquet (split by RowGroup)

   - **Unsupported formats**:
     - Compressed text files (e.g., .gz, .bz2, .zip, .lz4) - split will be automatically disabled
     - Excel (.xlsx)
     - XML
     - Single-line JSON files (not JSON Lines)

   - **Behavior with unsupported formats**:
     If you enable `enable_file_split` for an unsupported format, the system will
     automatically fall back to non-splitting mode and emit a warning log.

   - **Data ordering**:
     When file split is enabled, data may be read out of order across splits.
     If strict ordering is required, do not enable file split, or use a single-split strategy.

   - **Parquet compression**:
     Parquet files with internal compression (e.g., Snappy, Gzip) are fully supported,
     because Parquet split is based on RowGroup boundaries, not raw byte ranges.
   ```
   
   **Rationale**: Provide clear and complete descriptions of the limitations so that users can understand and configure the feature correctly.
   
   ---
   
   ### Issue 9: Missing Metrics for split performance monitoring
   
   **Location**: The entire PR adds no Metrics-related code
   
   **Related Context**:
   - SeaTunnel Metrics framework
   - Related classes: `FileSourceReader`, `FileSourceSplitEnumerator`
   
   **Problem Description**:
   The PR does not add any Metrics to monitor split behavior and performance, 
including:
   1. Actual number of splits generated
   2. Size distribution of each split
   3. Split read duration
   4. Split failure rate
   5. Frequency of split alignment adjustments
   
   **Potential Risks**:
   - **Risk 1**: Users cannot monitor whether split functionality is effective
   - **Risk 2**: Unable to tune `file_split_size` parameter
   - **Risk 3**: Unable to diagnose split-related performance issues
   - **Risk 4**: Unable to evaluate actual effectiveness of split functionality
   
   **Impact Scope**:
   - **Direct Impact**: Observability and operational capabilities
   - **Affected Area**: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   ```java
   // FileSourceReader.java (pseudocode)
   private Counter splitCounter;
   private Histogram splitSizeHistogram;
   private Timer splitReadTimer;
   
   public void open() {
       splitCounter = context.metricRegistry().counter("file.split.count");
       splitSizeHistogram = context.metricRegistry().histogram("file.split.size");
       splitReadTimer = context.metricRegistry().timer("file.split.read.time");
   }
   
   public void readNext() {
       Timer.Context timeContext = splitReadTimer.time();
       try {
           // ... reading logic ...
           splitCounter.inc();
           splitSizeHistogram.update(splitSize);
       } finally {
           timeContext.stop();
       }
   }
   ```
   
   **Rationale**: Provide observability to help users monitor and tune split 
functionality.
   
   ---

