DanielCarter-stack commented on PR #10450:
URL: https://github.com/apache/seatunnel/pull/10450#issuecomment-3851088717
### Issue 1: file_split_size lacks input validation
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-s3/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/s3/source/S3FileSourceFactory.java`
**Related Context**:
- Define configuration option: `S3FileSourceFactory.java:42-55`
- Use configuration option: `CustomFileSplitGenerator` (not modified in this
PR)
**Problem Description**:
The `file_split_size` configuration option is not validated. Users may configure the following invalid values:
- Negative numbers: lead to logic errors
- Zero: may cause infinite loops or division-by-zero errors
- Excessively large values: may cause a single split to exceed memory limits
**Potential Risks**:
- **Risk 1**: When configured as 0 or a negative number, it may cause
`splitEnd` calculation errors, producing abnormal split ranges
- **Risk 2**: When configured as an extremely large value (e.g.,
Long.MAX_VALUE), it may degrade to single-split reading, but users may
mistakenly believe splitting is enabled
- **Risk 3**: No friendly error messages, making it difficult for users to
troubleshoot configuration issues
**Impact Scope**:
- **Direct Impact**: All S3File jobs using `enable_file_split=true`
- **Affected Area**: Single Connector (S3File)
**Severity**: MAJOR
**Improvement Suggestions**:
```java
// S3FileSourceFactory.java
public static final ConfigOption<Long> FILE_SPLIT_SIZE = ConfigOption
        .key("file_split_size")
        .longType()
        .defaultValue(64 * 1024 * 1024L)
        .withDescription(
                "The file split size (in bytes) when file split is enabled. "
                        + "Must be positive. Recommended values are between 1MB and 1GB. "
                        + "Note: actual split size may be larger due to row delimiter alignment.");

// Add validation in the prepare method
@Override
public void prepare(PrepareConfig config) throws Exception {
    // ... existing code ...
    Long fileSplitSize = options.get(FILE_SPLIT_SIZE);
    if (fileSplitSize != null && fileSplitSize <= 0) {
        throw new IllegalArgumentException(
                "file_split_size must be positive, but got: " + fileSplitSize);
    }
    // Optional: warn when the value is small enough to generate excessive splits
    if (fileSplitSize != null && fileSplitSize < 1024 * 1024) {
        LOG.warn("file_split_size is less than 1MB, which may cause too many splits. "
                + "Recommended value: at least 1MB.");
    }
}
```
**Rationale**: Validating the configuration prevents runtime errors caused by invalid values, catches problems early, and gives users actionable error messages.
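A minimal negative test could pin this behavior down. This is only a sketch: `validateFileSplitSize` is a hypothetical helper assumed to wrap the check shown above.
```java
// Hypothetical JUnit 5 test for the proposed validation; validateFileSplitSize
// is an assumed helper extracted from the prepare(...) logic above.
@Test
public void testNonPositiveFileSplitSizeRejected() {
    IllegalArgumentException e =
            Assertions.assertThrows(
                    IllegalArgumentException.class, () -> validateFileSplitSize(0L));
    Assertions.assertTrue(e.getMessage().contains("file_split_size must be positive"));
}
```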
---
### Issue 2: TextReadStrategy hardcodes the line separator
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java`
**Related Context**:
- Parent class: `AbstractReadStrategy`
- Child classes: `CsvReadStrategy`, `JsonReadStrategy`
- Caller: `FileSourceReader`
**Problem Description**:
The `adjustSplitEndToNextDelimiter` method hardcodes `\n` as the line separator and does not support Windows-style `\r\n`. Although most modern systems and S3 storage use Unix-style line breaks, issues may arise in certain scenarios (e.g., files uploaded directly from Windows).
**Code Snippet**:
```java
protected long adjustSplitEndToNextDelimiter(
FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) {
// ...
int nextNewlinePos = findNextNewlinePosition(content, 0, content.length);
// ...
}
private int findNextNewlinePosition(byte[] content, int start, int end) {
for (int i = start; i < end; i++) {
if (content[i] == '\n') { // Hardcoded \n
return i;
}
}
return -1;
}
```
**Potential Risks**:
- **Risk 1**: If the file uses `\r\n` line breaks, a split may cut at `\r`, leaving a trailing `\r` on the last line of the split
- **Risk 2**: Downstream processing may hit parsing errors due to the trailing `\r` (e.g., a field value gaining an extra invisible character)
- **Risk 3**: Inconsistent with expectations of some CSV parsers (most CSV
parsers can correctly handle `\r\n`)
**Impact Scope**:
- **Direct Impact**: Jobs that read text/csv/json files with `enable_file_split=true`
- **Indirect Impact**: Downstream sinks may receive dirty data containing
`\r`
- **Affected Area**: Single Connector (S3File and other file connectors that
may use this logic)
**Severity**: MINOR
**Improvement Suggestions**:
```java
// Note: unlike the original, this returns the position *after* the delimiter,
// so the caller's offset arithmetic must be adjusted to match.
private int findNextNewlinePosition(byte[] content, int start, int end) {
    for (int i = start; i < end; i++) {
        if (content[i] == '\n') {
            // Unix style: found \n, return position after \n
            return i + 1;
        }
        if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
            // Windows style: found \r\n, return position after \n
            return i + 2;
        }
        if (content[i] == '\r') {
            // Old Mac style: found lone \r, return position after \r
            // Caveat: a \r at the very end of the window may be the first half
            // of a \r\n that straddles the read boundary; reading one extra
            // byte past the window would disambiguate.
            return i + 1;
        }
    }
    return -1;
}
```
**Rationale**: Support mainstream line break formats (Unix `\n`, Windows
`\r\n`, Old Mac `\r`) to improve system robustness. This aligns with the
behavior of most text parsers and IDEs.
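A quick unit check of the three styles could accompany the change (a sketch; it assumes `findNextNewlinePosition` is reachable from a test, e.g. via package-private visibility):
```java
// Sketch of a JUnit 5 test covering all three delimiter styles; return values
// follow the "position after the delimiter" convention proposed above.
@Test
public void testFindNextNewlinePositionHandlesAllStyles() {
    byte[] unix = "a\nb".getBytes(StandardCharsets.UTF_8);
    byte[] windows = "a\r\nb".getBytes(StandardCharsets.UTF_8);
    byte[] oldMac = "a\rb".getBytes(StandardCharsets.UTF_8);
    Assertions.assertEquals(2, findNextNewlinePosition(unix, 0, unix.length));
    Assertions.assertEquals(3, findNextNewlinePosition(windows, 0, windows.length));
    Assertions.assertEquals(2, findNextNewlinePosition(oldMac, 0, oldMac.length));
}
```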
---
### Issue 3: CSV split fallback logic is not explicit enough
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java`
**Related Context**:
- Parent class: `TextReadStrategy`
- Caller: `FileSourceReader`
- Related tests: `ReadStrategySplitFallbackTest`
**Problem Description**:
In CSV's `prepareRead` method, `start >= end` is used to decide whether to skip the header. This logic implicitly encodes "fall back to non-split reading when the split has no valid range," but the behavior is not explicit and may mislead future maintainers.
**Code Snippet**:
```java
public void prepareRead(...) {
if (start >= end) {
// fallback to non-splitting read
skipHeader(in);
}
// ...
adjustSplitEndToNextDelimiter(sourceSplit, end, rowDelimiter);
}
```
**Potential Risks**:
- **Risk 1**: Reduced code readability; maintainers may not understand why the header is skipped when `start >= end`
- **Risk 2**: If the way `prepareRead` is called changes in the future, this implicit assumption may be broken
- **Risk 3**: Without logging, there is no way to tell whether the fallback occurred
**Impact Scope**:
- **Direct Impact**: CSV file split behavior
- **Affected Area**: Single Connector
**Severity**: MINOR
**Improvement Suggestions**:
```java
public void prepareRead(...) {
    // Explicit check: if the split has no valid range, fall back to non-splitting read
    if (start >= end) {
        LOG.debug(
                "Split {} has no valid range (start={}, end={}), "
                        + "falling back to non-splitting read with header skip",
                sourceSplit.splitId(), start, end);
        skipHeader(in);
    }
    // ... rest of the code ...
}
```
**Rationale**: Add comments and logs to explicitly state fallback behavior,
improving code readability and maintainability.
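If the fallback condition should carry its intent in code rather than comments, extracting it into a named helper is another option (a sketch; the method name is hypothetical):
```java
// Hypothetical refactor: name the implicit fallback condition so the intent
// survives future changes to how prepareRead is invoked.
private boolean shouldFallbackToNonSplitRead(long start, long end) {
    return start >= end;
}
```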
---
### Issue 4: Parquet split error handling improvements not uniformly applied to other formats
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/split/ParquetFileSplitStrategy.java`
**Related Context**:
- Related classes: `TextReadStrategy`, `CsvReadStrategy`
- Comparison: Parquet has enhanced error handling, but Text/CSV does not
**Problem Description**:
When a Parquet split fails, the error message is enhanced (it includes the file path and root cause), but Text/CSV split failures receive no similar treatment. If a Text/CSV split fails (though less likely), users will struggle to troubleshoot.
**Code Comparison**:
**Parquet (Improved)**:
```java
throw new IOException(
String.format("Failed to get split for file: %s", filePath), e);
```
**Text/CSV (Not Improved)**:
```java
// No explicit error handling; relies on the framework's default exception propagation
```
**Potential Risks**:
- **Risk 1**: When Text/CSV split fails, error messages may not be detailed
enough to locate the problem
- **Risk 2**: Inconsistent error handling behavior between different formats
increases user confusion
- **Risk 3**: If `adjustSplitEndToNextDelimiter` throws an exception (e.g.,
array out of bounds), there is no context information
**Impact Scope**:
- **Direct Impact**: Text/CSV/JSON file split error diagnosis
- **Affected Area**: Single Connector
**Severity**: MINOR
**Improvement Suggestions**:
```java
// TextReadStrategy.java
protected long adjustSplitEndToNextDelimiter(
        FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) throws IOException {
    try {
        // ... existing logic computing the adjusted split end ...
        return adjustedSplitEnd;
    } catch (Exception e) {
        throw new IOException(
                String.format(
                        "Failed to adjust split end for file: %s, splitId: %s, splitEnd: %d",
                        sourceSplit.path(), sourceSplit.splitId(), splitEnd),
                e);
    }
}
```
**Rationale**: Unify error handling style to improve error diagnostic
capabilities for all formats.
---
### Issue 5: Missing unit tests for Parquet split functionality
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ReadStrategySplitFallbackTest.java`
**Related Context**:
- Existing tests: Only covers text and csv
- Related classes: `ParquetFileSplitStrategy`, `ParquetReadStrategy`
**Problem Description**:
The unit test `ReadStrategySplitFallbackTest` only covers fallback behavior for text and csv; it does not exercise Parquet split behavior. Parquet's split logic differs from text/csv (it splits by RowGroup and needs no line alignment) and should be tested separately.
**Potential Risks**:
- **Risk 1**: After Parquet split logic is modified, there may be no test
coverage, leading to regression issues
- **Risk 2**: Unable to verify correctness of Parquet split (e.g., whether
it correctly splits by RowGroup)
- **Risk 3**: Unable to verify error handling improvements when Parquet
split fails
**Impact Scope**:
- **Direct Impact**: Parquet file split functionality quality
- **Affected Area**: Single Connector
**Severity**: MINOR
**Improvement Suggestions**:
```java
// ReadStrategySplitFallbackTest.java
@Test
public void testParquetSplitWithValidRange() throws Exception {
    // Create a test Parquet file with multiple RowGroups
    Path testParquetFile = createTestParquetFile(3); // 3 RowGroups
    ParquetReadStrategy readStrategy = new ParquetReadStrategy();
    FileSourceSplit split = new FileSourceSplit(0, testParquetFile, 0, 1024, null);
    // Prepare read should succeed
    readStrategy.prepareRead(/* params */);
    // Verify that the split is handled correctly
    // ...
}

@Test
public void testParquetSplitFailureMessage() throws Exception {
    // Test the enhanced error message
    Path invalidParquetFile = createInvalidParquetFile();
    ParquetFileSplitStrategy splitStrategy = new ParquetFileSplitStrategy();
    IOException exception =
            assertThrows(
                    IOException.class,
                    () -> splitStrategy.getSplits(invalidParquetFile, /* params */));
    assertTrue(exception.getMessage().contains("Failed to get split for file:"));
    assertTrue(exception.getMessage().contains(invalidParquetFile.toString()));
}
```
**Rationale**: Improve test coverage to ensure correctness of Parquet split
and effectiveness of error handling.
---
### Issue 6: Missing split support for JSON format
**Location**: Documentation mentions JSON format split support, but no
explicit verification in code
**Related Context**:
- Documentation: `docs/en/connectors/source/S3File.md` mentions json format
support
- Related classes: `JsonReadStrategy` (not modified in this PR)
- Inheritance: `JsonReadStrategy` may inherit from `TextReadStrategy` or
other classes
**Problem Description**:
The PR description mentions split support for the json format, but the change list contains no modifications to `JsonReadStrategy.java`. The following needs confirmation:
1. Does `JsonReadStrategy` inherit from `TextReadStrategy`? (If so, split is supported automatically)
2. Does JSON split require special line alignment logic (e.g., for multi-line JSON objects)?
3. Has JSON split functionality been tested?
**Potential Risks**:
- **Risk 1**: If `JsonReadStrategy` does not inherit from
`TextReadStrategy`, JSON format does not support split, but documentation
misleads users
- **Risk 2**: If JSON files contain multi-line JSON objects (e.g., JSON
Lines), current logic may not handle correctly
- **Risk 3**: Lack of testing, unable to confirm if JSON split works properly
**Impact Scope**:
- **Direct Impact**: JSON file split functionality
- **Affected Area**: Single Connector
**Severity**: MAJOR (if JSON split is actually unsupported) / MINOR (if it is inherited automatically)
**Improvement Suggestions**:
Further inspection of the `JsonReadStrategy` implementation is needed:
1. If `JsonReadStrategy` inherits from `TextReadStrategy`:
   - Add unit tests for the JSON format (see the sketch below)
   - Document that JSON split only supports the JSON Lines format (one JSON object per line)
2. If `JsonReadStrategy` does not inherit from `TextReadStrategy`:
   - Implement similar split logic, or explicitly state that JSON split is not supported
   - Update the documentation to remove the JSON split description
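As a starting point, the inheritance assumption itself can be pinned down by a test (a sketch; it only asserts the class relationship this PR appears to rely on):
```java
// Hypothetical JUnit 5 test: if JSON split support is meant to come "for free"
// through inheritance, assert the relationship so a refactor cannot silently
// break it.
@Test
public void testJsonReadStrategyInheritsTextSplitLogic() {
    Assertions.assertTrue(
            TextReadStrategy.class.isAssignableFrom(JsonReadStrategy.class),
            "JsonReadStrategy no longer inherits TextReadStrategy's split logic");
}
```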
**Rationale**: Ensure documentation and implementation are consistent,
avoiding user misunderstanding.
---
### Issue 7: E2E tests do not cover all scenarios
**Location**:
`seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-s3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/file/s3/S3FileWithFilterIT.java`
**Related Context**:
- Existing tests: `testS3FileTextEnableSplitToAssert` (tests CSV + header)
- Missing scenarios: Parquet, JSON, text without header, split edge cases
**Problem Description**:
E2E tests only cover the CSV + header scenario; the following are missing:
1. Parquet file split (need to verify RowGroup splitting)
2. JSON file split
3. Text file split without header
4. Split edge cases (file size is exactly a multiple of `file_split_size`)
5. Performance comparison with different `file_split_size` configurations
**Potential Risks**:
- **Risk 1**: Parquet and JSON split functionality not verified by E2E, may
fail in real environments
- **Risk 2**: Edge cases not tested, may cause issues under specific data
distributions
- **Risk 3**: Unable to evaluate actual performance improvement of split
functionality
**Impact Scope**:
- **Direct Impact**: Test coverage and functionality reliability
- **Affected Area**: Single Connector
**Severity**: MINOR
**Improvement Suggestions**:
```java
// S3FileWithFilterIT.java
@Test
public void testS3FileParquetEnableSplit() throws Exception {
    // Test Parquet file with multiple RowGroups
    // Verify split behavior and data correctness
}

@Test
public void testS3FileJsonEnableSplit() throws Exception {
    // Test JSON Lines file
    // Verify split behavior and data correctness
}

@Test
public void testS3FileTextNoHeaderEnableSplit() throws Exception {
    // Test text file without header
    // Verify no header is skipped
}

@Test
public void testS3FileSplitBoundaryCase() throws Exception {
    // Test file size exactly equals file_split_size
    // Verify no data duplication or loss at boundaries
}
```
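For the boundary case in particular, the assertion logic might look like this (a sketch; `createTextFileOfExactSize` and `generateSplits` are hypothetical helpers, and the `getLength` accessor is assumed):
```java
// Hypothetical sketch: when the file size is an exact multiple of
// file_split_size, no empty trailing split should appear and the splits
// must cover the file exactly once.
@Test
public void testS3FileSplitBoundaryCaseSketch() throws Exception {
    long splitSize = 1024 * 1024;
    Path file = createTextFileOfExactSize(2 * splitSize); // hypothetical helper
    List<FileSourceSplit> splits = generateSplits(file, splitSize); // hypothetical helper
    assertEquals(2, splits.size(), "exact multiple must not yield an empty extra split");
    long covered = splits.stream().mapToLong(FileSourceSplit::getLength).sum(); // assumed accessor
    assertEquals(2 * splitSize, covered, "splits must cover the file exactly once");
}
```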
**Rationale**: Improve test coverage to ensure all declared supported
formats and scenarios are verified by E2E.
---
### Issue 8: Documentation lacks detailed explanation of split limitations
**Location**: `docs/en/connectors/source/S3File.md` and
`docs/zh/connectors/source/S3File.md`
**Problem Description**:
The documentation mentions that only "non-compressed formats" support split, but does not explain in detail:
1. Which compression formats are unsupported? (gz, zip, bz2, lz4, snappy, etc.)
2. What happens if users enable split for compressed files? (Automatic fallback? An error?)
3. Does the snappy compression used internally by Parquet support split? (It should, because Parquet splitting is logical, by RowGroup)
4. Does split affect data order? (It may, because of parallel reading)
**Potential Risks**:
- **Risk 1**: Users may enable split for compressed files expecting a performance improvement, but see no effect
- **Risk 2**: Users don't understand why certain files cannot be split, causing confusion
- **Risk 3**: Order-dependent scenarios (e.g., log files) may have their data order changed by split
**Impact Scope**:
- **Direct Impact**: User understanding and configuration
- **Affected Area**: Single Connector's user experience
**Severity**: MINOR
**Improvement Suggestions**:
Add detailed explanation in documentation:
```markdown
### File Split Limitations
- **Supported formats**:
  - Text (plain text files)
  - CSV (with or without header)
  - JSON Lines (one JSON object per line)
  - Parquet (split by RowGroup)
- **Unsupported formats**:
  - Compressed text files (e.g., .gz, .bz2, .zip, .lz4) - split will be automatically disabled
  - Excel (.xlsx)
  - XML
  - Single-line JSON files (not JSON Lines)
- **Behavior with unsupported formats**:
  If you enable `enable_file_split` for an unsupported format, the system
  automatically falls back to non-splitting mode and emits a warning log.
- **Data ordering**:
  When file split is enabled, data may be read out of order across splits.
  If strict ordering is required, do not enable file split, or use a single-split strategy.
- **Parquet compression**:
  Parquet files with internal compression (e.g., Snappy, Gzip) are fully supported,
  because Parquet split is based on RowGroup boundaries, not byte ranges.
```
**Rationale**: Provide clear and complete limitation descriptions to help
users correctly understand and configure.
---
### Issue 9: Missing Metrics for split performance monitoring
**Location**: The entire PR adds no Metrics-related code
**Related Context**:
- SeaTunnel Metrics framework
- Related classes: `FileSourceReader`, `FileSourceSplitEnumerator`
**Problem Description**:
The PR does not add any metrics to monitor split behavior and performance, such as:
1. Actual number of splits generated
2. Size distribution of each split
3. Split read duration
4. Split failure rate
5. Frequency of split alignment adjustments
**Potential Risks**:
- **Risk 1**: Users cannot monitor whether split functionality is effective
- **Risk 2**: Unable to tune `file_split_size` parameter
- **Risk 3**: Unable to diagnose split-related performance issues
- **Risk 4**: Unable to evaluate actual effectiveness of split functionality
**Impact Scope**:
- **Direct Impact**: Observability and operational capabilities
- **Affected Area**: Single Connector
**Severity**: MINOR
**Improvement Suggestions**:
```java
// FileSourceReader.java (pseudocode)
private Counter splitCounter;
private Histogram splitSizeHistogram;
private Timer splitReadTimer;

public void open() {
    splitCounter = context.metricRegistry().counter("file.split.count");
    splitSizeHistogram = context.metricRegistry().histogram("file.split.size");
    splitReadTimer = context.metricRegistry().timer("file.split.read.time");
}

public void readNext() {
    Timer.Context timeContext = splitReadTimer.time();
    try {
        // ... reading logic ...
        splitCounter.inc();
        splitSizeHistogram.update(splitSize);
    } finally {
        timeContext.stop();
    }
}
```
**Rationale**: Provide observability to help users monitor and tune split
functionality.
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]