DanielCarter-stack commented on PR #10478:
URL: https://github.com/apache/seatunnel/pull/10478#issuecomment-3878744705
<!-- code-pr-reviewer -->
<!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10478", "part": 1,
"total": 1} -->
### Issue 1: Inconsistent metadata setting in ExcelReaderListener
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/excel/ExcelReaderListener.java:101`
```java
// Current code
seaTunnelRow.setTableId(tableId);
seaTunnelRow.setOptions(metadata); // Direct setting
// All other places are
applyRowMetadata(seaTunnelRow, tableId, metadata); // Through method
```
**Issue Description**:
ExcelReaderListener directly calls `setOptions(metadata)`, while all other
places use the `applyRowMetadata()` method. Although the effect is the same, it
breaks code consistency.
**Potential Risks**:
- If `applyRowMetadata()` requires additional logic in the future (such as
logging, validation), ExcelReaderListener will miss it
- Easily overlooked during maintenance
**Impact Scope**:
- Direct impact: ExcelReadStrategy
- Indirect impact: None
- Affected area: Single Connector
**Severity**: MINOR
**Improvement Suggestion**:
```java
// ExcelReaderListener.java:98-101
- seaTunnelRow.setTableId(tableId);
- seaTunnelRow.setOptions(metadata);
+ applyRowMetadata(seaTunnelRow, tableId, metadata);
output.collect(seaTunnelRow);
// Need to change applyRowMetadata method to public or protected, or add
this method in ExcelReaderListener
```
---
### Issue 2: Extra getFileStatus() call added on every file read
(Performance Issue)
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:590-604`
```java
protected Map<String, Object> buildFileMetadata(FileSourceSplit split,
String currentFileName)
throws IOException {
FileStatus fileStatus =
hadoopFileSystemProxy.getFileStatus(split.getFilePath()); // Additional RPC
call
...
}
```
**Issue Description**:
For distributed file systems such as HDFS and S3, `getFileStatus()` is an
RPC call that adds latency. For large file reads, this overhead may not be
obvious, but for scenarios with many small files, the performance impact is
significant.
**Potential Risks**:
- Significant performance degradation in scenarios with many small files
- May become a performance bottleneck
**Impact Scope**:
- Direct impact: All scenarios using file source
- Indirect impact: Entire pipeline throughput
- Affected area: All file connectors
**Severity**: CRITICAL
**Improvement Suggestion**:
```java
// Solution 1: Pre-fetch FileStatus when creating FileSourceSplit to avoid
duplicate calls
// Need to modify FileSourceSplit and FileSourceEnumerator
public class FileSourceSplit implements SourceSplit {
private final String tableId;
private final String filePath;
private final FileStatus fileStatus; // Add
...
}
// Solution 2: Provide a switch for users to choose whether metadata is
needed
if (isMetadataEnabled()) {
Map<String, Object> metadata = buildFileMetadata(split, currentFileName);
applyRowMetadata(seaTunnelRow, tableId, metadata);
} else {
seaTunnelRow.setTableId(tableId);
}
// Solution 3: Cache FileStatus
protected Map<String, Object> buildFileMetadata(FileSourceSplit split,
String currentFileName, FileStatus cachedStatus) {
FileStatus fileStatus = cachedStatus != null ? cachedStatus :
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
...
}
```
---
### Issue 3: Extra system call added for local file system
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:606-631`
```java
private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
...
if ("file".equalsIgnoreCase(scheme)) {
try {
URI uri = new Path(filePath).toUri();
BasicFileAttributes attributes =
Files.readAttributes(Paths.get(uri),
BasicFileAttributes.class); // Additional system call
...
} catch (Exception ignored) {
}
}
...
}
```
**Issue Description**:
For local file system, `getFileStatus()` is already called (underlying is
stat system call), then `Files.readAttributes()` is called again (stat again),
causing duplicate system calls.
**Potential Risks**:
- Local file system read performance degradation
- More noticeable in scenarios with many small files
**Impact Scope**:
- Direct impact: LocalFile connector
- Indirect impact: None
- Affected area: Single Connector
**Severity**: MAJOR
**Improvement Suggestion**:
```java
// Solution 1: Only call when user explicitly needs FileCreateTime
private Long resolveFileCreateTime(FileStatus fileStatus, String filePath,
boolean needCreateTime) {
if (!needCreateTime) {
return fileStatus.getModificationTime() > 0 ?
fileStatus.getModificationTime() : null;
}
...existing code...
}
// Solution 2: Use FileStatus API (if available)
// Note: Hadoop FileStatus may not have creation time, need to check
specific version
```
---
### Issue 4: resolveFileCreateTime() swallows all exceptions
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:606-631`
```java
private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
String scheme = null;
try {
scheme = hadoopFileSystemProxy.getFileSystem().getScheme();
} catch (Exception ignored) { // Swallow exception
}
if (StringUtils.isBlank(scheme)) {
scheme = new Path(filePath).toUri().getScheme();
}
if ("file".equalsIgnoreCase(scheme)) {
try {
...
} catch (Exception ignored) { // Swallow exception
}
}
...
}
```
**Issue Description**:
Both catch blocks catch all exceptions but do nothing, which hides potential
serious problems (such as insufficient permissions, unavailable file system,
etc.), making problems difficult to debug.
**Potential Risks**:
- Hides real errors, increasing debugging difficulty
- May return incorrect metadata values
**Impact Scope**:
- Direct impact: Metadata accuracy
- Indirect impact: Debugging and problem location
- Affected area: All file connectors
**Severity**: MAJOR
**Improvement Suggestion**:
```java
private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
String scheme = null;
try {
scheme = hadoopFileSystemProxy.getFileSystem().getScheme();
} catch (Exception e) {
// Use logger instead of ignoring
if (log.isDebugEnabled()) {
log.debug("Failed to get filesystem scheme, using path scheme
instead", e);
}
}
if (StringUtils.isBlank(scheme)) {
scheme = new Path(filePath).toUri().getScheme();
}
if ("file".equalsIgnoreCase(scheme)) {
try {
URI uri = new Path(filePath).toUri();
BasicFileAttributes attributes =
Files.readAttributes(Paths.get(uri),
BasicFileAttributes.class);
long created = attributes.creationTime().toMillis();
if (created > 0) {
return created;
}
} catch (Exception e) {
// Use logger to record, but still fallback
log.warn("Failed to get creation time for {}, fallback to
modification time", filePath, e);
}
}
...
}
```
---
### Issue 5: No logging when creation time falls back to modification time
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:628-630`
```java
// Fallback to modification time for filesystems without creation time.
long modificationTime = fileStatus.getModificationTime();
return modificationTime > 0 ? modificationTime : null;
```
**Issue Description**:
When the file system does not support creation time, the code falls back to
modification time, but there is no log record. Users may be confused why the
value of `FileCreateTime` is the same as `FileUpdateTime`.
**Potential Risks**:
- User confusion
- Difficult to troubleshoot problems
**Impact Scope**:
- Direct impact: User experience
- Indirect impact: Support cost
- Affected area: All file connectors
**Severity**: MINOR
**Improvement Suggestion**:
```java
// Fallback to modification time for filesystems without creation time.
long modificationTime = fileStatus.getModificationTime();
if (log.isDebugEnabled() && modificationTime > 0) {
log.debug("Filesystem does not support creation time, using modification
time for {}", split.getFilePath());
}
return modificationTime > 0 ? modificationTime : null;
```
---
### Issue 6: addMetadataColumnIfAbsent() uses O(n²) algorithm
**Location**:
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSourceConfig.java:201-207`
```java
private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn
column) {
for (Column existing : columns) { // O(n) traversal
if (existing.getName().equals(column.getName())) {
return;
}
}
columns.add(column); // List.add() O(1) but may trigger expansion
}
```
**Issue Description**:
Although there are only 5 fields and the impact is not significant, each
call needs to traverse the entire list, which is not a best practice.
**Potential Risks**:
- Code quality
- Maintainability
**Impact Scope**:
- Direct impact: Code quality
- Indirect impact: None
- Affected area: BaseFileSourceConfig
**Severity**: MINOR
**Improvement Suggestion**:
```java
// Solution 1: Use Set for deduplication
private Set<String> existingColumnNames = new HashSet<>();
private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn
column) {
if (existingColumnNames.add(column.getName())) { // Set.add() returns
false if already exists
columns.add(column);
}
}
// Solution 2: Use Stream
private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn
column) {
if (columns.stream().noneMatch(c ->
c.getName().equals(column.getName()))) {
columns.add(column);
}
}
```
---
### Issue 7: Lack of tests for different file systems
**Location**: Test files
**Issue Description**:
Unit tests only cover the local file system (`TextReadStrategyTest` uses
LocalConf), and there are no tests for HDFS, S3, SFTP, etc. Different file
systems may have different support for `getScheme()`, `FileStatus`.
**Potential Risks**:
- Metadata for HDFS, S3 and other file systems may be incorrect
- Problems are only discovered in E2E tests, increasing debugging costs
**Impact Scope**:
- Direct impact: Test coverage
- Indirect impact: Code quality
- Affected area: All non-local file systems
**Severity**: MAJOR
**Improvement Suggestion**:
```java
// Add tests for different file systems
@Test
public void testHdfsFileMetadata() {
// Use mock HDFS filesystem
...
}
@Test
public void testS3FileMetadata() {
// Use mock S3 filesystem
...
}
// At least test:
// 1. Behavior when getScheme() returns different values
// 2. Different values of FileStatus (e.g., modificationTime=0)
// 3. Cases where creation time parsing succeeds and fails
```
---
### Issue 8: Lack of unit tests for resolveFileCreateTime()
**Location**: Test files
**Issue Description**:
`resolveFileCreateTime()` is a complex private method involving multiple
conditional branches and exception handling, but there are no dedicated unit
tests.
**Potential Risks**:
- Logic errors are difficult to discover
- Edge cases not covered
**Impact Scope**:
- Direct impact: Test coverage
- Indirect impact: Code quality
- Affected area: AbstractReadStrategy
**Severity**: MAJOR
**Improvement Suggestion**:
```java
// Use reflection to test private methods, or change to package-private
@Test
public void testResolveFileCreateTimeWithLocalFile() throws Exception {
...
}
@Test
public void testResolveFileCreateTimeWithHdfsFile() {
...
}
@Test
public void testResolveFileCreateTimeWhenSchemeIsNull() {
...
}
@Test
public void testResolveFileCreateTimeWhenModificationTimeIsZero() {
...
}
```
---
### Issue 9: Lack of tests for exception scenarios
**Location**: Test files
**Issue Description**:
Tests only verify normal cases, no testing for:
- File does not exist
- Insufficient permissions
- FileStatus retrieval failure
- Invalid file paths
**Potential Risks**:
- Improper exception handling leads to crashes
- Poor user experience
**Impact Scope**:
- Direct impact: Test coverage
- Indirect impact: Stability
- Affected area: All ReadStrategy
**Severity**: MINOR
**Improvement Suggestion**:
```java
@Test
public void testBuildFileMetadataWhenFileNotExists() {
...
}
@Test
public void testBuildFileMetadataWhenPermissionDenied() {
...
}
```
---
### Issue 10: E2E tests only verify NOT_NULL
**Location**: E2E test configuration file
```hocon
field_value = [
{
rule_type = NOT_NULL
}
]
```
**Issue Description**:
E2E tests only verify that fields are not null, and do not verify the
correctness of values (such as whether file paths are correct, whether
timestamps are reasonable).
**Potential Risks**:
- Value errors but tests pass
- Problems are only discovered in production environment
**Impact Scope**:
- Direct impact: Test quality
- Indirect impact: Product quality
- Affected area: All E2E tests
**Severity**: MINOR
**Improvement Suggestion**:
```hocon
// Besides NOT_NULL, should also verify:
// 1. File path matches expectation
// 2. File size is greater than 0
// 3. Timestamp is reasonable (e.g., not negative, within reasonable range)
// 4. File type is correct
field_value = [
{
rule_type = NOT_NULL
},
{
rule_type = "REGEX_MATCH"
rule_value = ".*\\.txt" // Verify file path
}
]
```
---
### Issue 11: Documentation does not explicitly state which file systems
support creation time
**Location**: `docs/en/transforms/metadata.md:35`
```markdown
4. **File metadata**: `FileCreateTime` can be null depending on filesystem
or permissions; `FileType` is typically derived from the file extension.
```
**Issue Description**:
Documentation only says "can be null depending on filesystem", but does not
explicitly state:
- Which file systems support creation time (local with java.nio)
- Which file systems do not support it (HDFS, S3, etc. will fall back to
modification time)
**Potential Risks**:
- User confusion
- Unreasonable expectations
**Impact Scope**:
- Direct impact: User experience
- Indirect impact: Support cost
- Affected area: Documentation
**Severity**: MINOR
**Improvement Suggestion**:
```markdown
4. **File metadata**:
- `FileCreateTime`: Only supported on local filesystems (using
`java.nio.file`). For other filesystems (HDFS, S3, SFTP, FTP, COS), it falls
back to `FileUpdateTime`. Can be null if the underlying file metadata is
unavailable.
- `FileType`: Derived from the file extension (the part after the last
dot). Empty string if no extension.
```
---
### Issue 12: Lack of performance metrics
**Location**: All metadata-related code
**Issue Description**:
There is no monitoring of metadata construction performance (such as time
consumed, frequency), making it impossible to evaluate performance impact in
production environments.
**Potential Risks**:
- Performance problems are difficult to discover
- Unable to optimize
**Impact Scope**:
- Direct impact: Observability
- Indirect impact: Performance tuning
- Affected area: All file connectors
**Severity**: MINOR
**Improvement Suggestion**:
```java
protected Map<String, Object> buildFileMetadata(FileSourceSplit split,
String currentFileName)
throws IOException {
long start = System.nanoTime();
try {
FileStatus fileStatus =
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
...
} finally {
long duration = System.nanoTime() - start;
if (duration > 1_000_000) { // Exceeds 1ms
log.warn("Building file metadata took {} ms for {}", duration /
1_000_000, split.getFilePath());
}
}
}
```
---
### Issue 13: Lack of debug logging
**Location**: All metadata-related code
**Issue Description**:
There are no logs recording the metadata construction process, making it
difficult to track problems during debugging.
**Potential Risks**:
- Problems are difficult to locate
- Difficult to debug
**Impact Scope**:
- Direct impact: Observability
- Indirect impact: Debugging efficiency
- Affected area: All file connectors
**Severity**: MINOR
**Improvement Suggestion**:
```java
protected Map<String, Object> buildFileMetadata(FileSourceSplit split,
String currentFileName)
throws IOException {
if (log.isTraceEnabled()) {
log.trace("Building metadata for file: {}", currentFileName);
}
FileStatus fileStatus =
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
...
if (log.isDebugEnabled()) {
log.debug("Built metadata for file: {}, size: {}, updateTime: {}",
currentFileName, fileStatus.getLen(),
fileStatus.getModificationTime());
}
...
}
```
---
### Issue 14: SeaTunnelRow.options is transient and will not be serialized
to checkpoint
**Location**:
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/SeaTunnelRow.java:37`
```java
private Map<String, Object> options; // Default is transient
```
**Issue Description**:
`SeaTunnelRow.options` field is transient (according to Java serialization
rules, if not explicitly declared, but after reviewing the code, there is no
transient modifier, needs confirmation). If indeed not serialized, metadata
will be lost after checkpoint recovery.
**Potential Risks**:
- Metadata lost after checkpoint recovery
- Downstream operators that depend on metadata may error
**Impact Scope**:
- Direct impact: Checkpoint recovery
- Indirect impact: Data consistency
- Affected area: All scenarios using metadata
**Severity**: CRITICAL
**Confidence**: Low (needs further confirmation)
**Improvement Suggestion**:
```java
// Need to confirm SeaTunnelRow serialization mechanism
// If using custom serialization like Kryo, may not be affected
// If using Java serialization, need to:
private void writeObject(ObjectOutputStream out) throws IOException {
out.defaultWriteObject();
// Serialize options
}
private void readObject(ObjectInputStream in) throws IOException,
ClassNotFoundException {
in.defaultReadObject();
// Deserialize options
}
```
---
### Issue 15: Inconsistent naming of CommonOptions enum values
**Location**:
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:69-74`
```java
FILE_PATH("FilePath", true),
FILE_CREATE_TIME("FileCreateTime", true),
FILE_UPDATE_TIME("FileUpdateTime", true),
FILE_SIZE("FileSize", true),
FILE_TYPE("FileType", true);
```
**Issue Description**:
Enum names use uppercase underscore (`FILE_PATH`), but enum values use
PascalCase (`FilePath`), which is inconsistent with the style of other enum
constants (such as `IS_BINARY_FORMAT`). More importantly, metadata transform
uses lowercase underscore (such as `file_path`), which may cause confusion.
**Potential Risks**:
- User confusion
- Configuration errors
**Impact Scope**:
- Direct impact: User experience
- Indirect impact: Configuration
- Affected area: API
**Severity**: MINOR
**Improvement Suggestion**:
```java
// Maintain consistency, use the same naming style everywhere
// Documentation should clearly state:
// - CommonOptions enum name: FILE_PATH (Java code)
// - getName() return value: FilePath
// - Metadata transform configuration: file_path (configuration file)
```
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]