Re: [PR] [Feature][Connector-V2][File-Source]Support metadata column for file source connector [seatunnel]

via GitHub Tue, 10 Feb 2026 07:48:03 -0800


DanielCarter-stack commented on PR #10478:
URL: https://github.com/apache/seatunnel/pull/10478#issuecomment-3878744705


   <!-- code-pr-reviewer -->
   <!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10478", "part": 1, 
"total": 1} -->
   ### Issue 1: Inconsistent metadata setting in ExcelReaderListener
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/excel/ExcelReaderListener.java:101`
   
   ```java
   // Current code
   seaTunnelRow.setTableId(tableId);
   seaTunnelRow.setOptions(metadata);  // Direct setting
   
   // All other places are
   applyRowMetadata(seaTunnelRow, tableId, metadata);  // Through method
   ```
   
   **Issue Description**:
   ExcelReaderListener directly calls `setOptions(metadata)`, while all other 
places use the `applyRowMetadata()` method. Although the effect is the same, it 
breaks code consistency.
   
   **Potential Risks**:
   - If `applyRowMetadata()` requires additional logic in the future (such as 
logging, validation), ExcelReaderListener will miss it
   - Easily overlooked during maintenance
   
   **Impact Scope**:
   - Direct impact: ExcelReadStrategy
   - Indirect impact: None
   - Affected area: Single Connector
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   // ExcelReaderListener.java:98-101
   - seaTunnelRow.setTableId(tableId);
   - seaTunnelRow.setOptions(metadata);
   + applyRowMetadata(seaTunnelRow, tableId, metadata);
   output.collect(seaTunnelRow);
   
   // Need to change applyRowMetadata method to public or protected, or add 
this method in ExcelReaderListener
   ```
   
   ---
   
   ### Issue 2: Extra getFileStatus() call added on every file read 
(Performance Issue)
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:590-604`
   
   ```java
   protected Map<String, Object> buildFileMetadata(FileSourceSplit split, 
String currentFileName)
           throws IOException {
       FileStatus fileStatus = 
hadoopFileSystemProxy.getFileStatus(split.getFilePath());  // Additional RPC 
call
       ...
   }
   ```
   
   **Issue Description**:
   For distributed file systems such as HDFS and S3, `getFileStatus()` is an 
RPC call that adds latency. For large file reads, this overhead may not be 
obvious, but for scenarios with many small files, the performance impact is 
significant.
   
   **Potential Risks**:
   - Significant performance degradation in scenarios with many small files
   - May become a performance bottleneck
   
   **Impact Scope**:
   - Direct impact: All scenarios using file source
   - Indirect impact: Entire pipeline throughput
   - Affected area: All file connectors
   
   **Severity**: CRITICAL
   
   **Improvement Suggestion**:
   ```java
   // Solution 1: Pre-fetch FileStatus when creating FileSourceSplit to avoid 
duplicate calls
   // Need to modify FileSourceSplit and FileSourceEnumerator
   public class FileSourceSplit implements SourceSplit {
       private final String tableId;
       private final String filePath;
       private final FileStatus fileStatus;  // Add
       ...
   }
   
   // Solution 2: Provide a switch for users to choose whether metadata is 
needed
   if (isMetadataEnabled()) {
       Map<String, Object> metadata = buildFileMetadata(split, currentFileName);
       applyRowMetadata(seaTunnelRow, tableId, metadata);
   } else {
       seaTunnelRow.setTableId(tableId);
   }
   
   // Solution 3: Cache FileStatus
   protected Map<String, Object> buildFileMetadata(FileSourceSplit split, 
String currentFileName, FileStatus cachedStatus) {
       FileStatus fileStatus = cachedStatus != null ? cachedStatus : 
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
       ...
   }
   ```
   
   ---
   
   ### Issue 3: Extra system call added for local file system
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:606-631`
   
   ```java
   private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
       ...
       if ("file".equalsIgnoreCase(scheme)) {
           try {
               URI uri = new Path(filePath).toUri();
               BasicFileAttributes attributes =
                       Files.readAttributes(Paths.get(uri), 
BasicFileAttributes.class);  // Additional system call
               ...
           } catch (Exception ignored) {
           }
       }
       ...
   }
   ```
   
   **Issue Description**:
   For local file system, `getFileStatus()` is already called (underlying is 
stat system call), then `Files.readAttributes()` is called again (stat again), 
causing duplicate system calls.
   
   **Potential Risks**:
   - Local file system read performance degradation
   - More noticeable in scenarios with many small files
   
   **Impact Scope**:
   - Direct impact: LocalFile connector
   - Indirect impact: None
   - Affected area: Single Connector
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   // Solution 1: Only call when user explicitly needs FileCreateTime
   private Long resolveFileCreateTime(FileStatus fileStatus, String filePath, 
boolean needCreateTime) {
       if (!needCreateTime) {
           return fileStatus.getModificationTime() > 0 ? 
fileStatus.getModificationTime() : null;
       }
       ...existing code...
   }
   
   // Solution 2: Use FileStatus API (if available)
   // Note: Hadoop FileStatus may not have creation time, need to check 
specific version
   ```
   
   ---
   
   ### Issue 4: resolveFileCreateTime() swallows all exceptions
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:606-631`
   
   ```java
   private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
       String scheme = null;
       try {
           scheme = hadoopFileSystemProxy.getFileSystem().getScheme();
       } catch (Exception ignored) {  // Swallow exception
       }
       if (StringUtils.isBlank(scheme)) {
           scheme = new Path(filePath).toUri().getScheme();
       }
       if ("file".equalsIgnoreCase(scheme)) {
           try {
               ...
           } catch (Exception ignored) {  // Swallow exception
           }
       }
       ...
   }
   ```
   
   **Issue Description**:
   Both catch blocks catch all exceptions but do nothing, which hides potential 
serious problems (such as insufficient permissions, unavailable file system, 
etc.), making problems difficult to debug.
   
   **Potential Risks**:
   - Hides real errors, increasing debugging difficulty
   - May return incorrect metadata values
   
   **Impact Scope**:
   - Direct impact: Metadata accuracy
   - Indirect impact: Debugging and problem location
   - Affected area: All file connectors
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   private Long resolveFileCreateTime(FileStatus fileStatus, String filePath) {
       String scheme = null;
       try {
           scheme = hadoopFileSystemProxy.getFileSystem().getScheme();
       } catch (Exception e) {
           // Use logger instead of ignoring
           if (log.isDebugEnabled()) {
               log.debug("Failed to get filesystem scheme, using path scheme 
instead", e);
           }
       }
       if (StringUtils.isBlank(scheme)) {
           scheme = new Path(filePath).toUri().getScheme();
       }
       if ("file".equalsIgnoreCase(scheme)) {
           try {
               URI uri = new Path(filePath).toUri();
               BasicFileAttributes attributes =
                       Files.readAttributes(Paths.get(uri), 
BasicFileAttributes.class);
               long created = attributes.creationTime().toMillis();
               if (created > 0) {
                   return created;
               }
           } catch (Exception e) {
               // Use logger to record, but still fallback
               log.warn("Failed to get creation time for {}, fallback to 
modification time", filePath, e);
           }
       }
       ...
   }
   ```
   
   ---
   
   ### Issue 5: No logging when creation time falls back to modification time
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/AbstractReadStrategy.java:628-630`
   
   ```java
   // Fallback to modification time for filesystems without creation time.
   long modificationTime = fileStatus.getModificationTime();
   return modificationTime > 0 ? modificationTime : null;
   ```
   
   **Issue Description**:
   When the file system does not support creation time, the code falls back to 
modification time, but there is no log record. Users may be confused why the 
value of `FileCreateTime` is the same as `FileUpdateTime`.
   
   **Potential Risks**:
   - User confusion
   - Difficult to troubleshoot problems
   
   **Impact Scope**:
   - Direct impact: User experience
   - Indirect impact: Support cost
   - Affected area: All file connectors
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   // Fallback to modification time for filesystems without creation time.
   long modificationTime = fileStatus.getModificationTime();
   if (log.isDebugEnabled() && modificationTime > 0) {
       log.debug("Filesystem does not support creation time, using modification 
time for {}", split.getFilePath());
   }
   return modificationTime > 0 ? modificationTime : null;
   ```
   
   ---
   
   ### Issue 6: addMetadataColumnIfAbsent() uses O(n²) algorithm
   
   **Location**: 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSourceConfig.java:201-207`
   
   ```java
   private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn 
column) {
       for (Column existing : columns) {  // O(n) traversal
           if (existing.getName().equals(column.getName())) {
               return;
           }
       }
       columns.add(column);  // List.add() O(1) but may trigger expansion
   }
   ```
   
   **Issue Description**:
   Although there are only 5 fields and the impact is not significant, each 
call needs to traverse the entire list, which is not a best practice.
   
   **Potential Risks**:
   - Code quality
   - Maintainability
   
   **Impact Scope**:
   - Direct impact: Code quality
   - Indirect impact: None
   - Affected area: BaseFileSourceConfig
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   // Solution 1: Use Set for deduplication
   private Set<String> existingColumnNames = new HashSet<>();
   private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn 
column) {
       if (existingColumnNames.add(column.getName())) {  // Set.add() returns 
false if already exists
           columns.add(column);
       }
   }
   
   // Solution 2: Use Stream
   private void addMetadataColumnIfAbsent(List<Column> columns, MetadataColumn 
column) {
       if (columns.stream().noneMatch(c -> 
c.getName().equals(column.getName()))) {
           columns.add(column);
       }
   }
   ```
   
   ---
   
   ### Issue 7: Lack of tests for different file systems
   
   **Location**: Test files
   
   **Issue Description**:
   Unit tests only cover the local file system (`TextReadStrategyTest` uses 
LocalConf), and there are no tests for HDFS, S3, SFTP, etc. Different file 
systems may have different support for `getScheme()`, `FileStatus`.
   
   **Potential Risks**:
   - Metadata for HDFS, S3 and other file systems may be incorrect
   - Problems are only discovered in E2E tests, increasing debugging costs
   
   **Impact Scope**:
   - Direct impact: Test coverage
   - Indirect impact: Code quality
   - Affected area: All non-local file systems
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   // Add tests for different file systems
   @Test
   public void testHdfsFileMetadata() {
       // Use mock HDFS filesystem
       ...
   }
   
   @Test
   public void testS3FileMetadata() {
       // Use mock S3 filesystem
       ...
   }
   
   // At least test:
   // 1. Behavior when getScheme() returns different values
   // 2. Different values of FileStatus (e.g., modificationTime=0)
   // 3. Cases where creation time parsing succeeds and fails
   ```
   
   ---
   
   ### Issue 8: Lack of unit tests for resolveFileCreateTime()
   
   **Location**: Test files
   
   **Issue Description**:
   `resolveFileCreateTime()` is a complex private method involving multiple 
conditional branches and exception handling, but there are no dedicated unit 
tests.
   
   **Potential Risks**:
   - Logic errors are difficult to discover
   - Edge cases not covered
   
   **Impact Scope**:
   - Direct impact: Test coverage
   - Indirect impact: Code quality
   - Affected area: AbstractReadStrategy
   
   **Severity**: MAJOR
   
   **Improvement Suggestion**:
   ```java
   // Use reflection to test private methods, or change to package-private
   @Test
   public void testResolveFileCreateTimeWithLocalFile() throws Exception {
       ...
   }
   
   @Test
   public void testResolveFileCreateTimeWithHdfsFile() {
       ...
   }
   
   @Test
   public void testResolveFileCreateTimeWhenSchemeIsNull() {
       ...
   }
   
   @Test
   public void testResolveFileCreateTimeWhenModificationTimeIsZero() {
       ...
   }
   ```
   
   ---
   
   ### Issue 9: Lack of tests for exception scenarios
   
   **Location**: Test files
   
   **Issue Description**:
   Tests only verify normal cases, no testing for:
   - File does not exist
   - Insufficient permissions
   - FileStatus retrieval failure
   - Invalid file paths
   
   **Potential Risks**:
   - Improper exception handling leads to crashes
   - Poor user experience
   
   **Impact Scope**:
   - Direct impact: Test coverage
   - Indirect impact: Stability
   - Affected area: All ReadStrategy
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   @Test
   public void testBuildFileMetadataWhenFileNotExists() {
       ...
   }
   
   @Test
   public void testBuildFileMetadataWhenPermissionDenied() {
       ...
   }
   ```
   
   ---
   
   ### Issue 10: E2E tests only verify NOT_NULL
   
   **Location**: E2E test configuration file
   
   ```hocon
   field_value = [
     {
       rule_type = NOT_NULL
     }
   ]
   ```
   
   **Issue Description**:
   E2E tests only verify that fields are not null, and do not verify the 
correctness of values (such as whether file paths are correct, whether 
timestamps are reasonable).
   
   **Potential Risks**:
   - Value errors but tests pass
   - Problems are only discovered in production environment
   
   **Impact Scope**:
   - Direct impact: Test quality
   - Indirect impact: Product quality
   - Affected area: All E2E tests
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```hocon
   // Besides NOT_NULL, should also verify:
   // 1. File path matches expectation
   // 2. File size is greater than 0
   // 3. Timestamp is reasonable (e.g., not negative, within reasonable range)
   // 4. File type is correct
   field_value = [
     {
       rule_type = NOT_NULL
     },
     {
       rule_type = "REGEX_MATCH"
       rule_value = ".*\\.txt"  // Verify file path
     }
   ]
   ```
   
   ---
   
   ### Issue 11: Documentation does not explicitly state which file systems 
support creation time
   
   **Location**: `docs/en/transforms/metadata.md:35`
   
   ```markdown
   4. **File metadata**: `FileCreateTime` can be null depending on filesystem 
or permissions; `FileType` is typically derived from the file extension.
   ```
   
   **Issue Description**:
   Documentation only says "can be null depending on filesystem", but does not 
explicitly state:
   - Which file systems support creation time (local with java.nio)
   - Which file systems do not support it (HDFS, S3, etc. will fall back to 
modification time)
   
   **Potential Risks**:
   - User confusion
   - Unreasonable expectations
   
   **Impact Scope**:
   - Direct impact: User experience
   - Indirect impact: Support cost
   - Affected area: Documentation
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```markdown
   4. **File metadata**: 
      - `FileCreateTime`: Only supported on local filesystems (using 
`java.nio.file`). For other filesystems (HDFS, S3, SFTP, FTP, COS), it falls 
back to `FileUpdateTime`. Can be null if the underlying file metadata is 
unavailable.
      - `FileType`: Derived from the file extension (the part after the last 
dot). Empty string if no extension.
   ```
   
   ---
   
   ### Issue 12: Lack of performance metrics
   
   **Location**: All metadata-related code
   
   **Issue Description**:
   There is no monitoring of metadata construction performance (such as time 
consumed, frequency), making it impossible to evaluate performance impact in 
production environments.
   
   **Potential Risks**:
   - Performance problems are difficult to discover
   - Unable to optimize
   
   **Impact Scope**:
   - Direct impact: Observability
   - Indirect impact: Performance tuning
   - Affected area: All file connectors
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   protected Map<String, Object> buildFileMetadata(FileSourceSplit split, 
String currentFileName)
           throws IOException {
       long start = System.nanoTime();
       try {
           FileStatus fileStatus = 
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
           ...
       } finally {
           long duration = System.nanoTime() - start;
           if (duration > 1_000_000) {  // Exceeds 1ms
               log.warn("Building file metadata took {} ms for {}", duration / 
1_000_000, split.getFilePath());
           }
       }
   }
   ```
   
   ---
   
   ### Issue 13: Lack of debug logging
   
   **Location**: All metadata-related code
   
   **Issue Description**:
   There are no logs recording the metadata construction process, making it 
difficult to track problems during debugging.
   
   **Potential Risks**:
   - Problems are difficult to locate
   - Difficult to debug
   
   **Impact Scope**:
   - Direct impact: Observability
   - Indirect impact: Debugging efficiency
   - Affected area: All file connectors
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   protected Map<String, Object> buildFileMetadata(FileSourceSplit split, 
String currentFileName)
           throws IOException {
       if (log.isTraceEnabled()) {
           log.trace("Building metadata for file: {}", currentFileName);
       }
       FileStatus fileStatus = 
hadoopFileSystemProxy.getFileStatus(split.getFilePath());
       ...
       if (log.isDebugEnabled()) {
           log.debug("Built metadata for file: {}, size: {}, updateTime: {}", 
               currentFileName, fileStatus.getLen(), 
fileStatus.getModificationTime());
       }
       ...
   }
   ```
   
   ---
   
   ### Issue 14: SeaTunnelRow.options is transient and will not be serialized 
to checkpoint
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/SeaTunnelRow.java:37`
   
   ```java
   private Map<String, Object> options;  // Default is transient
   ```
   
   **Issue Description**:
   `SeaTunnelRow.options` field is transient (according to Java serialization 
rules, if not explicitly declared, but after reviewing the code, there is no 
transient modifier, needs confirmation). If indeed not serialized, metadata 
will be lost after checkpoint recovery.
   
   **Potential Risks**:
   - Metadata lost after checkpoint recovery
   - Downstream operators that depend on metadata may error
   
   **Impact Scope**:
   - Direct impact: Checkpoint recovery
   - Indirect impact: Data consistency
   - Affected area: All scenarios using metadata
   
   **Severity**: CRITICAL
   
   **Confidence**: Low (needs further confirmation)
   
   **Improvement Suggestion**:
   ```java
   // Need to confirm SeaTunnelRow serialization mechanism
   // If using custom serialization like Kryo, may not be affected
   // If using Java serialization, need to:
   private void writeObject(ObjectOutputStream out) throws IOException {
       out.defaultWriteObject();
       // Serialize options
   }
   
   private void readObject(ObjectInputStream in) throws IOException, 
ClassNotFoundException {
       in.defaultReadObject();
       // Deserialize options
   }
   ```
   
   ---
   
   ### Issue 15: Inconsistent naming of CommonOptions enum values
   
   **Location**: 
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:69-74`
   
   ```java
   FILE_PATH("FilePath", true),
   FILE_CREATE_TIME("FileCreateTime", true),
   FILE_UPDATE_TIME("FileUpdateTime", true),
   FILE_SIZE("FileSize", true),
   FILE_TYPE("FileType", true);
   ```
   
   **Issue Description**:
   Enum names use uppercase underscore (`FILE_PATH`), but enum values use 
PascalCase (`FilePath`), which is inconsistent with the style of other enum 
constants (such as `IS_BINARY_FORMAT`). More importantly, metadata transform 
uses lowercase underscore (such as `file_path`), which may cause confusion.
   
   **Potential Risks**:
   - User confusion
   - Configuration errors
   
   **Impact Scope**:
   - Direct impact: User experience
   - Indirect impact: Configuration
   - Affected area: API
   
   **Severity**: MINOR
   
   **Improvement Suggestion**:
   ```java
   // Maintain consistency, use the same naming style everywhere
   // Documentation should clearly state:
   // - CommonOptions enum name: FILE_PATH (Java code)
   // - getName() return value: FilePath
   // - Metadata transform configuration: file_path (configuration file)
   ```
   
   ---


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Feature][Connector-V2][File-Source]Support metadata column for file source connector [seatunnel]

Reply via email to