DanielCarter-stack commented on issue #10565:
URL: https://github.com/apache/seatunnel/issues/10565#issuecomment-4003697695

   This issue appears to be related to how `file_filter_pattern` interacts with 
`file_format_type=BINARY` in SftpFile.
   
   **Analysis of the root cause:**
   
   When `file_filter_pattern` matches zero files, the schema generation logic 
in `BaseFileSourceConfig.parseCatalogTable()` (lines 86-116) may not correctly 
handle the empty `filePaths` case for BINARY format. The source can then emit 
a schema that differs from the one the sink expects, which triggers the error 
you're seeing in `BinaryWriteStrategy.setCatalogTable()` (lines 54-60): that 
method validates that the incoming schema matches 
`BinaryReadStrategy.binaryRowType`.
   
   **Your current configuration may be causing the pattern to match zero 
files:**
   - Your `file_filter_pattern = "/opt/module/qingyang/.*\\.pdf"` hard-codes the 
full absolute directory prefix
   - The matching logic in `AbstractReadStrategy.filterFileByPattern()` (lines 
523-536) performs pattern matching against the full absolute path, so the 
pattern matches nothing if the path the connector sees differs from that 
prefix in any way
   - As a test, try the prefix-independent `file_filter_pattern = ".*\\.pdf"` 
instead
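   
   To check the pattern outside of SeaTunnel, here is a minimal sketch in plain 
Java. It is not the connector's code: `FilterPatternDemo` and `filterByPattern` 
are hypothetical names, and it assumes the filter requires the whole path to 
match the regex (`Matcher.matches()` semantics), which may differ from what 
`AbstractReadStrategy` actually does.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FilterPatternDemo {

    // Hypothetical helper, NOT SeaTunnel code: keeps only the paths that the
    // regex matches in full (assumed Matcher.matches() semantics).
    public static List<String> filterByPattern(List<String> paths, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return paths.stream()
                .filter(path -> pattern.matcher(path).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // In Java source, "\\." is the regex \. (a literal dot), the same regex
        // the quoted config value "...\\.pdf" is expected to produce.
        String absolutePattern = "/opt/module/qingyang/.*\\.pdf";
        String suffixPattern = ".*\\.pdf";

        // Example paths, made up for illustration.
        List<String> absolutePaths = List.of(
                "/opt/module/qingyang/a.pdf",
                "/opt/module/qingyang/sub/b.pdf",
                "/opt/module/qingyang/notes.txt");
        // The same file as it would look if reported without the leading prefix.
        List<String> relativePaths = List.of("qingyang/a.pdf");

        // Both patterns keep the two .pdf files when paths are absolute.
        System.out.println(filterByPattern(absolutePaths, absolutePattern));
        System.out.println(filterByPattern(absolutePaths, suffixPattern));

        // Only the suffix pattern still matches when the prefix differs.
        System.out.println(filterByPattern(relativePaths, absolutePattern));
        System.out.println(filterByPattern(relativePaths, suffixPattern));
    }
}
```

   If the absolute pattern returns an empty list for the paths shown in your 
logs while `.*\\.pdf` does not, the prefix is the likely culprit.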
   
   **To help diagnose this further, could you provide:**
   1. The full job configuration (sanitized)
   2. Whether the `/opt/module/qingyang` directory definitely contains `.pdf` 
files
   3. Logs showing the parsed configuration and any warnings about filtered 
files
   
   **Related code locations:**
   - `connector-file-base/.../BaseFileSourceConfig.java:86-116` - schema 
parsing logic
   - `connector-file-base/.../BinaryWriteStrategy.java:54-60` - schema 
validation
   - `connector-file-base/.../AbstractReadStrategy.java:523-536` - file 
filtering


