DanielCarter-stack commented on PR #10505:
URL: https://github.com/apache/seatunnel/pull/10505#issuecomment-3940694029
<!-- code-pr-reviewer -->
<!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10505", "part": 1,
"total": 1} -->
### Issue 1: seatunnel.sh script changes don't match PR theme
**Location**:
`seatunnel-core/seatunnel-starter/src/main/bin/seatunnel.sh:44-55`
**Changes**:
```bash
# Original Code
-if [ $# == 0 ]; then
- set -- -h
-fi
-args=("$@")
-args_str=" $* "
# After Modification
+if [ $# == 0 ]
+then
+ args="-h"
+else
+ args=$@
+fi
```
**Problem Description**:
1. This change is **completely unrelated** to the PR title "Add
recursive_file_scan option"
2. Modified parameter passing method:
- Original: Uses array `args=("$@")`
- Modified: Uses string `args=$@`
3. Modified parameter matching logic:
- Original: `[[ "$args_str" == *" -m local "* ]]` (matches with spaces
before and after)
- Modified: `[[ $args == *" -m local"* ]]` (matches with space before,
none after)
**Potential Risks**:
- **Risk 1 (High)**: May fail when parameters contain spaces
```bash
# Scenario
seatunnel.sh -c "my config.conf" # 配置文件路径包含空格
# Original Code
args=("my" "config.conf") # 正确分割为数组
"${args[@]}" # 正确展开为 "my" "config.conf"
# After Modification
args="my config.conf" # 单个字符串
$args # 可能错误展开
```
- **风险 2 (中)**: 参数匹配模式变松,可能误匹配
```bash
# Scenario
seatunnel.sh -m local2 # 应该不匹配 "local"
# Original
[[ "$args_str" == *" -m local "* ]] # 不匹配 (需要前后空格)
# After Modification
[[ $args == *" -m local"* ]] # 误匹配 "local2"
```
- **风险 3 (中)**: 最后的 java 命令调用方式变更
```bash
# Original
java ... "${args[@]}" # 数组展开,正确处理空格和引号
# After Modification
java ... $args # 字符串展开,可能导致 word splitting
```
** Confidence**: High (confirmed through code review)
** Impact Scope**:
- 直接影响: 所有使用 `seatunnel.sh` 脚本的用户
- 间接影响: CI/CD 流程、自动化脚本
- 影响面: 核心启动脚本 (所有连接器)
** Severity**: **CRITICAL** (blocks merging)
** Improvement Suggestions**:
** Option 1**: Remove this change and create an independent PR
```bash
# Restore original code
if [ $# == 0 ]; then
set -- -h
fi
args=("$@")
args_str=" $* "
# ...
java ${JAVA_OPTS} -cp ${CLASS_PATH} ${APP_MAIN} "${args[@]}"
```
** Option 2**: If this change is truly necessary, need to:
1. 在 PR 描述中说明原因
2. 添加充分的测试
3. 修复参数匹配逻辑,避免误匹配:
```bash
if [[ $args =~ (^-m local |--master local)[[:space:]] || $args =~
[[:space:]](-e local |--deploy-mode local)$ ]]; then
```
** Recommendation**: **Option 1** - Remove this change from this PR and
submit as an independent Bug Fix PR.
---
# ## Issue 2: Insufficient Test Coverage - Missing Default Value and
Boundary Condition Tests
** Location**: `AbstractReadStrategyTest.java:429-464`
** Issue Description**:
当前测试仅覆盖了 `recursive_file_scan=false` 的显式配置场景,缺少以下关键测试:
1. **默认值测试**: 未验证不配置时默认为 `true`
2. **显式 `true` 测试**: 未验证 `recursive_file_scan=true` 的显式配置
3. **空目录场景**: 未验证 `recursive_file_scan=false` + 空目录的行为
4. **只有子目录场景**: 未验证顶层无文件、只有子目录时的行为
5. **多层嵌套场景**: 未验证 `recursive_file_scan=false` 时多层目录的行为
6. **与其他配置的组合测试**:
- 与 `file_filter_pattern` 组合
- 与 `read_partitions` 组合
- 与 `sync_mode=update` 组合
** Confidence**: High (confirmed through test code review)
** Potential Risks**:
- **风险 1**: 未来重构可能意外破坏默认值行为
- **风险 2**: 用户在复杂场景下遇到未预期的行为
- **风险 3**: 回归测试不足,未来引入 Bug
** Impact Scope**:
- 直接影响: 测试覆盖率
- 间接影响: 代码质量保障
- 影响面: 单个测试文件
** Severity**: **MAJOR** (recommended to fix)
** Improvement Suggestions**:
```java
@DisabledOnOs(OS.WINDOWS)
@Test
public void testRecursiveFileScanDefault() throws Exception {
// Test default value (don't configure recursive_file_scan)
String baseDir = "/tmp/test_default";
String file1 = baseDir + "/file1.txt";
String subdirFile = baseDir + "/subdir/file2.txt";
try {
createTestFiles(file1, subdirFile);
Map<String, Object> config = new HashMap<>();
config.put(FileBaseSourceOptions.FILE_PATH.key(), baseDir);
config.put(FileBaseSourceOptions.FILE_FORMAT_TYPE.key(), "text");
// Note: don't configure recursive_file_scan
Config pluginConfig = ConfigFactory.parseMap(config);
try (TextReadStrategy strategy = new TextReadStrategy()) {
ParquetReadStrategyTest.LocalConf localConf =
new
ParquetReadStrategyTest.LocalConf(FS_DEFAULT_NAME_DEFAULT);
strategy.init(localConf);
strategy.setPluginConfig(pluginConfig);
List<String> fileNames = strategy.getFileNamesByPath(baseDir);
// Verify default behavior is recursive scanning
Assertions.assertEquals(2, fileNames.size());
Assertions.assertTrue(fileNames.stream().anyMatch(f ->
f.contains("subdir")));
}
} finally {
deleteTestDirectory(baseDir);
}
}
@DisabledOnOs(OS.WINDOWS)
@Test
public void testNonRecursiveEmptyDirectory() throws Exception {
// Test empty directory scenario
String baseDir = "/tmp/test_empty";
String subdirDir = baseDir + "/subdir";
try {
Configuration conf = new Configuration();
Path path = new Path(subdirDir);
org.apache.hadoop.fs.FileSystem fs = path.getFileSystem(conf);
fs.mkdirs(path); // Create empty subdirectory
Map<String, Object> config = new HashMap<>();
config.put(FileBaseSourceOptions.FILE_PATH.key(), baseDir);
config.put(FileBaseSourceOptions.FILE_FORMAT_TYPE.key(), "text");
config.put(FileBaseSourceOptions.RECURSIVE_FILE_SCAN.key(), false);
Config pluginConfig = ConfigFactory.parseMap(config);
try (TextReadStrategy strategy = new TextReadStrategy()) {
ParquetReadStrategyTest.LocalConf localConf =
new
ParquetReadStrategyTest.LocalConf(FS_DEFAULT_NAME_DEFAULT);
strategy.init(localConf);
strategy.setPluginConfig(pluginConfig);
List<String> fileNames = strategy.getFileNamesByPath(baseDir);
// Verify empty directory returns empty list
Assertions.assertEquals(0, fileNames.size());
}
} finally {
deleteTestDirectory(baseDir);
}
}
@DisabledOnOs(OS.WINDOWS)
@Test
public void testNonRecursiveWithOnlySubdirectories() throws Exception {
// Test scenario with only subdirectories and no top-level files
String baseDir = "/tmp/test_only_subdir";
String subdirFile1 = baseDir + "/subdir1/file1.txt";
String subdirFile2 = baseDir + "/subdir2/file2.txt";
try {
createTestFiles(subdirFile1, subdirFile2);
Map<String, Object> config = new HashMap<>();
config.put(FileBaseSourceOptions.FILE_PATH.key(), baseDir);
config.put(FileBaseSourceOptions.FILE_FORMAT_TYPE.key(), "text");
config.put(FileBaseSourceOptions.RECURSIVE_FILE_SCAN.key(), false);
Config pluginConfig = ConfigFactory.parseMap(config);
try (TextReadStrategy strategy = new TextReadStrategy()) {
ParquetReadStrategyTest.LocalConf localConf =
new
ParquetReadStrategyTest.LocalConf(FS_DEFAULT_NAME_DEFAULT);
strategy.init(localConf);
strategy.setPluginConfig(pluginConfig);
List<String> fileNames = strategy.getFileNamesByPath(baseDir);
// Verify subdirectories are not scanned, return empty list
Assertions.assertEquals(0, fileNames.size());
}
} finally {
deleteTestDirectory(baseDir);
}
}
```
---
# ## Issue 3: Documentation Lacks Usage Examples and Interaction Instructions
** Location**:
- `docs/en/connectors/source/LocalFile.md:497-500`
- `docs/zh/connectors/source/LocalFile.md:497-502`
- (其他 8 个连接器文档类似)
** Issue Description**:
当前文档仅包含基本的配置说明,缺少:
1. **实际使用示例**: 未展示如何在真实场景中使用此配置
2. **性能影响说明**: 未提及非递归模式的性能优势
3. **与其他配置的交互**:
- 与隐藏目录 (`.hive-staging_hive`) 的关系
- 与 `read_partitions` 的组合使用
- 与 `file_filter_pattern` 的组合使用
4. **迁移指南**: 未说明现有用户如何评估是否需要调整配置
** Confidence**: High (confirmed through documentation review)
** Potential Risks**:
- **风险 1**: 用户不知道此功能的存在和价值
- **风险 2**: 用户可能误用配置 (如期望提升性能但设置错误)
- **风险 3**: 用户不清楚配置间的优先级和交互
** Impact Scope**:
- 直接影响: 所有文件连接器用户
- 间接影响: 支持工单量
- 影响面: 所有 9 个文件连接器文档
** Severity**: **MAJOR** (recommended to improve)
** Improvement Suggestions**:
在各个连接器文档中添加详细的说明章节:
```markdown
### recursive_file_scan [boolean]
Whether to scan subdirectories recursively. If `false`, subdirectories will
be ignored.
**Default value**: `true` (maintains backward compatibility)
**Behavior**:
- When `true`: Recursively scans all subdirectories (default behavior)
- When `false`: Only scans files in the specified directory, ignoring all
subdirectories
**Note**: Hidden directories (starting with `.`) are always skipped,
regardless of this setting.
This includes directories like `.hive-staging_hive` which are used by Hive
as temporary directories.
**Performance Considerations**:
Setting `recursive_file_scan` to `false` can significantly improve
performance when:
- You only need files from the top-level directory
- The directory structure is very deep or has many subdirectories
- You want to reduce the initial file discovery time
**Examples**:
**Example 1: Scan only top-level files**
```hocon
source {
LocalFile {
path = "/data/daily"
recursive_file_scan = false # Only read files in /data/daily
file_format_type = "text"
}
}
```
Given this directory structure:
```
/data/daily/
├── file1.txt
├── file2.txt
└── archive/
└── old_data.txt
```
Result: Only `file1.txt` and `file2.txt` will be read.
**Example 2: Recursively scan all files (default)**
```hocon
source {
LocalFile {
path = "/data/daily"
recursive_file_scan = true # or omit this line
file_format_type = "text"
}
}
```
Result: All files in `/data/daily` and its subdirectories will be read.
**Example 3: Combining with partition filtering**
```hocon
source {
LocalFile {
path = "/data"
recursive_file_scan = false # Only discover files in /data
read_partitions = ["dt=20240101"] # Then filter by partition
file_format_type = "parquet"
}
}
```
**Example 4: Combining with file pattern filter**
```hocon
source {
LocalFile {
path = "/data"
recursive_file_scan = false
file_filter_pattern = ".*\\.csv$" # Only CSV files
file_format_type = "csv"
}
}
```
**Migration Guide**:
For existing jobs:
- No action required if you want to keep the current behavior (recursive
scanning)
- To improve performance by only scanning top-level files, add
`recursive_file_scan = false`
- Test the change in a non-production environment first
**See Also**:
- [file_filter_pattern](#file_filter_pattern-regex)
- [read_partitions](#read_partitions-string-list)
```
---
# ## Issue 4: Field Access Control Lacks JavaDoc Explanation
** Location**: `AbstractReadStrategy.java:126`
** Issue Description**:
```java
public abstract class AbstractReadStrategy implements ReadStrategy {
// ...
protected boolean recursiveFileScan = true;
// ...
}
```
字段 `recursiveFileScan` 被声明为 `protected`,这意味着:
- 所有子类都可以访问和修改此字段
- 子类可能意外覆盖此字段,导致行为异常
** Confidence**: Medium (through code pattern analysis)
** Potential Risks**:
- **风险 1**: 子类可能在构造函数中初始化此字段,覆盖默认值
- **风险 2**: 子类可能在运行时修改此字段,导致并发问题
- **风险 3**: 缺少 JavaDoc 可能导致子类开发者误用此字段
** Impact Scope**:
- 直接影响: 所有 `AbstractReadStrategy` 的子类 (约 10 个)
- 间接影响: 继承层级中的字段可见性
- 影响面: connector-file 模块
** Severity**: **MINOR** (optional improvement)
** Improvement Suggestions**:
** Option 1**: Add JavaDoc comments
```java
/**
* Controls whether to recursively scan subdirectories when collecting file
names.
*
* <p>This field is set during {@link #setPluginConfig(Config)} based on the
* {@link FileBaseSourceOptions#RECURSIVE_FILE_SCAN} configuration option.
*
* <p><b>Default value:</b> {@code true} (recursive scanning enabled)
*
* <p><b>Note for subclasses:</b> This field should not be modified directly.
* Subclasses should rely on the configuration mechanism in {@code
setPluginConfig()}
* to control this behavior. Modifying this field at runtime may lead to
unexpected
* behavior, especially in multi-threaded scenarios.
*
* @see FileBaseSourceOptions#RECURSIVE_FILE_SCAN
* @see #setPluginConfig(Config)
*/
protected boolean recursiveFileScan = true;
```
** Option 2**: Change to `private` and provide getter (stricter, but may
break existing subclasses)
```java
private boolean recursiveFileScan = true;
protected boolean isRecursiveFileScan() {
return recursiveFileScan;
}
```
** Recommendation**: **Option 1** - Add detailed JavaDoc, maintain backward
compatibility.
---
# ## Issue 5: Missing OssJindoFileSourceFactory Update
** Location**:
`seatunnel-connectors-v2/connector-file-connector-file-jindo-oss/`
** Issue Description**:
PR 变更列表中包含:
- ✅ 文档更新: `docs/en/connectors/source/OssJindoFile.md`
- ✅ 文档更新: `docs/zh/connectors/source/OssJindoFile.md`
- ❌ Factory 更新: `OssJindoFileSourceFactory.java` **缺失**
** Verification**:
```bash
$ git diff dev --name-only | grep -i jindo
docs/en/connectors/source/OssJindoFile.md
docs/zh/connectors/source/OssJindoFile.md
```
** Confidence**: High (confirmed through git diff)
** Potential Risks**:
- **风险 1**: 用户配置 `recursive_file_scan` 时,配置验证通过但实际不生效
- **风险 2**: Factory 的 `optionRule()` 未注册此选项,用户配置时可能被忽略或报错
- **风险 3**: 文档与实际行为不一致,导致用户困惑
** Impact Scope**:
- 直接影响: OssJindoFile 连接器用户
- 间接影响: 文档可信度
- 影响面: 单个连接器 (OssJindoFile)
** Severity**: **MAJOR** (needs fixing)
** Improvement Suggestions**:
确认 `OssJindoFileSourceFactory` 的位置并添加配置:
```bash
# Find Factory files
find seatunnel-connectors-v2 -name "*OssJindoFileSourceFactory.java"
```
然后在 `optionRule()` 方法中添加:
```java
@Override
public OptionRule optionRule() {
return OptionRule.builder()
// ... existing rules ...
.optional(FileBaseSourceOptions.RECURSIVE_FILE_SCAN)
.build();
}
```
如果 `OssJindoFileSourceFactory` 不存在或使用不同的基类,需要:
1. 确认其继承关系
2. 添加相应的配置注册逻辑
3. 更新 PR 变更列表
---
# ## Issue 6: Test Resource Cleanup May Fail
** Location**: `AbstractReadStrategyTest.java:480-488`
** Issue Description**:
```java
private void deleteTestDirectory(String dirPath) {
try {
Path path = new Path(dirPath);
Configuration conf = new Configuration();
org.apache.hadoop.fs.FileSystem fs = path.getFileSystem(conf);
fs.delete(path, true);
} catch (Exception e) {
// Empty catch, ignore all exceptions
}
}
```
** Confidence**: Medium (confirmed through code review)
** Potential Risks**:
- **风险 1**: 如果删除失败,测试文件可能累积在 `/tmp/` 目录
- **风险 2**: 异常被静默忽略,无法诊断问题
- **风险 3**: 后续测试可能因文件已存在而失败
** Impact Scope**:
- 直接影响: 测试环境
- 间接影响: CI/CD 稳定性
- 影响面: 单个测试文件
** Severity**: **MINOR** (optional improvement)
** Improvement Suggestions**:
```java
private void deleteTestDirectory(String dirPath) {
try {
Path path = new Path(dirPath);
Configuration conf = new Configuration();
org.apache.hadoop.fs.FileSystem fs = path.getFileSystem(conf);
if (fs.exists(path)) {
fs.delete(path, true);
}
} catch (Exception e) {
// Log exception instead of silently ignoring
System.err.println("Warning: Failed to delete test directory: " +
dirPath);
e.printStackTrace();
}
}
```
或者使用测试框架的 `@TempDir` 注解 (JUnit 5):
```java
@Test
@TempDir Path tempDir // Auto cleanup
public void testNonRecursiveFileScan() throws Exception {
String baseDir = tempDir.toAbsolutePath().toString() + "/test";
// ...
}
```
---
# ## Issue 7: Inconsistent Chinese Documentation Description
** Location**:
- `docs/zh/connectors/source/LocalFile.md:501`
- `docs/zh/connectors/source/ObsFile.md:126`
** Issue Description**:
不同连接器的中文文档对 `recursive_file_scan` 的描述略有差异:
**LocalFile.md**:
```markdown
Whether to scan subdirectories recursively.
If set to `false`, subdirectories will be ignored and only files under the
specified path will be scanned.
```
**ObsFile.md**:
```markdown
Whether to scan subdirectories recursively. If `false`, subdirectories will
be ignored.
```
** Confidence**: High (confirmed through documentation comparison)
** Potential Risks**:
- **风险**: 用户阅读不同文档时可能对功能理解不一致
- **影响**: 降低文档专业性
** Impact Scope**:
- 直接影响: 中文文档读者
- 间接影响: 文档一致性
- 影响面: 中文文档
** Severity**: **MINOR** (optional improvement)
** Improvement Suggestions**:
统一所有连接器的中文文档描述为:
```markdown
### recursive_file_scan [boolean]
Whether to scan subdirectories recursively. If set to `false`,
subdirectories will be ignored.
**Default value**: `true` (maintains backward compatibility)
**Description**:
- When set to `true`, recursively scans all subdirectories
- When set to `false`, only scans files under the specified directory,
ignoring all subdirectories
**Note**: Hidden directories (starting with `.`) are always skipped,
regardless of this setting.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]