DanielCarter-stack commented on PR #10486:
URL: https://github.com/apache/seatunnel/pull/10486#issuecomment-3888676972
<!-- code-pr-reviewer -->
<!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10486", "part": 1,
"total": 1} -->
### Issue 1: Error message not clear enough after re-matching fails
**Location**:
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeParseHelper.java:82-95`
**Related context**:
- Interface definition: `DateTimeParseHelper.java:31-96`
- Caller 1: `TextDeserializationSchema.java:310-312` (DATE case)
- Caller 2: `CsvDeserializationSchema.java:297-299` (TIMESTAMP case)
- Caller 3: `JsonToRowConverters.java:134-136` (DATE case)
**Problem description**:
When the first auto-matched format fails to parse, and the second re-match
also fails (returns `null`), a generic error is thrown without indicating it
was caused by a format change.
**Potential risks**:
- Risk 1: Users cannot determine whether they need to configure an explicit
format
- Risk 2: Difficult to debug, unclear what format was matched the first time
**Impact scope**:
- Direct impact: All Format classes using DateTime Parse
- Indirect impact: User data quality validation processes
- Affected area: Multiple Formats (Text, CSV, JSON)
**Severity**: MINOR
**Improvement suggestion**:
```java
// DateTimeParseHelper.java:88-94
if (isUserConfigured) {
throw errorSupplier.get(fieldVal, fieldName);
}
// Record original format information before retry
DateTimeFormatter originalFormatter = fieldFormatterCache.get(fieldName);
formatter = autoFormatterSupplier.get(fieldVal);
if (formatter == null) {
// Improve error message to indicate that automatic matching was
attempted but failed
throw new SeaTunnelRuntimeException(
CommonErrorCode.FORMAT_DATETIME_ERROR,
Map.of(
"datetime", fieldVal,
"field", fieldName,
"reason", "Auto-match failed after initial format mismatch: " +
(originalFormatter != null ? "cached format incompatible" :
"no matching pattern")
)
);
}
```
**Rationale**: Provide more context to help users understand whether they
need to explicitly configure a format or clean the data.
---
### Issue 2: FormatterConfig.getPatternStr() uses instanceof chain,
violating Open-Closed Principle
**Location**:
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeParseHelper.java:98-109`
**Related context**:
- Implementation: `DateTimeParseHelper.java:98-109`
- Caller: `DateTimeParseHelper.parseDateTimeValue()` (line 69)
- Related classes: `DateUtils.Formatter`, `TimeUtils.Formatter`,
`DateTimeUtils.Formatter`
**Problem description**:
Uses `instanceof` chain to determine Formatter type; adding new Formatter
types requires modifying this method.
**Potential risks**:
- Risk 1: Maintenance burden - must modify every time a new Formatter is
added
- Risk 2: Easy to miss a branch, resulting in returning an empty string
**Impact scope**:
- Direct impact: `DateTimeParseHelper` interface
- Indirect impact: Future Formatter enum additions
- Affected area: Core utility class
**Severity**: MINOR
**Improvement suggestion**:
```java
// Add to Formatter interface
public interface Formatter<T> {
String getValue();
String getPattern(); // New addition
}
// In each Formatter implementation
public enum Formatter implements Formatter<Formatter> {
YYYY_MM_DD("yyyy-MM-dd");
@Override
public String getPattern() {
return getValue(); // Default implementation
}
}
// Simplify getPatternStr
default String getPatternStr(FormatterConfig<?> formatterConfig) {
return formatterConfig.getFormatter().getPattern();
}
```
**Rationale**: Leverage polymorphism to avoid type checking, complying with
the Open-Closed Principle.
---
### Issue 3: Missing documentation and CHANGELOG
**Location**:
- PR description (claims to follow checklist but not executed)
- Documentation files: No `.md` modifications found
**Related context**:
- PR description: `If necessary, please update the documentation to describe
the new feature.`
- Configuration items: `DATE_FORMAT`, `DATETIME_FORMAT`, `TIME_FORMAT`
(newly added)
- Configuration items: `DATE_FORMAT_LEGACY`, `DATETIME_FORMAT_LEGACY`,
`TIME_FORMAT_LEGACY` (retained)
**Problem description**:
1. Users are unaware of the existence and purpose of new configuration items
2. Unclear about the relationship between new and old configuration items
and which one is recommended
3. Possible breaking changes not explained in `incompatible-changes.md`
**Potential risks**:
- Risk 1: Users continue to use old `DATE_FORMAT_LEGACY`, unaware of new
`DATE_FORMAT`
- Risk 2: After upgrading to a new version, configuration behavior changes
causing confusion
**Impact scope**:
- Direct impact: All File Connector users
- Indirect impact: Documentation maintainers
- Affected area: Configuration files and documentation
**Severity**: MAJOR (user experience issue)
**Improvement suggestion**:
1. Update `docs/en/connector-v2/source/FileSource.md` to explain:
- Newly added configuration items: `date_format`, `datetime_format`,
`time_format`
- Old configuration items retained but marked as deprecated
- Configuration examples
2. Update `docs/en/about/incompatible-changes.md`:
```markdown
## [Version X.Y.Z]
### File Source Configuration
- **Change**: Added new configuration options `date_format`,
`datetime_format`, `time_format` for more explicit format specification
- **Impact**: Old options `date_format_legacy`, `datetime_format_legacy`,
`time_format_legacy` are still supported but deprecated
- **Migration**: Update to use new options without `_legacy` suffix
```
3. 在 `CHANGELOG.md` 中添加条目
** Reason**: Follow Apache project specifications to ensure smooth user
upgrades.
---
# ## Issue 4: Caching in concurrent scenarios may cause race conditions
** Location**:
`seatunnel-formats/seatunnel-format-text/src/main/java/org/apache/seatunnel/format/text/TextDeserializationSchema.java:70`
** Related context**:
- 定义:`TextDeserializationSchema.java:70`
- 使用:`DateTimeParseHelper.parseDateTimeValue()` line 63, 79, 89, 93
-
实现类:`TextDeserializationSchema`、`CsvDeserializationSchema`、`JsonToRowConverters`
** Problem description**:
`fieldFormatterCache` 是 `ConcurrentHashMap`,但在
`DateTimeParseHelper.parseDateTimeValue()` 中的使用模式:
```java
DateTimeFormatter formatter = fieldFormatterCache.get(fieldName); // read
if (formatter == null) {
// ... compute new formatter ...
fieldFormatterCache.put(fieldName, formatter); // write
}
```
虽然 `ConcurrentHashMap` 的 `get()` 和 `put()` 是线程安全的,但存在 **check-then-act 竞态**:
- 线程 A 和线程 B 同时解析同一字段
- 两个线程都看到 `formatter == null`
- 两个线程都调用 `autoFormatterSupplier.get(fieldVal)` 并执行 `put()`
- 结果:一个线程的计算结果覆盖另一个线程的结果
** Potential risks**:
- 风险 1:在极端情况下(多线程同时启动),缓存可能不稳定
- 风险 2:轻微的性能损耗(重复计算)
- 风险 3:理论上可能导致两个线程使用不同的格式解析同一字段
** Impact scope**:
-
直接影响:`TextDeserializationSchema`、`CsvDeserializationSchema`、`JsonToRowConverters`
- 间接影响:多线程 Source 任务
- 影响面:Format 层
** Severity**: MINOR (In actual scenarios, field parsing for the first time
is usually single-threaded, but the design is not robust enough)
** Improvement suggestions**:
```java
// Use computeIfAbsent to guarantee atomicity
DateTimeFormatter formatter = fieldFormatterCache.computeIfAbsent(fieldName,
key -> {
if (isUserConfigured) {
String pattern = getPatternStr(formatterConfig);
return DateTimeFormatter.ofPattern(pattern);
} else {
DateTimeFormatter matched = autoFormatterSupplier.get(fieldVal);
if (matched == null) {
throw errorSupplier.get(fieldVal, fieldName);
}
return matched;
}
});
// Parse phase
try {
return parser.parse(fieldVal, formatter);
} catch (Exception e) {
if (isUserConfigured) {
throw errorSupplier.get(fieldVal, fieldName);
}
// Re-match: replace cached formatter
DateTimeFormatter newFormatter = autoFormatterSupplier.get(fieldVal);
if (newFormatter == null) {
throw errorSupplier.get(fieldVal, fieldName);
}
// Atomic replacement (note: there may be concurrency issues here, but
rare in actual scenarios)
fieldFormatterCache.replace(fieldName, formatter, newFormatter);
return parser.parse(fieldVal, newFormatter);
}
```
** Reason**: Use `computeIfAbsent()` to eliminate check-then-act race
conditions, ensuring that each field's format is calculated only once.
---
# ## Issue 5: Time format parsing lacks unified automatic matching support
** Location**:
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/TimeUtils.java:116-123`
** Related context**:
- `TimeUtils.matchTimeFormatter()`: line 116-123
- `DateTimeUtils.matchDateTimeFormatter()`: line 348-367(有按长度分组优化)
- `DateUtils.matchDateFormatter()`: line 124-131(线性列表遍历)
** Problem description**:
`TimeUtils.matchTimeFormatter()` 使用线性遍历列表(8 个模式),而
`DateTimeUtils.matchDateTimeFormatter()` 使用了按长度分组的优化(性能提升 4-5 倍)。
时间格式的多样性(HH:mm:ss vs H:mm:ss 等)也值得类似优化。
** Potential risks**:
- 风险 1:时间字段解析性能相对较差
- 风险 2:架构不一致(为什么 DateTime 有优化而 Time 没有)
** Impact scope**:
- 直接影响:`TimeUtils.parse(String)` 的独立调用者
- 间接影响:通过 `DateTimeParseHelper` 的调用不受影响(已统一)
- 影响面:Core 工具类
** Severity**: MINOR (Performance optimization opportunity)
** Improvement suggestions**:
```java
// TimeUtils.java
private static final Map<Integer, List<TimePattern>> TIME_PATTERN_MAP = new
HashMap<>();
private static final int TIME_LENGTH_THRESHOLD = 12; // e.g.
"HH:mm:ss.SSSSSSSSS"
static {
initTimePatternMap();
}
private static void initTimePatternMap() {
// Group by string length: 8 (HH:mm:ss), 9 (H:mm:ss), 12
(HH:mm:ss.SSSSSSSSS), etc.
List<TimePattern> length8Patterns = new ArrayList<>();
length8Patterns.add(new TimePattern("\\d{2}:\\d{2}:\\d{2}",
Formatter.HH_MM_SS));
TIME_PATTERN_MAP.put(8, length8Patterns);
// ... other lengths
}
public static DateTimeFormatter matchTimeFormatter(String timeStr) {
if (timeStr == null || timeStr.isEmpty()) {
throw new IllegalArgumentException("Time string cannot be null or
empty");
}
int strLength = timeStr.length();
List<TimePattern> timePatterns =
TIME_PATTERN_MAP.getOrDefault(strLength, Collections.emptyList());
for (TimePattern pattern : timePatterns) {
if (pattern.getPattern().matcher(timeStr).matches()) {
return pattern.getFormatter();
}
}
return null; // or return match from extra-long group
}
```
**Rationale**: Maintain architectural consistency, improve time field
parsing performance.
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]