hawk9821 commented on PR #10486:
URL: https://github.com/apache/seatunnel/pull/10486#issuecomment-4656038767
> # What This PR Solves
> * User pain: file/JSON/Debezium readers currently require fairly strict
date/time strings, so common inputs such as single-digit month/day/time or
reversed dates can fail during source deserialization unless users configure
exact formats.
> * Fix approach: this PR centralizes date/time auto-detection in
`DateUtils`, `TimeUtils`, `DateTimeUtils`, and `DateTimeParseHelper`, then
wires that helper into CSV/Text/JSON/Debezium deserialization.
> * In one sentence: this is a useful feature that makes time parsing in
common source paths more forgiving.
>
> # 1. Code Change Review
> ## 1.1 Core Logic Analysis
> Precise change:
>
> * Core changes are in
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java`,
`DateUtils.java`, `TimeUtils.java`, and `DateTimeParseHelper.java`.
> * The runtime entry points are `CsvDeserializationSchema.convert()`,
`TextDeserializationSchema.convert()`,
`JsonToRowConverters.createNotNullConverter()`, and
`DebeziumRowConverter.getValue()`.
> * The critical path is DATE/TIME/TIMESTAMP source-field conversion.
>
> Before / after:
>
> ```java
> // Before: CSV used separate local parsing methods
> case DATE:
> return parseDate(field, fieldName);
> case TIME:
> return parseTime(field);
> case TIMESTAMP:
> return parseTimestamp(field, fieldName);
> ```
>
> ```java
> // After: DATE/TIME/TIMESTAMP share DateTimeParseHelper and a field
formatter cache
> case DATE:
> return parseDate(fieldVal, fieldName, dateFormatterConfig,
fieldFormatterCache);
> case TIME:
> return parseTime(fieldVal, fieldName, timeFormatterConfig,
fieldFormatterCache);
> case TIMESTAMP:
> return parseTimestamp(fieldVal, fieldName, dateTimeFormatterConfig,
fieldFormatterCache);
> ```
>
> Key findings:
>
> * The normal file-source path does hit this PR: `FileSource ->
CsvReadStrategy/TextReadStrategy -> DeserializationSchema.convert() ->
DateTimeParseHelper`.
> * JSON and Debezium deserialization also reuse the helper, so the impact
is wider than CSV/Text only.
> * The approach is a good long-term direction, but a few correctness and
maintainability gaps need to be fixed before merge.
> * Current local merge against the latest `upstream/dev` succeeds, `git
diff --check upstream/dev...HEAD` is clean, and the visible GitHub Build check
is green.
>
> Full runtime path:
>
> ```
> File Source startup
> -> FileSourceReader reads a split
> -> CsvReadStrategy.read() / TextReadStrategy.read()
> -> CsvDeserializationSchema.deserialize() /
TextDeserializationSchema.deserialize()
> -> convert(fieldVal, fieldType, level, fieldName)
> -> DATE / TIME / TIMESTAMP
> -> DateTimeParseHelper.parseDateTimeValue()
> ->
fieldFormatterCache.computeIfAbsent(fieldName, ...)
> -> user-configured formatter or auto-matched
formatter
> -> DateUtils / TimeUtils / DateTimeUtils.parse()
>
> JSON / Debezium path
> -> JsonToRowConverters.createConverter()
> -> DATE / TIME / TIMESTAMP use DateTimeParseHelper
> -> DebeziumRowConverter.getValue()
> -> DATE / TIME / TIMESTAMP use DateTimeParseHelper
> ```
>
> ### Issue 1: CSV/Text MAP key and value share the same formatter cache key
> * Location:
`seatunnel-formats/seatunnel-format-csv/src/main/java/org/apache/seatunnel/format/csv/CsvDeserializationSchema.java:285`
> * Description: in the CSV MAP conversion path, both the map key and map
value are converted with the same `fieldName`. After this PR,
DATE/TIME/TIMESTAMP all share `fieldFormatterCache`, keyed only by `fieldName`.
For a supported schema such as `MAP<DATE,TIME>` with explicit
`date_format="yyyy-MM-dd"` and `time_format="HH:mm:ss"`, the key can cache a
date formatter under the map field name; the value then reuses that date
formatter instead of the time formatter, and because the time formatter is
user-configured, the helper throws immediately rather than re-matching.
`TextDeserializationSchema` has the same pattern.
> * Risk: valid file-source reads can fail for complex MAP fields. Without
explicit formats it can still repeatedly throw, re-match, and replace the
formatter between key/value parsing, adding avoidable CPU/GC cost.
> * Recommended fix: use distinct cache keys for map key/value, such as
`fieldName + ".key"` and `fieldName + ".value"`, or make the cache key include
both logical field path and `SqlType`.
> * Severity: Medium
>
> ### Issue 2: Reverse datetime regex accepts `T`, but the formatter only
accepts a space
> * Location:
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java:70`
> * Description: `reversePattern` allows `[T\\s]`, so inputs like
`1/2/2026T12:01:30` are considered matched by `matchDateTimeFormatter()`. The
paired `reverseFormatter`, however, only calls `appendLiteral(' ')` at line 85
and cannot parse the `T` separator.
> * Risk: this creates a confusing “matched but cannot parse” branch in the
normal CSV/Text/JSON path. It also weakens the author’s reply that the reversed
date-time variants are supported.
> * Recommended fix: either restrict the regex to a space separator or
update the formatter to accept both space and `T`, then add a focused unit test
for `1/2/2026T12:01:30`.
> * Severity: Medium
>
> ### Issue 3: Regular unit tests now include several 10-million-iteration
performance loops
> * Location:
`seatunnel-common/src/test/java/org/apache/seatunnel/common/utils/DateTimeUtilsTest.java:635`
> * Description: the new performance tests run large loops in normal JUnit
and only assert that elapsed time is greater than zero. They do not protect
behavior, but they do add CI cost and noisy timing output.
> * Risk: SeaTunnel’s normal unit-test suite becomes slower and more
sensitive to machine load without gaining reliable regression coverage.
> * Recommended fix: remove these tests from normal UT, or move them to an
explicitly disabled benchmark/JMH-style test. Keep behavior tests with exact
parsed-value assertions in regular UT.
> * Severity: Medium
>
> ### Issue 4: The user-visible auto-format behavior is not documented
> * Location: `docs/en/connectors/source/LocalFile.md:240`
> * Description: this PR changes user-visible behavior for file source
`date_format`, `datetime_format`, and `time_format` when formats are not
explicitly configured, but no docs were updated. The current English and
Chinese file-source docs still list only the older supported formats.
> * Risk: users and maintainers cannot tell which formats are now supported,
and the docs no longer match the actual behavior.
> * Recommended fix: update both `docs/en` and `docs/zh` for the file source
connectors that expose these options, and clearly explain the difference
between explicit formats and auto-detection.
> * Severity: Medium
>
> ### Issue 5: The new time-format error message still says “date format”
> * Location:
`seatunnel-common/src/main/java/org/apache/seatunnel/common/exception/CommonErrorCode.java:84`
> * Description: `FORMAT_TIME_ERROR` currently says `The date format
'<time>'... Please check the time format.`
> * Risk: failure messages during source parsing are slightly misleading.
> * Recommended fix: change it to `The time format '<time>' of field
'<field>' is not supported...`.
> * Severity: Low
>
> ## 1.2 Compatibility Impact
> Assessment: partially compatible.
>
> * API: no existing public API is removed, but new helper/config classes
are added.
> * Config: no new key is introduced, but the behavior when no explicit
date/time format is configured becomes more permissive.
> * Defaults: textual default values remain the same; the effective default
parsing behavior is broader.
> * Protocol / serialization: no protocol or SeaTunnelRow serialization
format change.
> * Historical behavior: mostly compatible as an enhancement, but Issue 1
can break complex CSV/Text MAP fields with explicit formats.
>
> ## 1.3 Performance / Side Effects
> * CPU: normal cached parsing is reasonable; repeated exception-driven
re-matching in MAP key/value cases should be fixed.
> * Memory / GC: field formatter caches are small, but parse exceptions can
add GC pressure in the cache-collision case.
> * Network: no impact.
> * Lock contention / concurrency: `ConcurrentHashMap` is used on
CSV/Text/Debezium paths; cache key granularity is the main concern.
> * Retry / idempotency / resource release: no external side effects; parse
failures still fail the source task and rely on engine retry semantics.
>
> ## 1.4 Error Handling and Logging
> The centralized error path is a good improvement, but Issue 2 can still
surface a low-level parse failure after a formatter has already been selected,
and Issue 5 should be corrected for user-facing clarity. No sensitive data
logging concern was found.
>
> # 2. Code Quality
> ## 2.1 Style
> The main production code is generally understandable, but the pattern
table in `DateTimeUtils` is large enough that exact edge-case tests are
important. I would also avoid adding whole-file whitelist entries to the
Chinese-character check when a smaller comment cleanup would work better.
>
> ## 2.2 Test Coverage
> The PR adds useful utility and reader tests, but coverage should be
tightened:
>
> * Assert exact parsed values in CSV/Text/JSON tests, not just non-null
fields.
> * Add MAP key/value tests for DATE/TIME/TIMESTAMP with explicit format
options.
> * Add a test for the reverse datetime `T` separator, or explicitly assert
it is unsupported.
> * Move the performance loops out of normal JUnit.
>
> ## 2.3 Documentation
> Docs need to be updated in both English and Chinese because this is a
user-visible source parsing feature.
>
> # 3. Architecture
> ## 3.1 Elegance
> This is a good direction and more than a temporary workaround, but it
needs the cache and pattern consistency fixes before it is merge-ready.
>
> ## 3.2 Maintainability
> Centralizing parsing helps. The main maintainability risk is keeping each
regex and its paired formatter semantically identical.
>
> ## 3.3 Extensibility
> The helper/pattern approach can be extended for future formats, especially
if formatter cache keys include logical path plus type.
>
> ## 3.4 Historical Compatibility
> Existing simple DATE/TIME/TIMESTAMP fields remain compatible, and many
previously failing inputs now work. The MAP + explicit-format case should be
fixed before merge.
>
> # 4. Issue Summary
> No. Issue Location Severity
> 1 CSV/Text MAP key and value share the same formatter cache key
`seatunnel-formats/seatunnel-format-csv/src/main/java/org/apache/seatunnel/format/csv/CsvDeserializationSchema.java:285`
Medium
> 2 Reverse datetime regex accepts `T`, but the formatter only accepts a
space
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java:70`
Medium
> 3 Regular unit tests include 10-million-iteration performance loops
`seatunnel-common/src/test/java/org/apache/seatunnel/common/utils/DateTimeUtilsTest.java:635`
Medium
> 4 Auto-format behavior is not documented
`docs/en/connectors/source/LocalFile.md:240` Medium
> 5 Time-format error message says “date format”
`seatunnel-common/src/main/java/org/apache/seatunnel/common/exception/CommonErrorCode.java:84`
Low
> # 5. Merge Conclusion
> ### Conclusion: can merge after fixes
> 1. Blocking items
>
> * Issue 1: supported CSV/Text MAP fields can fail when users explicitly
configure date/time formats.
> * Issue 2: the reversed datetime matcher and formatter disagree on `T`
separators.
> * Issue 3: performance loops should not run as normal unit tests.
> * Issue 4: the user-visible parsing enhancement needs English and Chinese
docs.
>
> 2. Recommended non-blocking item
>
> * Issue 5: please fix the typo in the new time-format error message.
>
> Overall, I like the direction of this PR. The centralized parsing helper
is useful and the normal source path really does hit the new code. After the
cache-key fix, the matcher/formatter alignment, the test cleanup, and the docs
update, this should be in good shape.
>
> Local review notes: I fetched and checked out `seatunnel-review-10486` at
`56d85bb24c9712d189d03a2d1ba4ed2edf50469d`; merge-base is
`37999ea5bbac070163b4c9f9e322781fa940a5ee`, and current `upstream/dev` is
`b9cdfeb79e2df010718acde1d6e45b956a76f4fa`. `git diff --check
upstream/dev...HEAD` was clean. A temporary local merge into current
`upstream/dev` reported `Automatic merge went well`. GitHub checks show Build
passed in 21m52s, with label/notify checks also passing. I did not run Maven
tests locally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]