hawk9821 commented on PR #10486:
URL: https://github.com/apache/seatunnel/pull/10486#issuecomment-4656038767

   > # What This PR Solves
   > * User pain: file/JSON/Debezium readers currently require fairly strict 
date/time strings, so common inputs such as single-digit month/day/time or 
reversed dates can fail during source deserialization unless users configure 
exact formats.
   > * Fix approach: this PR centralizes date/time auto-detection in 
`DateUtils`, `TimeUtils`, `DateTimeUtils`, and `DateTimeParseHelper`, then 
wires that helper into CSV/Text/JSON/Debezium deserialization.
   > * In one sentence: this is a useful feature that makes time parsing in 
common source paths more forgiving.
   > 
   > # 1. Code Change Review
   > ## 1.1 Core Logic Analysis
   > Precise change:
   > 
   > * Core changes are in 
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java`,
 `DateUtils.java`, `TimeUtils.java`, and `DateTimeParseHelper.java`.
   > * The runtime entry points are `CsvDeserializationSchema.convert()`, 
`TextDeserializationSchema.convert()`, 
`JsonToRowConverters.createNotNullConverter()`, and 
`DebeziumRowConverter.getValue()`.
   > * The critical path is DATE/TIME/TIMESTAMP source-field conversion.
   > 
   > Before / after:
   > 
   > ```java
   > // Before: CSV used separate local parsing methods
   > case DATE:
   >     return parseDate(field, fieldName);
   > case TIME:
   >     return parseTime(field);
   > case TIMESTAMP:
   >     return parseTimestamp(field, fieldName);
   > ```
   > 
   > ```java
   > // After: DATE/TIME/TIMESTAMP share DateTimeParseHelper and a field 
formatter cache
   > case DATE:
   >     return parseDate(fieldVal, fieldName, dateFormatterConfig, 
fieldFormatterCache);
   > case TIME:
   >     return parseTime(fieldVal, fieldName, timeFormatterConfig, 
fieldFormatterCache);
   > case TIMESTAMP:
   >     return parseTimestamp(fieldVal, fieldName, dateTimeFormatterConfig, 
fieldFormatterCache);
   > ```
   > 
   > Key findings:
   > 
   > * The normal file-source path does hit this PR: `FileSource -> 
CsvReadStrategy/TextReadStrategy -> DeserializationSchema.convert() -> 
DateTimeParseHelper`.
   > * JSON and Debezium deserialization also reuse the helper, so the impact 
is wider than CSV/Text only.
   > * The approach is a good long-term direction, but a few correctness and 
maintainability gaps need to be fixed before merge.
   > * Current local merge against the latest `upstream/dev` succeeds, `git 
diff --check upstream/dev...HEAD` is clean, and the visible GitHub Build check 
is green.
   > 
   > Full runtime path:
   > 
   > ```
   > File Source startup
   >   -> FileSourceReader reads a split
   >       -> CsvReadStrategy.read() / TextReadStrategy.read()
   >           -> CsvDeserializationSchema.deserialize() / 
TextDeserializationSchema.deserialize()
   >               -> convert(fieldVal, fieldType, level, fieldName)
   >                   -> DATE / TIME / TIMESTAMP
   >                       -> DateTimeParseHelper.parseDateTimeValue()
   >                           -> 
fieldFormatterCache.computeIfAbsent(fieldName, ...)
   >                           -> user-configured formatter or auto-matched 
formatter
   >                           -> DateUtils / TimeUtils / DateTimeUtils.parse()
   > 
   > JSON / Debezium path
   >   -> JsonToRowConverters.createConverter()
   >       -> DATE / TIME / TIMESTAMP use DateTimeParseHelper
   >   -> DebeziumRowConverter.getValue()
   >       -> DATE / TIME / TIMESTAMP use DateTimeParseHelper
   > ```
   > 
   > ### Issue 1: CSV/Text MAP key and value share the same formatter cache key
   > * Location: 
`seatunnel-formats/seatunnel-format-csv/src/main/java/org/apache/seatunnel/format/csv/CsvDeserializationSchema.java:285`
   > * Description: in the CSV MAP conversion path, both the map key and map 
value are converted with the same `fieldName`. After this PR, 
DATE/TIME/TIMESTAMP all share `fieldFormatterCache`, keyed only by `fieldName`. 
For a supported schema such as `MAP<DATE,TIME>` with explicit 
`date_format="yyyy-MM-dd"` and `time_format="HH:mm:ss"`, the key can cache a 
date formatter under the map field name; the value then reuses that date 
formatter instead of the time formatter, and because the time formatter is 
user-configured, the helper throws immediately rather than re-matching. 
`TextDeserializationSchema` has the same pattern.
   > * Risk: valid file-source reads can fail for complex MAP fields. Without 
explicit formats it can still repeatedly throw, re-match, and replace the 
formatter between key/value parsing, adding avoidable CPU/GC cost.
   > * Recommended fix: use distinct cache keys for map key/value, such as 
`fieldName + ".key"` and `fieldName + ".value"`, or make the cache key include 
both logical field path and `SqlType`.
   > * Severity: Medium
   > 
   > ### Issue 2: Reverse datetime regex accepts `T`, but the formatter only 
accepts a space
   > * Location: 
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java:70`
   > * Description: `reversePattern` allows `[T\\s]`, so inputs like 
`1/2/2026T12:01:30` are considered matched by `matchDateTimeFormatter()`. The 
paired `reverseFormatter`, however, only calls `appendLiteral(' ')` at line 85 
and cannot parse the `T` separator.
   > * Risk: this creates a confusing “matched but cannot parse” branch in the 
normal CSV/Text/JSON path. It also weakens the author’s reply that the reversed 
date-time variants are supported.
   > * Recommended fix: either restrict the regex to a space separator or 
update the formatter to accept both space and `T`, then add a focused unit test 
for `1/2/2026T12:01:30`.
   > * Severity: Medium
   > 
   > ### Issue 3: Regular unit tests now include several 10-million-iteration 
performance loops
   > * Location: 
`seatunnel-common/src/test/java/org/apache/seatunnel/common/utils/DateTimeUtilsTest.java:635`
   > * Description: the new performance tests run large loops in normal JUnit 
and only assert that elapsed time is greater than zero. They do not protect 
behavior, but they do add CI cost and noisy timing output.
   > * Risk: SeaTunnel’s normal unit-test suite becomes slower and more 
sensitive to machine load without gaining reliable regression coverage.
   > * Recommended fix: remove these tests from normal UT, or move them to an 
explicitly disabled benchmark/JMH-style test. Keep behavior tests with exact 
parsed-value assertions in regular UT.
   > * Severity: Medium
   > 
   > ### Issue 4: The user-visible auto-format behavior is not documented
   > * Location: `docs/en/connectors/source/LocalFile.md:240`
   > * Description: this PR changes user-visible behavior for file source 
`date_format`, `datetime_format`, and `time_format` when formats are not 
explicitly configured, but no docs were updated. The current English and 
Chinese file-source docs still list only the older supported formats.
   > * Risk: users and maintainers cannot tell which formats are now supported, 
and the docs no longer match the actual behavior.
   > * Recommended fix: update both `docs/en` and `docs/zh` for the file source 
connectors that expose these options, and clearly explain the difference 
between explicit formats and auto-detection.
   > * Severity: Medium
   > 
   > ### Issue 5: The new time-format error message still says “date format”
   > * Location: 
`seatunnel-common/src/main/java/org/apache/seatunnel/common/exception/CommonErrorCode.java:84`
   > * Description: `FORMAT_TIME_ERROR` currently says `The date format 
'<time>'... Please check the time format.`
   > * Risk: failure messages during source parsing are slightly misleading.
   > * Recommended fix: change it to `The time format '<time>' of field 
'<field>' is not supported...`.
   > * Severity: Low
   > 
   > ## 1.2 Compatibility Impact
   > Assessment: partially compatible.
   > 
   > * API: no existing public API is removed, but new helper/config classes 
are added.
   > * Config: no new key is introduced, but the behavior when no explicit 
date/time format is configured becomes more permissive.
   > * Defaults: textual default values remain the same; the effective default 
parsing behavior is broader.
   > * Protocol / serialization: no protocol or SeaTunnelRow serialization 
format change.
   > * Historical behavior: mostly compatible as an enhancement, but Issue 1 
can break complex CSV/Text MAP fields with explicit formats.
   > 
   > ## 1.3 Performance / Side Effects
   > * CPU: normal cached parsing is reasonable; repeated exception-driven 
re-matching in MAP key/value cases should be fixed.
   > * Memory / GC: field formatter caches are small, but parse exceptions can 
add GC pressure in the cache-collision case.
   > * Network: no impact.
   > * Lock contention / concurrency: `ConcurrentHashMap` is used on 
CSV/Text/Debezium paths; cache key granularity is the main concern.
   > * Retry / idempotency / resource release: no external side effects; parse 
failures still fail the source task and rely on engine retry semantics.
   > 
   > ## 1.4 Error Handling and Logging
   > The centralized error path is a good improvement, but Issue 2 can still 
surface a low-level parse failure after a formatter has already been selected, 
and Issue 5 should be corrected for user-facing clarity. No sensitive data 
logging concern was found.
   > 
   > # 2. Code Quality
   > ## 2.1 Style
   > The main production code is generally understandable, but the pattern 
table in `DateTimeUtils` is large enough that exact edge-case tests are 
important. I would also avoid adding whole-file whitelist entries to the 
Chinese-character check when a smaller comment cleanup would work better.
   > 
   > ## 2.2 Test Coverage
   > The PR adds useful utility and reader tests, but coverage should be 
tightened:
   > 
   > * Assert exact parsed values in CSV/Text/JSON tests, not just non-null 
fields.
   > * Add MAP key/value tests for DATE/TIME/TIMESTAMP with explicit format 
options.
   > * Add a test for the reverse datetime `T` separator, or explicitly assert 
it is unsupported.
   > * Move the performance loops out of normal JUnit.
   > 
   > ## 2.3 Documentation
   > Docs need to be updated in both English and Chinese because this is a 
user-visible source parsing feature.
   > 
   > # 3. Architecture
   > ## 3.1 Elegance
   > This is a good direction and more than a temporary workaround, but it 
needs the cache and pattern consistency fixes before it is merge-ready.
   > 
   > ## 3.2 Maintainability
   > Centralizing parsing helps. The main maintainability risk is keeping each 
regex and its paired formatter semantically identical.
   > 
   > ## 3.3 Extensibility
   > The helper/pattern approach can be extended for future formats, especially 
if formatter cache keys include logical path plus type.
   > 
   > ## 3.4 Historical Compatibility
   > Existing simple DATE/TIME/TIMESTAMP fields remain compatible, and many 
previously failing inputs now work. The MAP + explicit-format case should be 
fixed before merge.
   > 
   > # 4. Issue Summary
   > No.        Issue   Location        Severity
   > 1  CSV/Text MAP key and value share the same formatter cache key   
`seatunnel-formats/seatunnel-format-csv/src/main/java/org/apache/seatunnel/format/csv/CsvDeserializationSchema.java:285`
        Medium
   > 2  Reverse datetime regex accepts `T`, but the formatter only accepts a 
space      
`seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateTimeUtils.java:70`
        Medium
   > 3  Regular unit tests include 10-million-iteration performance loops       
`seatunnel-common/src/test/java/org/apache/seatunnel/common/utils/DateTimeUtilsTest.java:635`
   Medium
   > 4  Auto-format behavior is not documented  
`docs/en/connectors/source/LocalFile.md:240`    Medium
   > 5  Time-format error message says “date format”    
`seatunnel-common/src/main/java/org/apache/seatunnel/common/exception/CommonErrorCode.java:84`
  Low
   > # 5. Merge Conclusion
   > ### Conclusion: can merge after fixes
   > 1. Blocking items
   > 
   > * Issue 1: supported CSV/Text MAP fields can fail when users explicitly 
configure date/time formats.
   > * Issue 2: the reversed datetime matcher and formatter disagree on `T` 
separators.
   > * Issue 3: performance loops should not run as normal unit tests.
   > * Issue 4: the user-visible parsing enhancement needs English and Chinese 
docs.
   > 
   > 2. Recommended non-blocking item
   > 
   > * Issue 5: please fix the typo in the new time-format error message.
   > 
   > Overall, I like the direction of this PR. The centralized parsing helper 
is useful and the normal source path really does hit the new code. After the 
cache-key fix, the matcher/formatter alignment, the test cleanup, and the docs 
update, this should be in good shape.
   > 
   > Local review notes: I fetched and checked out `seatunnel-review-10486` at 
`56d85bb24c9712d189d03a2d1ba4ed2edf50469d`; merge-base is 
`37999ea5bbac070163b4c9f9e322781fa940a5ee`, and current `upstream/dev` is 
`b9cdfeb79e2df010718acde1d6e45b956a76f4fa`. `git diff --check 
upstream/dev...HEAD` was clean. A temporary local merge into current 
`upstream/dev` reported `Automatic merge went well`. GitHub checks show Build 
passed in 21m52s, with label/notify checks also passing. I did not run Maven 
tests locally.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to