fitzee commented on PR #40127:
URL: https://github.com/apache/superset/pull/40127#issuecomment-4456621215
## Griffen Re-Review — Round 2 🦅
*Reviewed by Griffen (AI Distinguished Engineer), with Matt Fitzgerald.
Reach out to Matt directly if anything needs clarifying.*
---
Ok so — the first round of changes were genuinely good.
`_convert_temporal_columns()` with try-strict-first, catch only
`OutOfBoundsDatetime`, then fall back — that's the right pattern. The four new
tests cover the important cases. The inline comment on `n_coerced` is clear.
`result_set.py` is solid.
**ALMOST... but not quite.** 🎯
---
## The Remaining Gap
There's one edge case the current implementation doesn't handle: a column
containing **both** out-of-bounds timestamps **and** malformed values.
```python
# What happens with this input?
df = pd.DataFrame({"ts": ["3118-01-01", "not-a-date"]})
```
Walk through the code:
1. `pd.to_datetime(df["ts"])` — pandas hits `"3118-01-01"` first → raises
`OutOfBoundsDatetime` ✓ caught
2. Fallback: `pd.to_datetime(df["ts"], errors="coerce")` — now **both**
values become `NaT`
3. `"not-a-date"` silently disappears into `NaT`. No error. Import succeeds.
🙈
The behavior is **order-dependent**: if `"not-a-date"` had appeared first,
pandas would raise a `ValueError` (not `OutOfBoundsDatetime`), the except
clause wouldn't catch it, and the import would correctly fail.
So whether a malformed value corrupts your import silently or fails loudly
depends on... where it happens to sit in the CSV. That's not a contract you
want to ship.
---
## The Fix
Row-level fallback in the except block — parse each value individually so
you can distinguish OOB from malformed:
```python
except OutOfBoundsDatetime:
result = []
n_coerced = 0
for val in df[column_name]:
if pd.isna(val):
result.append(pd.NaT)
else:
try:
result.append(pd.to_datetime(val))
except OutOfBoundsDatetime:
result.append(pd.NaT)
n_coerced += 1
# any other parse error (ValueError, ParserError, etc.) re-raises
# preserving the original import-fail behavior for bad data
df[column_name] = result
if n_coerced > 0:
logger.warning(
"Coerced %d out-of-bounds datetime value(s) in column '%s' to
NaT",
n_coerced,
column_name,
)
```
Yes, this is O(n) per-value instead of vectorized — but this is a CSV import
path, not a hot query path. Correctness wins.
And the test to add:
```python
def test_mixed_oob_and_malformed_raises(self) -> None:
"""OOB + malformed in same column: malformed should still fail the
import."""
df = pd.DataFrame({"ts": ["3118-01-01", "not-a-date"]})
with pytest.raises(Exception):
_convert_temporal_columns(df, {"ts": DateTime()})
```
---
## What's Already Good (and stays good)
- `result_set.py` — no changes needed, ship it
- The 4 existing tests — all valid, keep them
- The overall structure of `_convert_temporal_columns()` — right idea, just
needs the row-level fallback
---
So close! One more pass on the fallback and this is done. 🦅
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]