fitzee commented on PR #40127: URL: https://github.com/apache/superset/pull/40127#issuecomment-4483522272
## Proposed fix for the MEDIUM blocker (mixed OOB + malformed column) Hi @eschutho — to save another review round, here's the exact fix for the invariant issue flagged in prior reviews. The commit is at [`fitzee/superset:griffen/40127-fix`](https://github.com/fitzee/superset/tree/griffen/40127-fix) if you want to cherry-pick. --- ### The issue `_convert_temporal_columns()` catches `OutOfBoundsDatetime` on the whole column, then falls back to `errors="coerce"` on the whole column. This means a column like `["3118-01-01", "not-a-date"]` silently coerces "not-a-date" → NaT instead of raising, because the OOB exception is caught first and the fallback coerces everything. ### The fix — `superset/commands/dataset/importers/v1/utils.py` Replace the `except OutOfBoundsDatetime:` block: ```python except OutOfBoundsDatetime: # Row-level fallback: coerce only OOB values; re-raise for malformed # strings. Whole-column errors="coerce" would silently swallow # malformed values that happen to share a column with an OOB date. original = df[column_name].copy() result = [] for val in original: if pd.isna(val): result.append(pd.NaT) continue try: result.append(pd.to_datetime(val)) except OutOfBoundsDatetime: result.append(pd.NaT) # Other exceptions (e.g. malformed strings) propagate converted = pd.Series(result, index=original.index) n_coerced = int(converted.isna().sum() - original.isna().sum()) if n_coerced > 0: logger.warning( "Coerced %d out-of-bounds datetime value(s) " "in column '%s' to NaT", n_coerced, column_name, ) df[column_name] = converted ``` ### Test changes — `tests/unit_tests/commands/importers/v1/utils_test.py` 1. **Add parametrized mixed-column test** (the missing proof): ```python @pytest.mark.parametrize( "values", [ ["3118-01-01", "not-a-date"], ["not-a-date", "3118-01-01"], ], ) def test_mixed_out_of_bounds_and_malformed_still_raises( self, values: list[str] ) -> None: """ A column mixing out-of-bounds and malformed dates must raise, not silently coerce the malformed value to NaT. Both orderings are tested to ensure the invariant holds regardless of which error pandas encounters first. """ from sqlalchemy import DateTime from superset.commands.dataset.importers.v1.utils import ( _convert_temporal_columns, ) df = pd.DataFrame({"ts": values}) with pytest.raises((ValueError, pd.errors.ParserError)): _convert_temporal_columns(df, {"ts": DateTime()}) ``` 2. **Fix exception type** in `test_malformed_dates_still_raise`: ```python # Before with pytest.raises(pd.errors.ParserError): # After with pytest.raises((ValueError, pd.errors.ParserError)): ``` 3. **Remove unused helpers** `_make_dataset_col` and `_apply_temporal_conversion` (and the `MagicMock` import). --- With the row-level fallback in place, both mixed-column parametrize cases raise correctly. The existing tests (`test_out_of_bounds_coerced_to_nat`, `test_warning_count_excludes_preexisting_nulls`) continue to pass unchanged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
