pearu opened a new pull request, #50146:
URL: https://github.com/apache/arrow/pull/50146

   ### Rationale for this change
   
   CSV columns explicitly typed as `date32`, `date64`, `time32` or `time64` can 
only be parsed from strict ISO-8601 strings: 
`ConvertOptions::timestamp_parsers` is consulted only for `timestamp` columns. 
Reading e.g. `15-OCT-15` into a `date32` column fails even with 
`timestamp_parsers=["%d-%b-%y"]`, and `7:55:00` (non-zero-padded hour) fails 
for `time32[s]`. Users currently work around this by declaring such columns as 
`timestamp`, reading, then casting back to the date/time type.
   
   Closes #28303.
   
   ### What changes are included in this PR?
   
   - A new `DateTimeWithParsersValueDecoder` in `csv/converter.cc`, used for 
date32/date64/time32/time64 columns when `timestamp_parsers` is non-empty. It 
tries the built-in ISO-8601 parser first (preserving all existing behavior), 
then each configured parser in order. A timestamp produced by a fallback parser 
is floored to the day boundary for dates and reduced to the time of day for 
times, consistent with casting a timestamp to a date or time type. Values 
carrying a zone offset are rejected, as for zone-less timestamp columns. When 
no parsers are configured, the pre-existing decoder is used unchanged.
   - Type inference is deliberately unaffected: the Date/Time inference stages 
now explicitly use options with `timestamp_parsers` cleared, so inference keeps 
strict ISO-8601 semantics (otherwise a value with a time-of-day part could be 
inferred as a date and silently truncated). The existing 
`test_timestamp_parsers` Python test pins this behavior.
   - Documentation of the fallback and flooring semantics in 
`ConvertOptions::timestamp_parsers` (C++ and Python docstrings) and a new "Date 
and time parsing" section in the C++ CSV user guide.
   
   ### Are these changes tested?
   
   Yes:
   - New C++ tests (`Date32Conversion.UserDefinedParsers`, 
`Date64Conversion.UserDefinedParsers`, `Time32Conversion.UserDefinedParsers`, 
`Time64Conversion.UserDefinedParsers`) covering custom formats, mixed ISO + 
custom values in one column (backward compatibility of ISO values when parsers 
are set), pre-epoch flooring with a time-of-day component (distinguishes floor 
from truncating division), time-of-day extraction from pre-epoch timestamps, 
zone-offset rejection, and error cases.
   - New Python tests with the reproducers from #28303 and #41488, plus an 
inference-unchanged guard.
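
Why the pre-epoch cases distinguish flooring from truncating division can be seen with plain integer arithmetic. The following is a minimal sketch on a timestamp in seconds; the helper names are hypothetical and the code is not taken from the PR's tests.

```python
SECONDS_PER_DAY = 86_400


def floor_days(ts_seconds):
    """Days since the epoch, rounded toward negative infinity."""
    return ts_seconds // SECONDS_PER_DAY


def truncated_days(ts_seconds):
    """Days since the epoch, rounded toward zero (C-style integer division)."""
    days = abs(ts_seconds) // SECONDS_PER_DAY
    return days if ts_seconds >= 0 else -days


def time_of_day(ts_seconds):
    """Seconds since midnight; non-negative for pre-epoch timestamps too."""
    return ts_seconds % SECONDS_PER_DAY


pre_epoch = -3600  # 1969-12-31T23:00:00, one hour before the epoch

print(floor_days(pre_epoch))      # -1    -> 1969-12-31 (correct day)
print(truncated_days(pre_epoch))  # 0     -> 1970-01-01 (wrong day)
print(time_of_day(pre_epoch))     # 82800 -> 23:00:00
```

For timestamps at or after the epoch the two divisions agree, which is why only a pre-epoch value with a non-zero time of day exposes the difference.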
   
   ### Are there any user-facing changes?
   
   Yes: `ConvertOptions::timestamp_parsers` now also applies, as a fallback 
after ISO-8601, to columns explicitly typed as date32/date64/time32/time64 
(previously, non-ISO values in such columns always failed to parse). No 
breaking changes: behavior without `timestamp_parsers` is untouched, ISO 
values keep parsing when parsers are set, and type inference is unchanged. 
All language bindings gain the behavior without API changes.
   
   ### AI usage disclosure
   
   This PR was developed with AI assistance (Claude Code): the decoder, tests 
and documentation were AI-generated under my direction, then reviewed 
line-by-line and iterated on by me. That iteration covered the design 
decisions (fallback-after-ISO semantics, silent flooring, inference isolation) 
and several implementation details adjusted during review. I own and can 
debug these changes.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

