[
https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624484#comment-17624484
]
Rok Mihevc commented on ARROW-18166:
------------------------------------
Thanks for the extensive Jira, [~timlod]! I see this on Arrow 9.0.0 as well.
It seems we currently don't have a DateParser the way we have the
[TimestampParser|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/value_parsing.cc#L49],
and the return type is always `timestamp("xs")`. One solution here would
be to add a DateParser that accepts an arbitrary format and returns `date32` or
`date64` (as ARROW-10847 suggests).
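As a rough illustration of what such a parser would compute, here is a pure-Python sketch. It is not Arrow's C++ implementation, and `parse_date32` is a hypothetical name used only for this example. The assumption is simply that a `date32` value is the number of days since the UNIX epoch:

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def parse_date32(value, fmt):
    """Hypothetical helper: parse `value` with a strptime-style format and
    return the number of days since the UNIX epoch (the date32 encoding)."""
    parsed = datetime.strptime(value, fmt).date()
    return (parsed - EPOCH).days


# The values from column "a" of the reproducer in the issue description.
days = [parse_date32(s, "%Y/%m/%d") for s in ["1970/01/01", "1970/01/02"]]
print(days)
```

A real DateParser would presumably plug into the CSV converter the same way the timestamp parsers do; that integration is not shown here.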
Note: `date64` is defined as milliseconds since the UNIX epoch, so casting it to
`timestamp("ms")` is just a metadata change. Casting `date32` to
`timestamp("s")`, on the other hand, requires a multiplication by 86400 plus the
metadata change (which is computationally equivalent to what the proposed
DateParser would do).
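The arithmetic behind the two casts can be sketched in plain Python. The function names below are hypothetical and only model the integer values involved, not Arrow's cast kernels:

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 86400
MILLIS_PER_DAY = 86_400_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date32_to_timestamp_s(days):
    # date32 counts days, timestamp("s") counts seconds: scale by 86400.
    return days * SECONDS_PER_DAY


def date64_to_timestamp_ms(millis):
    # date64 and timestamp("ms") both count milliseconds since the epoch,
    # so the stored integer is unchanged; only the type label differs.
    return millis


days = 3652  # 1980-01-01 as a date32 value
seconds = date32_to_timestamp_s(days)
millis = date64_to_timestamp_ms(days * MILLIS_PER_DAY)
print(seconds, EPOCH + timedelta(seconds=seconds))
```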
[~timlod] Is your main concern performance or the API?
[~apitrou] What do you think about adding DateParser to handle the gap here?
> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
> Key: ARROW-18166
> URL: https://issues.apache.org/jira/browse/ARROW-18166
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Tim Loderhose
> Priority: Minor
>
> Currently, the timestamp_parsers option of the CSV reader only works for
> timestamp datatypes.
> If one wants to immediately read dates as date32 objects (in my use case csv
> data is read and stored as parquet files with correct types), one has to cast
> the table to a schema with date32 types after the fact.
> This snippet shows that loading the data fails when specifying the date type:
> {code:java}
> import pyarrow as pa
> from pyarrow import csv
> def open_bytes(b, **kwargs):
>     return csv.open_csv(pa.py_buffer(b), **kwargs)
> def read_bytes(b, **kwargs):
>     return open_bytes(b, **kwargs).read_all()
> rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"
> schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts)
> assert table.schema == schema # works
> schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts) # error here
> assert table.schema == schema
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Input In [134], in <cell line: 22>()
>      20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
>      21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> ---> 22 table = read_bytes(rows, convert_options=opts)
>      23 assert table.schema == schema
>
> Input In [134], in read_bytes(b, **kwargs)
>       9 def read_bytes(b, **kwargs):
> ---> 10     return open_bytes(b, **kwargs).read_all()
>
> Input In [134], in open_bytes(b, **kwargs)
>       5 def open_bytes(b, **kwargs):
> ----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
>
> ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
> {code}
> It would be useful to allow the timestamp_parsers for date types as well (or
> add an analogous argument for dates), so that such errors don't occur and
> the resulting table has the required datatypes without a casting step.
>
> A little bit more context is in the comments of
> https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).