[ 
https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624484#comment-17624484
 ] 

Rok Mihevc commented on ARROW-18166:
------------------------------------

Thanks for the extensive Jira [~timlod] ! I see this on Arrow 9.0.0 as well.

It seems we currently don't have a DateParser analogous to the 
[TimestampParser|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/value_parsing.cc#L49]
, so the parsed type will always be `timestamp("xs")`. One solution here would 
be to add a DateParser that accepts an arbitrary format and returns `date32` or 
`date64` (as ARROW-10847 suggests).
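
To make the proposal concrete, here is a minimal sketch of what such a parser would compute, written in plain Python rather than Arrow's C++. The names `parse_date32` and `parse_date64` are hypothetical and not part of any Arrow API, and Python's `strptime` only approximates the format handling of the C++ parser.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)
MILLIS_PER_DAY = 86_400_000


def parse_date32(value, formats):
    """Hypothetical helper: try each strptime format in order and
    return the number of days since the UNIX epoch (date32 semantics)."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
        return (parsed - EPOCH).days
    raise ValueError(f"invalid value {value!r} for formats {formats!r}")


def parse_date64(value, formats):
    """Hypothetical helper: same parse, expressed as milliseconds
    since the UNIX epoch (date64 semantics)."""
    return parse_date32(value, formats) * MILLIS_PER_DAY


print(parse_date32("1970/01/02", ["%Y/%m/%d"]))
print(parse_date64("1970/01/02", ["%Y/%m/%d"]))
```

Trying the formats in order and falling through on failure mirrors how a list of parsers is meant to behave; whether a real DateParser would share code with TimestampParser is an open design question.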

Note: `date64` is defined as milliseconds since the UNIX epoch, so casting it to 
`timestamp("ms")` is just a metadata change. On the other hand, casting `date32` 
to `timestamp("s")` requires multiplication by 86400 plus the metadata change 
(which is computationally equivalent to what the proposed DateParser would do).
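
The arithmetic in the note can be sketched in plain Python. This is an illustration of the integer representations only, not Arrow's cast implementation, and the function names are hypothetical.

```python
from datetime import date, datetime, timezone

SECONDS_PER_DAY = 86_400


def date32_to_timestamp_s(days):
    """date32 stores days since the epoch; timestamp("s") stores seconds,
    so the stored integer has to be multiplied."""
    return days * SECONDS_PER_DAY


def date64_to_timestamp_ms(millis):
    """date64 and timestamp("ms") both store milliseconds since the epoch,
    so the stored integer is unchanged and only the type label differs."""
    return millis


# Days between the epoch and 1980-01-02, one of the dates in the report.
days = (date(1980, 1, 2) - date(1970, 1, 1)).days
seconds = date32_to_timestamp_s(days)

# Round-trip through the standard library to confirm the conversion.
as_datetime = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(as_datetime.date())
```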

[~timlod] Is your main concern performance or the API?
[~apitrou] What do you think about adding DateParser to handle the gap here?

> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
>                 Key: ARROW-18166
>                 URL: https://issues.apache.org/jira/browse/ARROW-18166
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Tim Loderhose
>            Priority: Minor
>
> Currently, the timestamp_parsers option of the CSV reader only works for 
> timestamp datatypes.
> If one wants to immediately read dates as date32 objects (in my use case csv 
> data is read and stored as parquet files with correct types), one has to cast 
> the table to a schema with date32 types after the fact.
> This snippet shows that loading the data fails when specifying the date type:
> {code:python}
> import pyarrow as pa
> from pyarrow import csv
> def open_bytes(b, **kwargs):
>     return csv.open_csv(pa.py_buffer(b), **kwargs)
> def read_bytes(b, **kwargs):
>     return open_bytes(b, **kwargs).read_all()
> rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"
> schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts)
> assert table.schema == schema # works
> schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts) # error here
> assert table.schema == schema
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Input In [134], in <cell line: 22>()
>      20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
>      21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> ---> 22 table = read_bytes(rows, convert_options=opts)
>      23 assert table.schema == schema
>
> Input In [134], in read_bytes(b, **kwargs)
>       9 def read_bytes(b, **kwargs):
> ---> 10     return open_bytes(b, **kwargs).read_all()
>
> Input In [134], in open_bytes(b, **kwargs)
>       5 def open_bytes(b, **kwargs):
> ----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
>
> ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
> {code}
> It would be useful to allow the timestamp_parsers for date types as well (or 
> add an analogous argument for dates), so that such errors don't occur and 
> the resulting table has the required datatypes without a casting step.
>  
> A little bit more context is in the comments of 
> https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
