[ 
https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624495#comment-17624495
 ] 

Tim Loderhose commented on ARROW-18166:
---------------------------------------

My main concern is the API - specifying the date types directly is just neater 
than having to specify an alternate schema that is used for loading (with 
timestamps), and then casting to a schema that is used for writing (with dates).

I use date32 mainly because 32 bits are enough to store these dates, and 
there's a lot of data (so this could lead to using less storage).

If there's a solution that both simplifies usage and provides good performance, 
that would of course be best - I don't know enough about Arrow's internals, 
though, to contribute to this discussion beyond stating what would be 
useful.

Usage is actually in pandas, so this data ironically gets converted back to 
timestamps when using the data.

(I've spent way too much time trying to come up with a good solution here - the 
problem is that the CSVs we receive are not validated, so there could be dates 
like 0001/01/01 which mess up timestamp conversions. That's why we settled on 
storing what we know should be the correct type for the given data.)
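One plausible reason such a date breaks timestamp conversions is the limited range of 64-bit nanosecond timestamps. The sketch below assumes the conversions in question involve nanosecond resolution (as with pandas' default `datetime64[ns]`), which is not stated above; the `fits` helper is made up for illustration.

```python
from datetime import date


def fits(value, bits):
    """Return True if value is representable as a signed integer of this width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


# Offset of 0001-01-01 from the Unix epoch, in whole days (negative).
offset_days = (date(1, 1, 1) - date(1970, 1, 1)).days

# date32 stores a signed 32-bit day count.
fits_date32 = fits(offset_days, 32)

# Timestamps store a signed 64-bit count of the chosen unit.
fits_ts_ms = fits(offset_days * 86_400 * 1_000, 64)
fits_ts_ns = fits(offset_days * 86_400 * 1_000_000_000, 64)
```

Under that assumption, the day count and the millisecond count both fit, while the nanosecond count does not.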

I did not notice too big of an impact in performance when using 
`date_as_object=False` in the `to_pandas()` API, so this is fine.

 

> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
>                 Key: ARROW-18166
>                 URL: https://issues.apache.org/jira/browse/ARROW-18166
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Tim Loderhose
>            Priority: Minor
>
> Currently, the timestamp_parsers option of the CSV reader only works for 
> timestamp datatypes.
> If one wants to immediately read dates as date32 objects (in my use case CSV 
> data is read and stored as Parquet files with correct types), one has to cast 
> the table to a schema with date32 types after the fact.
> This snippet shows that loading the data fails when specifying the date type:
> {code:python}
> import pyarrow as pa
> from pyarrow import csv
>
> def open_bytes(b, **kwargs):
>     # Open the in-memory CSV bytes as a streaming reader.
>     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> def read_bytes(b, **kwargs):
>     # Read the whole stream into a single table.
>     return open_bytes(b, **kwargs).read_all()
>
> rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"
>
> # Column "a" typed as a timestamp: the custom parser is applied.
> schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts)
> assert table.schema == schema # works
>
> # Column "a" typed as date32: the custom parser is not applied.
> schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts) # error here
> assert table.schema == schema
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Input In [134], in <cell line: 22>()
>      20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
>      21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> ---> 22 table = read_bytes(rows, convert_options=opts)
>      23 assert table.schema == schema
>
> Input In [134], in read_bytes(b, **kwargs)
>       9 def read_bytes(b, **kwargs):
> ---> 10     return open_bytes(b, **kwargs).read_all()
>
> Input In [134], in open_bytes(b, **kwargs)
>       5 def open_bytes(b, **kwargs):
> ----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
>
> ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
> {code}
> It would be useful to allow the timestamp_parsers for date types as well (or 
> add an analogous argument for dates), such that such errors don't occur and 
> the resulting table has the required datatypes without a casting step.
>  
> A little bit more context is in the comments of 
> https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
