[jira] [Updated] (ARROW-18166) Allow ConvertOptions.timestamp_parsers for date types

Tim Loderhose (Jira) Wed, 26 Oct 2022 05:18:05 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Loderhose updated ARROW-18166:
----------------------------------
    Priority: Minor  (was: Major)

> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
>                 Key: ARROW-18166
>                 URL: https://issues.apache.org/jira/browse/ARROW-18166
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Tim Loderhose
>            Priority: Minor
>
> Currently, the timestamp_parsers option of the CSV reader only works for
> timestamp datatypes.
> If one wants to read dates directly as date32 values (in my use case, CSV
> data is read and stored as Parquet files with correct types), one has to cast
> the table to a schema with date32 types after the fact. This snippet shows
> that loading the data fails when the date type is specified directly:
> {code:python}
> import pyarrow as pa
> from pyarrow import csv
>
> def open_bytes(b, **kwargs):
>     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> def read_bytes(b, **kwargs):
>     return open_bytes(b, **kwargs).read_all()
>
> rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"
>
> # Column "a" as a timestamp: the custom parser is applied.
> schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts)
> assert table.schema == schema # works
>
> # Column "a" as date32: the custom parser is not applied.
> schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts) # error here
> assert table.schema == schema
>
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Input In [134], in <cell line: 22>()
>      20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
>      21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> ---> 22 table = read_bytes(rows, convert_options=opts)
>      23 assert table.schema == schema
>
> Input In [134], in read_bytes(b, **kwargs)
>       9 def read_bytes(b, **kwargs):
> ---> 10     return open_bytes(b, **kwargs).read_all()
>
> Input In [134], in open_bytes(b, **kwargs)
>       5 def open_bytes(b, **kwargs):
> ----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
>
> ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
> {code}
> It would be useful if timestamp_parsers also applied to date types, so that
> this error does not occur and the resulting table has the required datatypes
> without a separate casting step.
>  
> A little bit more context is in the comments of 
> https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
