[
https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Loderhose updated ARROW-18166:
----------------------------------
Priority: Minor (was: Major)
> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
> Key: ARROW-18166
> URL: https://issues.apache.org/jira/browse/ARROW-18166
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Tim Loderhose
> Priority: Minor
>
> Currently, the timestamp_parsers option of the CSV reader only works for
> timestamp datatypes.
> If one wants to read dates directly as date32 (in my use case, CSV data is
> read and stored as Parquet files with correct types), one has to cast the
> table to a schema with date32 types after the fact. This snippet shows that
> loading the data fails when the date type is specified directly:
> {code:java}
> import pyarrow as pa
> from pyarrow import csv
> def open_bytes(b, **kwargs):
>     return csv.open_csv(pa.py_buffer(b), **kwargs)
> def read_bytes(b, **kwargs):
>     return open_bytes(b, **kwargs).read_all()
> rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"
> schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts)
> assert table.schema == schema # works
> schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
> opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> table = read_bytes(rows, convert_options=opts) # error here
> assert table.schema == schema
>
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Input In [134], in <cell line: 22>()
>      20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
>      21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
> ---> 22 table = read_bytes(rows, convert_options=opts)
>      23 assert table.schema == schema
>
> Input In [134], in read_bytes(b, **kwargs)
>       9 def read_bytes(b, **kwargs):
> ---> 10     return open_bytes(b, **kwargs).read_all()
>
> Input In [134], in open_bytes(b, **kwargs)
>       5 def open_bytes(b, **kwargs):
> ----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
>
> File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
>
> ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
> {code}
> It would be useful to allow timestamp_parsers for date types as well, so
> that such errors don't occur and the resulting table has the required
> datatypes without a casting step.
>
> A little bit more context is in the comments of
> https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)