[jira] [Updated] (ARROW-18166) Allow ConvertOptions.timestamp_parsers for date types

Tim Loderhose (Jira) Wed, 26 Oct 2022 05:19:06 -0700


     [ https://issues.apache.org/jira/browse/ARROW-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Loderhose updated ARROW-18166:
----------------------------------
    Description: 
Currently, the timestamp_parsers option of the CSV reader only applies to 
timestamp datatypes.

To read dates directly as date32 (in my use case, CSV data is read and stored 
as Parquet files with the correct types), one has to cast the table to a 
schema with date32 types after reading it. This snippet shows that loading 
the data fails when the date type is specified directly:
{code:python}
import pyarrow as pa
from pyarrow import csv

def open_bytes(b, **kwargs):
    return csv.open_csv(pa.py_buffer(b), **kwargs)

def read_bytes(b, **kwargs):
    return open_bytes(b, **kwargs).read_all()

rows = b"a,b\n1970/01/01,1980-01-01 00\n1970/01/02,1980-01-02 00\n"

# Column "a" as a timestamp: the custom parser is applied.
schema = pa.schema([("a", pa.timestamp("ms")), ("b", pa.string())])
opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
table = read_bytes(rows, convert_options=opts)
assert table.schema == schema # works

# Column "a" as date32: the custom parser is not applied.
schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
table = read_bytes(rows, convert_options=opts) # error here
assert table.schema == schema
{code}
The second read raises:
{code}
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [134], in <cell line: 22>()
     20 schema = pa.schema([("a", pa.date32()), ("b", pa.string())])
     21 opts = csv.ConvertOptions(column_types=schema, timestamp_parsers=["%Y/%m/%d"])
---> 22 table = read_bytes(rows, convert_options=opts)
     23 assert table.schema == schema

Input In [134], in read_bytes(b, **kwargs)
      9 def read_bytes(b, **kwargs):
---> 10     return open_bytes(b, **kwargs).read_all()

Input In [134], in open_bytes(b, **kwargs)
      5 def open_bytes(b, **kwargs):
----> 6     return csv.open_csv(pa.py_buffer(b), **kwargs)

File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1273, in pyarrow._csv.open_csv()

File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/_csv.pyx:1137, in pyarrow._csv.CSVStreamingReader._open()

File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.virtualenvs/ogi/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: In CSV column #0: CSV conversion error to date32[day]: invalid value '1970/01/01'
{code}
It would be useful to allow timestamp_parsers for date types as well, so that 
such errors don't occur and the resulting table has the required datatypes 
without a separate casting step.

A little more context is in the comments of 
https://issues.apache.org/jira/browse/ARROW-10848 (26/Oct/22).


> Allow ConvertOptions.timestamp_parsers for date types
> -----------------------------------------------------
>
>                 Key: ARROW-18166
>                 URL: https://issues.apache.org/jira/browse/ARROW-18166
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Tim Loderhose
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
