[jira] [Updated] (ARROW-9065) [Python] Support parsing date32 in dataset partition folders

Joris Van den Bossche (Jira) Thu, 11 Jun 2020 05:08:15 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-9065:
-----------------------------------------
    Description: 
I have some data that is partitioned by year/month/day, where the day folder 
holds a full date. It would be useful if the date could be parsed automatically:

{code:python}

In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.date32())])

In [18]: partition = DirectoryPartitioning(schema)

In [19]: partition.parse("/2020/06/2020-06-08")
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-19-c227c808b401> in <module>
----> 1 partition.parse("/2020/06/2020-06-08")

~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in pyarrow._dataset.Partitioning.parse()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: parsing scalars of type date32[day]
{code}


Not a big issue, since you can declare the field as a string and convert it 
afterwards, but it would nevertheless be nice if it Just Worked.
{code:python}

In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.string())])

In [23]: partition = DirectoryPartitioning(schema)

In [24]: partition.parse("/2020/06/2020-06-08")
Out[24]: <pyarrow.dataset.AndExpression (((year == 2020:int16) and (month == 6:int8)) and (day == 2020-06-08:string))>
{code}
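
For reference, the "use string and convert" step could look roughly like the 
sketch below. It uses only the standard library and is not how pyarrow does 
(or would do) the parsing. The helper name `parse_partition_path` is 
hypothetical, and it assumes the year/month/day folder layout above with an 
ISO-formatted date in the last folder. The integer it computes is the number 
of days since 1970-01-01, which is how date32 represents a date.

```python
from datetime import date

# date32 values count days since the UNIX epoch.
EPOCH = date(1970, 1, 1)


def parse_partition_path(path):
    """Hypothetical helper: split '/year/month/day' and type each segment.

    Raises ValueError if the path does not have exactly three segments
    or if the last segment is not an ISO date (YYYY-MM-DD).
    """
    year, month, day = path.strip("/").split("/")
    parsed_day = date.fromisoformat(day)
    return {
        "year": int(year),
        "month": int(month),
        "day": parsed_day,
        # Integer form matching the date32 representation.
        "day_as_date32": (parsed_day - EPOCH).days,
    }


print(parse_partition_path("/2020/06/2020-06-08"))
```

With pyarrow available, the resulting `datetime.date` can be wrapped with 
`pa.scalar(parsed_day, type=pa.date32())`.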

  was:
I have some data which is partitioned by year/month/date. It would be useful if 
the date could be automatically parsed:
```python

In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
("day", pa.date32())])

In [18]: partition = DirectoryPartitioning(schema)

In [19]: partition.parse("/2020/06/2020-06-08")
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-19-c227c808b401> in <module>
----> 1 partition.parse("/2020/06/2020-06-08")

~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in 
pyarrow._dataset.Partitioning.parse()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: parsing scalars of type date32[day]

```


Not a big issue since you can just use string and convert, but nevertheless it 
would be nice if it Just Worked
```python

In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
("day", pa.string())])

In [23]: partition = DirectoryPartitioning(schema)

In [24]: partition.parse("/2020/06/2020-06-08")
Out[24]: <pyarrow.dataset.AndExpression (((year == 2020:int16) and (month == 
6:int8)) and (day == 2020-06-08:string))>
```


> [Python] Support parsing date32 in dataset partition folders
> ------------------------------------------------------------
>
>                 Key: ARROW-9065
>                 URL: https://issues.apache.org/jira/browse/ARROW-9065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Dave Hirschfeld
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
