[jira] [Updated] (ARROW-2160) [C++/Python] Decimal precision inference

Antony Mayi (JIRA) Wed, 14 Feb 2018 15:23:53 -0800

     [ https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Antony Mayi updated ARROW-2160:
-------------------------------
    Description: 
{code}
import pyarrow as pa
import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
pa.Table.from_pandas(df)
{code}

raises:
{code}
pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
precision inferred from first array element: 1
{code}

It looks like arrow infers the maximum precision of a column from its first 
cell and expects every other value to fit. I understand this is by design, but 
from the point of view of pandas-arrow compatibility it is quite painful, as 
pandas is more flexible (as demonstrated).

This means that a user trying to pass a pandas {{DataFrame}} with 
{{Decimal}} column(s) to an arrow {{Table}} would always first have to:
# Find the highest precision used in each such column
# Adjust the first cell of each such column so that it explicitly uses the 
highest precision of that column
# Only then pass the {{DataFrame}} to {{Table.from_pandas()}}
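
As a rough illustration of steps 1 and 2, here is a minimal sketch using only the standard {{decimal}} module. The helper name {{pad_first_to_max_scale}} is made up for this example, and it only widens the number of fractional digits of the first cell (a {{Decimal}} cannot carry leading zeros, so extra integer digits cannot be forced this way). Whether arrow then infers the wider precision from the padded value is an assumption, not something checked here.

```python
import decimal


def pad_first_to_max_scale(values):
    """Return a copy of `values` whose first element carries as many
    fractional digits as the widest value in the column.

    Hypothetical helper; assumes a non-empty list of finite Decimals.
    """
    # Step 1: find the largest number of fractional digits in the column.
    max_scale = max(max(-v.as_tuple().exponent, 0) for v in values)
    # Step 2: rescale the first cell to that many fractional digits.
    quantum = decimal.Decimal(1).scaleb(-max_scale)
    return [values[0].quantize(quantum)] + list(values[1:])


column = [decimal.Decimal('0.1'), decimal.Decimal('0.01')]
padded = pad_first_to_max_scale(column)
```

The numeric value of the first cell is unchanged; only its textual form gains a trailing zero. With pandas, the same helper could be applied to the list of values in {{df['a']}} before step 3.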

So given this unavoidable procedure (and assuming arrow needs to be strict 
about the highest precision for a column), shouldn't similar logic be 
part of {{Table.from_pandas()}} itself to make this transparent?
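
To make the suggestion concrete, the kind of logic I have in mind scans the whole column instead of only the first cell. The sketch below is plain Python with a made-up function name ({{infer_precision_and_scale}}); it is not how arrow implements inference, just an illustration of the idea, and it assumes finite values with {{None}} standing in for missing cells.

```python
import decimal


def infer_precision_and_scale(values):
    """Return a (precision, scale) pair wide enough for every value.

    Hypothetical sketch; assumes finite Decimals, with None for nulls.
    """
    max_int_digits = 0
    max_scale = 0
    for value in values:
        if value is None:
            continue  # nulls do not affect the inferred type
        _, digits, exponent = value.as_tuple()
        if exponent < 0:
            # Fractional digits present: the scale is the negated exponent.
            scale = -exponent
            int_digits = max(len(digits) - scale, 0)
        else:
            # Integer value, possibly with trailing zeros implied (e.g. 1E+3).
            scale = 0
            int_digits = len(digits) + exponent
        max_int_digits = max(max_int_digits, int_digits)
        max_scale = max(max_scale, scale)
    # Keep the precision at 1 or more, even for an empty or all-null column.
    return max(max_int_digits + max_scale, 1), max_scale


column = [decimal.Decimal('0.1'), decimal.Decimal('0.01')]
precision, scale = infer_precision_and_scale(column)
```

For the column from the example above this gives a precision of 2 and a scale of 2, which is what the second value needs according to the error message.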

  was:
{code}
import pyarrow as pa
import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
pa.Table.from_pandas(df)
{code}

raises:
{code}
pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
precision inferred from first array element: 1
{code}

Looks arrow is inferring the highest precision for given column based on the 
first cell and expecting the rest fits in. I understand this is by design but 
from the point of view of pandas-arrow compatibility this is quite painful as 
pandas is more flexible (as demonstrated).

What this means is that user trying to pass pandas {{DataFrame}} with 
{{Decimal}} column(s) to arrow {{Table}} would always have to first:
# Find the highest precision used in (each of) that column(s)
# Adjust the first cell of (each of) that column(s) so it has the highest 
precision of that column(s)
# Only then pass such {{DataFrame}} to {{Table.from_pandas()}}

So given this unavoidable procedure (and assuming arrow needs to be strict 
about the highest precision for a column) - shouldn't some similar logic be 
part of the {{Table.from_pandas()}} directly to make this transparent?


> [C++/Python]  Decimal precision inference
> -----------------------------------------
>
>                 Key: ARROW-2160
>                 URL: https://issues.apache.org/jira/browse/ARROW-2160
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: Antony Mayi
>            Assignee: Phillip Cloud
>            Priority: Major
>             Fix For: 0.9.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
