[ https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antony Mayi updated ARROW-2160:
-------------------------------
    Description:
{code}
import pyarrow as pa
import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
pa.Table.from_pandas(df)
{code}
raises:
{code}
pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into precision inferred from first array element: 1
{code}
Looks like arrow is inferring the highest precision for a given column based on the first cell and expecting the rest to fit. I understand this is by design, but from the point of view of pandas-arrow compatibility this is quite painful, as pandas is more flexible (as demonstrated).

What this means is that a user trying to pass a pandas {{DataFrame}} with {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
# Find the highest precision used in (each of) that column(s)
# Adjust the first cell of (each of) that column(s) so that it explicitly uses the highest precision of that column(s)
# Only then pass such {{DataFrame}} to {{Table.from_pandas()}}

So given this unavoidable procedure (and assuming arrow needs to be strict about the highest precision for a column) - shouldn't some similar logic be part of {{Table.from_pandas()}} directly to make this transparent?

  was:
{code}
import pyarrow as pa
import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
pa.Table.from_pandas(df)
{code}
raises:
{code}
pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into precision inferred from first array element: 1
{code}
Looks like arrow is inferring the highest precision for a given column based on the first cell and expecting the rest to fit. I understand this is by design, but from the point of view of pandas-arrow compatibility this is quite painful, as pandas is more flexible (as demonstrated).

What this means is that a user trying to pass a pandas {{DataFrame}} with {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
# Find the highest precision used in (each of) that column(s)
# Adjust the first cell of (each of) that column(s) so it explicitly uses the highest precision of that column(s)
# Only then pass such {{DataFrame}} to {{Table.from_pandas()}}

So given this unavoidable procedure (and assuming arrow needs to be strict about the highest precision for a column) - shouldn't some similar logic be part of {{Table.from_pandas()}} directly to make this transparent?


> [C++/Python] Decimal precision inference
> -----------------------------------------
>
>                 Key: ARROW-2160
>                 URL: https://issues.apache.org/jira/browse/ARROW-2160
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: Antony Mayi
>            Assignee: Phillip Cloud
>            Priority: Major
>             Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into
> precision inferred from first array element: 1
> {code}
> Looks like arrow is inferring the highest precision for a given column based
> on the first cell and expecting the rest to fit. I understand this is by
> design, but from the point of view of pandas-arrow compatibility this is quite
> painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column) - shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
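
The manual procedure listed in the description (find the widest value in a column, then adjust cells to match) can be sketched with the standard library alone. This is only an illustration of the workaround, not how pyarrow performs its inference: the helper names `decimal_precision_scale` and `rescale_decimals` are hypothetical, the values are assumed to be finite, and the sketch rescales every cell rather than only the first one. It has not been checked against pyarrow 0.8.0.

```python
import decimal


def decimal_precision_scale(values):
    """Return the smallest (precision, scale) able to hold every finite value.

    Hypothetical helper for illustration; NaN and infinities are not handled.
    """
    max_scale = 0
    max_int_digits = 0
    for value in values:
        _, digits, exponent = value.as_tuple()
        # Digits after the decimal point (0 for integers and positive exponents).
        max_scale = max(max_scale, -exponent if exponent < 0 else 0)
        # Digits before the decimal point (0 for values such as 0.01).
        max_int_digits = max(max_int_digits, len(digits) + exponent, 0)
    return max(1, max_int_digits + max_scale), max_scale


def rescale_decimals(values):
    """Return the values rewritten with a common scale; numeric values are unchanged."""
    _, scale = decimal_precision_scale(values)
    quantum = decimal.Decimal(1).scaleb(-scale)  # e.g. Decimal('0.01') for scale 2
    return [value.quantize(quantum) for value in values]


column = [decimal.Decimal('0.1'), decimal.Decimal('0.01')]
precision, scale = decimal_precision_scale(column)
rescaled = rescale_decimals(column)
```

With the column from the report, the first cell becomes `Decimal('0.10')`, so it carries the same scale as the second cell before the column is handed to {{Table.from_pandas()}}. Note that this only aligns the number of fractional digits; a `Decimal` cannot carry leading zeros, so a column whose values differ in the number of integer digits would need a different approach.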