[
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fons de Leeuw updated ARROW-8545:
---------------------------------
Description:
Currently, when one wants to use a decimal datatype in Pandas, the only
possibility is to use the `decimal.Decimal` standard-library type. The result is
an "object" column in the DataFrame.
Arrow can write a column of decimal type to Parquet, which is quite impressive
given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types]
at all. However, the writing is *very* slow: in the code snippet below, a factor of 4.
*Improvements*
Of course the best outcome would be if the conversion of a decimal column could
be made faster, but I am not familiar enough with pandas internals to know if
that's possible. (This same behavior also applies to `.to_pickle` etc.)
It would be nice if a warning were shown when object-typed columns are being
converted, since that is very slow. That would at least make this behavior more
explicit.
Now, if fast parsing of a `decimal.Decimal` object column is not possible, it
would be nice if a workaround were available. For example, pass an int and then
shift the dot "x" places to the left. (It is already possible to pass an int
column and specify a "decimal" dtype in the Arrow schema during
`pa.Table.from_pandas()`, but then it simply becomes a decimal with no
fractional digits.) Also, it might be nice if the value could be encoded as a
128-bit byte string in the pandas column and then interpreted directly by Arrow.
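The "shift the dot" idea maps directly onto how `decimal.Decimal` is stored
internally: a sign, an unscaled integer, and an exponent, which is the same
(unscaled value, scale) pair that an Arrow decimal column stores. A minimal
stdlib-only sketch of that round trip:

{code:python}
import decimal

# A Decimal is internally a sign, an unscaled integer, and an exponent --
# the same (unscaled value, scale) pair that an Arrow decimal column stores.
d = decimal.Decimal("12.345")
sign, digits, exponent = d.as_tuple()
unscaled = int("".join(map(str, digits)))  # 12345

# Shifting the dot |exponent| places reconstructs the value exactly:
assert decimal.Decimal(unscaled).scaleb(exponent) == d
{code}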
*Use case*
I need to save large dataframes (~10GB) of geospatial data with
latitude/longitude. I can't use float as comparisons need to be exact, and the
BigQuery "clustering" feature needs either an integer or a decimal but not a
float. In the meantime, I have to do a workaround where I use only ints (the
original number multiplied by 1000).
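The int workaround above can be sketched as a pair of helpers (a sketch
assuming a fixed scale of 3 fractional digits; the function names are
illustrative, not an existing API):

{code:python}
import decimal

import pandas as pd

SCALE = 3  # fixed number of fractional digits (assumption for this sketch)

def to_scaled_int(s: pd.Series) -> pd.Series:
    # 12.345 -> 12345; store the unscaled value in a fast int64 column
    return (s * 10**SCALE).round().astype("int64")

def to_decimal(s: pd.Series) -> pd.Series:
    # shift the dot back to recover exact Decimal values on read
    return s.map(lambda v: decimal.Decimal(int(v)).scaleb(-SCALE))
{code}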
*Snippet*
{code:python}
import decimal
from time import time

import numpy as np
import pandas as pd

d = dict()
for col in "abcdefghijklmnopqrstuvwxyz":
    d[col] = np.random.rand(int(1E7)) * 100
df = pd.DataFrame(d)

t0 = time()
df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
t1 = time()

df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)

t2 = time()
df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
t3 = time()

print(f"Saving the normal dataframe took {t1-t0:.3f}s, "
      f"with one decimal column {t3-t2:.3f}s")
# Saving the normal dataframe took 4.430s, with one decimal column 17.673s
{code}
> Allow fast writing of Decimal column to parquet
> -----------------------------------------------
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.0
> Reporter: Fons de Leeuw
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)