[
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089069#comment-17089069
]
Jacek Pliszka commented on ARROW-8545:
--------------------------------------
Cast from float should allow quick conversion
> [Python] Allow fast writing of Decimal column to parquet
> --------------------------------------------------------
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.0
> Reporter: Fons de Leeuw
> Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only
> possibility is to use the `decimal.Decimal` standard-libary type. This is
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite
> impressive given that [fastparquet does not write decimals|#data-types]] at
> all. However, the writing is *very* slow, in the code snippet below a factor
> of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can
> be made faster, but I am not familiar enough with pandas internals to know if
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice, if a warning is shown that object-typed columns are being
> converted which is very slow. That would at least make this behavior more
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it
> would be nice if a workaround is possible. For example, pass an int and then
> shift the dot "x" places to the left. (It is already possible to pass an int
> column and specify "decimal" dtype in the Arrow schema during
> `pa.Table.from_pandas()` but then it simply becomes a decimal without
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with
> latitude/longitude. I can't use float as comparisons need to be exact, and
> the BigQuery "clustering" feature needs either an integer or a decimal but
> not a float. In the meantime, I have to do a workaround where I use only ints
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
> d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal
> column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column
> 17.673s{code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)