[
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090385#comment-17090385
]
Joris Van den Bossche commented on ARROW-8545:
----------------------------------------------
As [~jacek.pliszka] says, there are indeed two different parts to this: 1)
converting a pandas DataFrame with decimal objects to a pyarrow.Table, and 2)
writing the pyarrow.Table to parquet.
From a quick timing, the slowdown you see with decimals is almost entirely due
to step 1 (so not related to writing parquet itself).
Using the same dataframe creation as your code above (only using 10x less data
to easily fit on my laptop):
{code:python}
...
df1 = pd.DataFrame(d)
# second dataframe with the decimal column
df2 = df1.copy()
df2["a"] = df2["a"].round(decimals=3).astype(str).map(decimal.Decimal)
# convert each of them to a pyarrow.Table
table1 = pa.table(df1)
table2 = pa.table(df2)
{code}
Timing the conversion to pyarrow.Table:
{code}
In [13]: %timeit pa.table(df1)
32 ms ± 7.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit pa.table(df2)
1.54 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
and then timing the writing of the pyarrow.Table to parquet:
{code}
In [16]: import pyarrow.parquet as pq
In [17]: %timeit pq.write_table(table1, "/tmp/testabc.parquet")
710 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit pq.write_table(table2, "/tmp/testabc.parquet")
750 ms ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
and timing {{to_parquet()}} more or less gives the sum of the two steps above:
{code}
In [20]: %timeit df1.to_parquet("/tmp/testabc.pq", engine="pyarrow")
793 ms ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [21]: %timeit df2.to_parquet("/tmp/testabc.pq", engine="pyarrow")
2.01 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
So you can see that the actual writing to parquet is only slightly slower, and
that the large slowdown comes from converting the Python Decimal objects to a
pyarrow decimal column.
Of course, when starting from a pandas DataFrame to write to parquet, this
conversion of pandas to pyarrow.Table is part of the overall process. But, to
improve this, I think the only solution is to _not_ use python
{{decimal.Decimal}} objects in an object-dtype column.
Some options for this:
* Do the casting to decimal on the pyarrow side. However, as [~jacek.pliszka]
linked, this is not yet implemented for floats (ARROW-8557). I am not sure if
other conversions are possible right now in pyarrow (like passing ints and
converting those to decimal with a scale factor).
* Use a pandas ExtensionDtype to store decimals in a pandas DataFrame
differently (not as python objects). I am not aware of an existing project that
already does this (except for Fletcher, which experiments with storing arrow
types in pandas dataframes in general).
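A rough sketch of the int-scaling workaround (keeping the data in numpy by storing scaled integers instead of creating Decimal objects; the {{a_scale}} metadata key is made up for this example, not any Arrow convention):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": np.random.rand(1_000) * 100})

# stay in numpy: store integer "milli-units" instead of Decimal objects
scaled = (df["a"] * 1000).round().astype("int64")

table = pa.table({"a": pa.array(scaled.to_numpy())})
# record the scale factor so a reader can reconstruct the decimal values
table = table.replace_schema_metadata({b"a_scale": b"3"})
pq.write_table(table, "/tmp/test_scaled.parquet")
{code}
The write then goes through the fast integer path; the cost is that readers have to know about (and undo) the scaling themselves.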
It might be that this Python Decimal object -> pyarrow decimal array conversion
is not fully optimized; however, since it involves dealing with a numpy array
of Python objects, it will never be as fast as converting a numpy float array
to pyarrow.
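A micro-benchmark sketch of that difference (the size and the {{decimal128(10, 3)}} type are arbitrary choices for the example):
{code:python}
import decimal
import time

import numpy as np
import pyarrow as pa

n = 100_000
floats = np.random.rand(n) * 100
# object column: one Python Decimal per value
decimals = [decimal.Decimal(f"{x:.3f}") for x in floats]

t0 = time.perf_counter()
arr_float = pa.array(floats)  # bulk conversion from a numpy float array
t_float = time.perf_counter() - t0

t0 = time.perf_counter()
arr_dec = pa.array(decimals, type=pa.decimal128(10, 3))  # per-object conversion
t_dec = time.perf_counter() - t0

print(f"float: {t_float:.4f}s  decimal: {t_dec:.4f}s")
{code}
On a typical machine the Decimal path comes out one to two orders of magnitude slower, which is consistent with the {{%timeit}} gap shown above.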
> [Python] Allow fast writing of Decimal column to parquet
> --------------------------------------------------------
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.0
> Reporter: Fons de Leeuw
> Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only
> possibility is to use the `decimal.Decimal` standard-library type. This is
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite
> impressive given that fastparquet does not write decimals at all. However,
> the writing is *very* slow, in the code snippet below by a factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can
> be made faster, but I am not familiar enough with pandas internals to know if
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice, if a warning is shown that object-typed columns are being
> converted which is very slow. That would at least make this behavior more
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it
> would be nice if a workaround is possible. For example, pass an int and then
> shift the dot "x" places to the left. (It is already possible to pass an int
> column and specify "decimal" dtype in the Arrow schema during
> `pa.Table.from_pandas()` but then it simply becomes a decimal without
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with
> latitude/longitude. I can't use float as comparisons need to be exact, and
> the BigQuery "clustering" feature needs either an integer or a decimal but
> not a float. In the meantime, I have to do a workaround where I use only ints
> (the original number multiplied by 1000.)
> *Snippet*
> {code:python}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)