Florian Jetter created ARROW-5089:
-------------------------------------

             Summary: [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size
                 Key: ARROW-5089
                 URL: https://issues.apache.org/jira/browse/ARROW-5089
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Florian Jetter


Currently, there is a workaround in place for writing dictionary encoded columns to Parquet.

The workaround converts the dictionary encoded array to its plain representation before writing to Parquet. This is painfully slow since the entire array is converted over and over again, once for every row group.
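A rough sketch of what that per-row-group conversion amounts to (illustration only, assuming a current pyarrow where casting a dictionary array to its value type is supported):
{code}
import pyarrow as pa

# A dictionary encoded column, as produced e.g. by a pandas categorical
dict_arr = pa.array(["A", "B"] * 100000).dictionary_encode()

# Materializing the plain array copies every value; the current workaround
# effectively repeats this full copy for each row group that is written
plain_arr = dict_arr.cast(pa.string())
{code}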

The following example is orders of magnitude slower than the non-dict encoded 
version:
{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# 200,000 rows with a two-value categorical (dictionary encoded) column
df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()

# chunk_size=100 produces 2,000 row groups; the dictionary column is
# converted to its plain representation for every one of them
pq.write_table(
    table,
    buf,
    chunk_size=100,
)
{code}
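For comparison, a minimal sketch (reusing {{table}}, {{pa}}, {{pq}} and {{pd}} from the example above) that times the dictionary encoded write against the same data written as plain strings; it should reproduce the gap described above:
{code}
import time

# Same data without dictionary encoding
df_plain = pd.DataFrame({"col": ["A", "B"] * 100000})
table_plain = pa.Table.from_pandas(df_plain)

for label, t in [("dictionary", table), ("plain", table_plain)]:
    start = time.perf_counter()
    pq.write_table(t, pa.BufferOutputStream(), chunk_size=100)
    print(label, round(time.perf_counter() - start, 3))
{code}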



