Joris Van den Bossche updated ARROW-5089:
-----------------------------------------
    Labels: parquet performance  (was: performance)

> [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-5089
>                 URL: https://issues.apache.org/jira/browse/ARROW-5089
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Major
>              Labels: parquet, performance
>
> Currently, a workaround is in place for writing dictionary-encoded columns to Parquet: the dictionary-encoded array is converted to its plain (dense) representation before writing. This is painfully slow because the entire array is converted anew for every row group.
> The following example is orders of magnitude slower than the non-dictionary-encoded version:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
>
> # A two-value categorical column, 200,000 rows in total
> df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
>
> # chunk_size=100 forces many small row groups; each one currently
> # re-converts the full dictionary-encoded column before writing
> pq.write_table(
>     table,
>     buf,
>     chunk_size=100,
> )
> {code}
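> A minimal timing sketch along these lines (the {{time_write}} helper is purely illustrative, not part of the report) should make the gap visible by writing the same data once as a plain string column and once dictionary-encoded:
> {code}
> import time
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def time_write(table, chunk_size=100):
>     """Write `table` to an in-memory buffer, returning elapsed seconds."""
>     buf = pa.BufferOutputStream()
>     start = time.perf_counter()
>     pq.write_table(table, buf, chunk_size=chunk_size)
>     return time.perf_counter() - start
>
>
> df = pd.DataFrame({"col": ["A", "B"] * 100000})
>
> # Same data, plain string column vs. dictionary-encoded (categorical)
> plain = pa.Table.from_pandas(df)
> encoded = pa.Table.from_pandas(df.astype("category"))
>
> print(f"plain:      {time_write(plain):.3f}s")
> print(f"dictionary: {time_write(encoded):.3f}s")
> {code}
> With 200,000 rows and {{chunk_size=100}}, the writer emits roughly 2,000 row groups; on an affected version each one pays the full conversion cost again, so the dictionary-encoded write should come out dramatically slower.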