[
https://issues.apache.org/jira/browse/ARROW-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-3564:
--------------------------------
Summary: [Python] writing version 2.0 parquet format with dictionary
encoding enabled (was: pyarrow: writing version 2.0 parquet format with
dictionary encoding enabled)
> [Python] writing version 2.0 parquet format with dictionary encoding enabled
> ----------------------------------------------------------------------------
>
> Key: ARROW-3564
> URL: https://issues.apache.org/jira/browse/ARROW-3564
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.11.0
> Reporter: Hatem Helal
> Priority: Major
> Labels: parquet
> Fix For: 0.13.0
>
> Attachments: example_v1.0_dict_False.parquet,
> example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet,
> example_v2.0_dict_True.parquet, pyarrow_repro.py
>
>
> Using pyarrow v0.11.0, the following script writes a simple table (lifted
> from the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to
> both parquet format versions 1.0 and 2.0, with and without dictionary
> encoding enabled.
>
> import pyarrow.parquet as pq
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import itertools
>
> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
>                    'two': ['foo', 'bar', 'baz'],
>                    'three': [True, False, True]},
>                   index=list('abc'))
>
> table = pa.Table.from_pandas(df)
>
> use_dict = [True, False]
> version = ['1.0', '2.0']
>
> for tf, v in itertools.product(use_dict, version):
>     filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
>     pq.write_table(table, filename, use_dictionary=tf, version=v)
>
> Inspecting the written files using
> [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools]
> appears to show that dictionary encoding is not used in either of the
> version 2.0 files. Both files report that the columns are encoded using
> {{PLAIN,RLE}} and that the dictionary page offset is zero. I was expecting
> that the column encoding would include {{RLE_DICTIONARY}}. Attached are the
> script with repro steps and the files that were generated by it.
> Below is the output of {{parquet-tools meta}} for the two version 2.0 files.
>
> version='2.0', use_dictionary=True:
>
> % parquet-tools meta example_v2.0_dict_True.parquet
> file: file:.../example_v2.0_dict_True.parquet
> creator: parquet-cpp version 1.5.1-SNAPSHOT
> extra: pandas = {"pandas_version": "0.23.4", "index_columns":
> ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one",
> "name": "one", "numpy_type": "float64", "pandas_type": "float64"},
> {"metadata": null, "field_name": "three", "name": "three", "numpy_type":
> "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two",
> "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata":
> null, "field_name": "__index_level_0__", "name": null, "numpy_type":
> "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null,
> "field_name": null, "name": null, "numpy_type": "object", "pandas_type":
> "bytes"}]}
>
> file schema: schema
> --------------------------------------------------------------------------------
> one: OPTIONAL DOUBLE R:0 D:1
> three: OPTIONAL BOOLEAN R:0 D:1
> two: OPTIONAL BINARY R:0 D:1
> __index_level_0__: OPTIONAL BINARY R:0 D:1
>
> row group 1: RC:3 TS:211 OFFSET:4
> --------------------------------------------------------------------------------
> one: DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
> three: BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
> two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
> __index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
>
> version='2.0', use_dictionary=False:
>
> % parquet-tools meta example_v2.0_dict_False.parquet
> file: file:.../example_v2.0_dict_False.parquet
> creator: parquet-cpp version 1.5.1-SNAPSHOT
> extra: pandas = {"pandas_version": "0.23.4", "index_columns":
> ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one",
> "name": "one", "numpy_type": "float64", "pandas_type": "float64"},
> {"metadata": null, "field_name": "three", "name": "three", "numpy_type":
> "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two",
> "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata":
> null, "field_name": "__index_level_0__", "name": null, "numpy_type":
> "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null,
> "field_name": null, "name": null, "numpy_type": "object", "pandas_type":
> "bytes"}]}
>
> file schema: schema
> --------------------------------------------------------------------------------
> one: OPTIONAL DOUBLE R:0 D:1
> three: OPTIONAL BOOLEAN R:0 D:1
> two: OPTIONAL BINARY R:0 D:1
> __index_level_0__: OPTIONAL BINARY R:0 D:1
>
> row group 1: RC:3 TS:211 OFFSET:4
> --------------------------------------------------------------------------------
> one: DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
> three: BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
> two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
> __index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
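
The check described in the report (does any column list a dictionary encoding in its ENC field?) can be automated over saved `parquet-tools meta` output. What follows is only a rough sketch, not part of the attached repro script. The names `column_encodings`, `uses_dictionary` and `SAMPLE` are made up for illustration, and the regular expression assumes each column chunk is printed on a single line with `DO:` and `ENC:` fields, as in the output quoted above. The two sample lines are copied from that output.

```python
import re

# Parquet encodings that mark dictionary-encoded data pages
# (PLAIN_DICTIONARY in format 1.0, RLE_DICTIONARY in format 2.0).
DICT_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}

# One column-chunk line of `parquet-tools meta`, for example:
#   two: BINARY SNAPPY DO:0 FPO:225 ... ENC:PLAIN,RLE ST:[...]
# Assumption: the whole column chunk is on a single line.
COLUMN_LINE = re.compile(
    r"^(?P<name>\S+):\s.*\bDO:(?P<dict_offset>\d+)\s.*\bENC:(?P<enc>[A-Z_,]+)"
)


def column_encodings(meta_text):
    """Map column name -> (dictionary page offset, set of encodings)."""
    columns = {}
    for line in meta_text.splitlines():
        match = COLUMN_LINE.match(line.strip())
        if match:
            columns[match.group("name")] = (
                int(match.group("dict_offset")),
                set(match.group("enc").split(",")),
            )
    return columns


def uses_dictionary(meta_text):
    """Map column name -> True if a dictionary encoding is reported."""
    return {
        name: bool(encodings & DICT_ENCODINGS)
        for name, (_, encodings) in column_encodings(meta_text).items()
    }


# Two column lines taken from the version 2.0 output in the report.
SAMPLE = """\
one: DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
"""

print(uses_dictionary(SAMPLE))  # {'one': False, 'two': False}
```

Schema lines and the "row group" header carry no `DO:` or `ENC:` field, so the pattern skips them. Run against the version 1.0 files, the same helper would show whether those report a dictionary encoding, which gives a baseline for the comparison.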