jorisvandenbossche commented on a change in pull request #11724:
URL: https://github.com/apache/arrow/pull/11724#discussion_r751189769



##########
File path: python/pyarrow/parquet.py
##########
@@ -560,6 +560,15 @@ def _sanitize_table(table, new_schema, flavor):
     enabled, then dictionary is preferred.
     The byte_stream_split encoding is valid only for floating-point data types
     and should be combined with a compression codec.
+col_encoding : dict, default None

Review comment:
       I was thinking we could maybe also use the full `column_encoding`? 
(That's only a bit longer.)

##########
File path: python/pyarrow/tests/parquet/test_basic.py
##########
@@ -352,6 +352,37 @@ def test_byte_stream_split(use_legacy_dataset):
                          use_legacy_dataset=use_legacy_dataset)
 
 
+@parametrize_legacy_dataset
+def test_col_encoding(use_legacy_dataset):
+    arr_float = pa.array(list(map(float, range(100))))
+    arr_int = pa.array(list(map(int, range(100))))
+    mixed_table = pa.Table.from_arrays([arr_float, arr_int],
+                                       names=['a', 'b'])
+
+    # Check NONE col_encoding.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding=None, use_legacy_dataset=use_legacy_dataset)
+
+    # Check "BYTE_STREAM_SPLIT" for column 'a' and "PLAIN" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding={'a': "BYTE_STREAM_SPLIT", 'b': "PLAIN"},
+                     use_legacy_dataset=use_legacy_dataset)
+
+    # Check "RLE" for column 'a' and "BYTE_STREAM_SPLIT" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table,
+                     use_byte_stream_split=['a'],
+                     col_encoding={'a': "RLE", 'b': "BYTE_STREAM_SPLIT"},

Review comment:
       In this case we are overriding `use_byte_stream_split`, I think 
(since `col_encoding` is handled after `use_byte_stream_split`)? In that case 
we should maybe raise an error saying that you can't set both at the same time?
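
A minimal sketch of the kind of guard I have in mind (the helper name 
`_validate_encoding_args` and its placement are assumptions on my side, not 
the actual implementation):

```python
def _validate_encoding_args(use_byte_stream_split, col_encoding):
    # Hypothetical check: refuse the ambiguous combination instead of
    # silently letting col_encoding override use_byte_stream_split.
    if use_byte_stream_split and col_encoding:
        raise ValueError(
            "Cannot pass both 'use_byte_stream_split' and 'col_encoding'; "
            "use col_encoding={'col': 'BYTE_STREAM_SPLIT'} instead.")
```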

##########
File path: python/pyarrow/tests/parquet/test_basic.py
##########
@@ -352,6 +352,37 @@ def test_byte_stream_split(use_legacy_dataset):
                          use_legacy_dataset=use_legacy_dataset)
 
 
+@parametrize_legacy_dataset
+def test_col_encoding(use_legacy_dataset):
+    arr_float = pa.array(list(map(float, range(100))))
+    arr_int = pa.array(list(map(int, range(100))))
+    mixed_table = pa.Table.from_arrays([arr_float, arr_int],
+                                       names=['a', 'b'])
+
+    # Check NONE col_encoding.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding=None, use_legacy_dataset=use_legacy_dataset)
+
+    # Check "BYTE_STREAM_SPLIT" for column 'a' and "PLAIN" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding={'a': "BYTE_STREAM_SPLIT", 'b': "PLAIN"},
+                     use_legacy_dataset=use_legacy_dataset)
+
+    # Check "RLE" for column 'a' and "BYTE_STREAM_SPLIT" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table,
+                     use_byte_stream_split=['a'],
+                     col_encoding={'a': "RLE", 'b': "BYTE_STREAM_SPLIT"},
+                     use_legacy_dataset=use_legacy_dataset)
+
+    # Check "DELTA_BINARY_PACKED" col_encoding for column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table,
+                     use_dictionary=['b'],
+                     col_encoding={'b': "DELTA_BINARY_PACKED"},

Review comment:
       This specific case will also not actually have an effect, right? Since 
`use_dictionary=['b']` will take priority over `col_encoding={'b': ...}`?
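
For illustration, a variant that should actually exercise the encoding, 
assuming dictionary encoding has to be disabled for a column before 
`col_encoding` applies to it:

```python
# Keep dictionary encoding only for 'a', so that the requested
# DELTA_BINARY_PACKED encoding can take effect on 'b'.
_check_roundtrip(mixed_table, expected=mixed_table,
                 use_dictionary=['a'],
                 col_encoding={'b': "DELTA_BINARY_PACKED"},
                 use_legacy_dataset=use_legacy_dataset)
```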

##########
File path: python/pyarrow/_parquet.pyx
##########
@@ -879,6 +879,15 @@ cdef encoding_name_from_enum(ParquetEncoding encoding_):
         ParquetEncoding_BYTE_STREAM_SPLIT: 'BYTE_STREAM_SPLIT',
     }.get(encoding_, 'UNKNOWN')
 
+cdef encoding_enum_from_name(str encoding_name):

Review comment:
       Small code style nit: can you leave two blank lines between 
functions/definitions? So one more above and below this object (you will see 
that the other definitions here do the same).
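
For concreteness, the spacing I mean (a plain-Python sketch with names 
borrowed from the diff; PEP 8 asks for two blank lines around top-level 
definitions):

```python
def encoding_name_from_enum(encoding_):
    ...


def encoding_enum_from_name(encoding_name):
    ...
```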

##########
File path: python/pyarrow/tests/parquet/test_basic.py
##########
@@ -352,6 +352,37 @@ def test_byte_stream_split(use_legacy_dataset):
                          use_legacy_dataset=use_legacy_dataset)
 
 
+@parametrize_legacy_dataset
+def test_col_encoding(use_legacy_dataset):
+    arr_float = pa.array(list(map(float, range(100))))
+    arr_int = pa.array(list(map(int, range(100))))
+    mixed_table = pa.Table.from_arrays([arr_float, arr_int],
+                                       names=['a', 'b'])
+
+    # Check NONE col_encoding.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding=None, use_legacy_dataset=use_legacy_dataset)
+
+    # Check "BYTE_STREAM_SPLIT" for column 'a' and "PLAIN" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table, use_dictionary=False,
+                     col_encoding={'a': "BYTE_STREAM_SPLIT", 'b': "PLAIN"},
+                     use_legacy_dataset=use_legacy_dataset)
+
+    # Check "RLE" for column 'a' and "BYTE_STREAM_SPLIT" col_encoding for
+    # column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table,
+                     use_byte_stream_split=['a'],
+                     col_encoding={'a': "RLE", 'b': "BYTE_STREAM_SPLIT"},
+                     use_legacy_dataset=use_legacy_dataset)
+
+    # Check "DELTA_BINARY_PACKED" col_encoding for column 'b'.
+    _check_roundtrip(mixed_table, expected=mixed_table,
+                     use_dictionary=['b'],
+                     col_encoding={'b': "DELTA_BINARY_PACKED"},
+                     use_legacy_dataset=use_legacy_dataset)
+

Review comment:
       Can you add tests for some failure cases (asserting they raise a proper 
error with `pytest.raises`, see the test above or below): 1) passing an unknown 
encoding in the dict, 2) passing an unknown column name in the dict, and 3) not 
passing
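
Something along these lines for the first two, for example (a sketch; the 
exact exception types raised are an assumption):

```python
import pytest

# 1) Unknown encoding name in the dict.
with pytest.raises(ValueError):
    _check_roundtrip(mixed_table, expected=mixed_table,
                     use_dictionary=False,
                     col_encoding={'a': "MADE_UP_ENCODING"},
                     use_legacy_dataset=use_legacy_dataset)

# 2) Unknown column name in the dict.
with pytest.raises(KeyError):
    _check_roundtrip(mixed_table, expected=mixed_table,
                     use_dictionary=False,
                     col_encoding={'unknown': "PLAIN"},
                     use_legacy_dataset=use_legacy_dataset)
```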

##########
File path: python/pyarrow/parquet.py
##########
@@ -560,6 +560,15 @@ def _sanitize_table(table, new_schema, flavor):
     enabled, then dictionary is preferred.
     The byte_stream_split encoding is valid only for floating-point data types
     and should be combined with a compression codec.
+col_encoding : dict, default None
+    Specify the encoding scheme on a per column basis.
+    Valid values: {'PLAIN', 'BIT_PACKED', 'RLE', 'BYTE_STREAM_SPLIT',
+    'DELTA_BINARY_PACKED', 'DELTA_BYTE_ARRAY'}
+    Unsupported encodings: DELTA_LENGTH_BYTE_ARRAY, PLAIN_DICTIONARY and
+    RLE_DICTIONARY. Last two options are already used by default.
+    Certain encodings are only compatible with certain data types.
+    Please refer to the encodings section of Reading and writing Parquet
+    files <https://arrow.apache.org/docs/cpp/parquet.html#encodings>_.

Review comment:
   ```suggestion
    files `<https://arrow.apache.org/docs/cpp/parquet.html#encodings>`__.
   ```
   
   (reStructuredText needs the backticks around it.)



