[
https://issues.apache.org/jira/browse/ARROW-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248696#comment-16248696
]
ASF GitHub Bot commented on ARROW-1787:
---------------------------------------
wesm closed pull request #1298: ARROW-1787: [Python] Support reading parquet
files into DataFrames in a backward compatible way
URL: https://github.com/apache/arrow/pull/1298
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 87b47b8a6..db28ee09e 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -18,6 +18,7 @@
import ast
import collections
import json
+import re
import numpy as np
import pandas as pd
@@ -353,6 +354,14 @@ def make_datetimetz(tz):
return DatetimeTZDtype('ns', tz=tz)
+def backwards_compatible_index_name(raw_name, logical_name):
+ pattern = r'^__index_level_\d+__$'
+ if raw_name == logical_name and re.match(pattern, raw_name) is not None:
+ return None
+ else:
+ return logical_name
+
+
def table_to_blockmanager(options, table, memory_pool, nthreads=1):
import pandas.core.internals as _int
import pyarrow.lib as lib
@@ -394,7 +403,9 @@ def table_to_blockmanager(options, table, memory_pool,
nthreads=1):
values = values.copy()
index_arrays.append(pd.Series(values, dtype=col_pandas.dtype))
- index_names.append(logical_name)
+ index_names.append(
+ backwards_compatible_index_name(raw_name, logical_name)
+ )
block_table = block_table.remove_column(
block_table.schema.get_field_index(raw_name)
)
diff --git a/python/pyarrow/tests/data/v0.7.1.all-named-index.parquet
b/python/pyarrow/tests/data/v0.7.1.all-named-index.parquet
new file mode 100644
index 000000000..e9efd9b39
Binary files /dev/null and
b/python/pyarrow/tests/data/v0.7.1.all-named-index.parquet differ
diff --git a/python/pyarrow/tests/data/v0.7.1.parquet
b/python/pyarrow/tests/data/v0.7.1.parquet
new file mode 100644
index 000000000..44670bcd1
Binary files /dev/null and b/python/pyarrow/tests/data/v0.7.1.parquet differ
diff --git a/python/pyarrow/tests/data/v0.7.1.some-named-index.parquet
b/python/pyarrow/tests/data/v0.7.1.some-named-index.parquet
new file mode 100644
index 000000000..34097ca12
Binary files /dev/null and
b/python/pyarrow/tests/data/v0.7.1.some-named-index.parquet differ
diff --git a/python/pyarrow/tests/test_parquet.py
b/python/pyarrow/tests/test_parquet.py
index e2e6863c4..6ba4fd2fa 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1458,3 +1458,76 @@ def test_index_column_name_duplicate(tmpdir):
arrow_table = _read_table(path)
result_df = arrow_table.to_pandas()
tm.assert_frame_equal(result_df, dfx)
+
+
+def test_backwards_compatible_index_naming():
+ expected_string = b"""\
+carat cut color clarity depth table price x y z
+ 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
+ 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
+ 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
+ 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
+ 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
+ 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
+ 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
+ 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
+ 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
+ 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39"""
+ expected = pd.read_csv(
+ io.BytesIO(expected_string), sep=r'\s{2,}', index_col=None, header=0
+ )
+ path = os.path.join(os.path.dirname(__file__), 'data', 'v0.7.1.parquet')
+ t = _read_table(path)
+ result = t.to_pandas()
+ tm.assert_frame_equal(result, expected)
+
+
+def test_backwards_compatible_index_multi_level_named():
+ expected_string = b"""\
+carat cut color clarity depth table price x y z
+ 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
+ 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
+ 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
+ 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
+ 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
+ 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
+ 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
+ 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
+ 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
+ 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39"""
+ expected = pd.read_csv(
+ io.BytesIO(expected_string),
+ sep=r'\s{2,}', index_col=['cut', 'color', 'clarity'], header=0
+ ).sort_index()
+ path = os.path.join(
+ os.path.dirname(__file__), 'data', 'v0.7.1.all-named-index.parquet'
+ )
+ t = _read_table(path)
+ result = t.to_pandas()
+ tm.assert_frame_equal(result, expected)
+
+
+def test_backwards_compatible_index_multi_level_some_named():
+ expected_string = b"""\
+carat cut color clarity depth table price x y z
+ 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
+ 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
+ 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
+ 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
+ 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
+ 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
+ 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
+ 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
+ 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
+ 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39"""
+ expected = pd.read_csv(
+ io.BytesIO(expected_string),
+ sep=r'\s{2,}', index_col=['cut', 'color', 'clarity'], header=0
+ ).sort_index()
+ expected.index = expected.index.set_names(['cut', None, 'clarity'])
+ path = os.path.join(
+ os.path.dirname(__file__), 'data', 'v0.7.1.some-named-index.parquet'
+ )
+ t = _read_table(path)
+ result = t.to_pandas()
+ tm.assert_frame_equal(result, expected)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [Python] Support reading parquet files into DataFrames in a backward
> compatible way
> -----------------------------------------------------------------------------------
>
> Key: ARROW-1787
> URL: https://issues.apache.org/jira/browse/ARROW-1787
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Phillip Cloud
> Assignee: Phillip Cloud
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)