[ https://issues.apache.org/jira/browse/ARROW-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350694#comment-16350694 ]

ASF GitHub Bot commented on ARROW-1754:
---------------------------------------

wesm closed pull request #1408: ARROW-1754: [Python] alternative fix for 
duplicate index/column name that preserves index name if available
URL: https://github.com/apache/arrow/pull/1408

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 4a30fb3b4..240cccdaf 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -179,10 +179,8 @@ def get_column_metadata(column, name, arrow_type, field_name):
     }
 
 
-index_level_name = '__index_level_{:d}__'.format
-
-
-def construct_metadata(df, column_names, index_levels, preserve_index, types):
+def construct_metadata(df, column_names, index_levels, index_column_names,
+                       preserve_index, types):
     """Returns a dictionary containing enough metadata to reconstruct a pandas
     DataFrame as an Arrow Table, including index columns.
 
@@ -197,9 +195,8 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     -------
     dict
     """
-    ncolumns = len(column_names)
-    df_types = types[:ncolumns - len(index_levels)]
-    index_types = types[ncolumns - len(index_levels):]
+    df_types = types[:-len(index_levels)]
+    index_types = types[-len(index_levels):]
 
     column_metadata = [
         get_column_metadata(
@@ -213,9 +210,6 @@ def construct_metadata(df, column_names, index_levels, preserve_index, types):
     ]
 
     if preserve_index:
-        index_column_names = list(map(
-            index_level_name, range(len(index_levels))
-        ))
         index_column_metadata = [
             get_column_metadata(
                 level,
@@ -294,9 +288,29 @@ def _column_name_to_strings(name):
     return str(name)
 
 
+def _index_level_name(index, i, column_names):
+    """Return the name of an index level or a default name if `index.name` is
+    None or is already a column name.
+
+    Parameters
+    ----------
+    index : pandas.Index
+    i : int
+
+    Returns
+    -------
+    name : str
+    """
+    if index.name is not None and index.name not in column_names:
+        return index.name
+    else:
+        return '__index_level_{:d}__'.format(i)
+
+
 def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
-    names = []
+    column_names = []
     index_columns = []
+    index_column_names = []
     type = None
 
     if preserve_index:
@@ -324,12 +338,13 @@ def dataframe_to_arrays(df, schema, preserve_index, nthreads=1):
 
         columns_to_convert.append(col)
         convert_types.append(type)
-        names.append(name)
+        column_names.append(name)
 
     for i, column in enumerate(index_columns):
         columns_to_convert.append(column)
         convert_types.append(None)
-        names.append(index_level_name(i))
+        name = _index_level_name(column, i, column_names)
+        index_column_names.append(name)
 
     # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
     # using a thread pool is worth it. Currently the heuristic is whether the
@@ -358,8 +373,10 @@ def convert_column(col, ty):
     types = [x.type for x in arrays]
 
     metadata = construct_metadata(
-        df, names, index_columns, preserve_index, types
+        df, column_names, index_columns, index_column_names, preserve_index,
+        types
     )
+    names = column_names + index_column_names
     return names, arrays, metadata
 
 
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index ca2f1e361..f1f40a695 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -191,8 +191,9 @@ def test_index_metadata_field_name(self):
         assert idx0['field_name'] == idx0_name
         assert idx0['name'] is None
 
-        assert foo_name == '__index_level_1__'
-        assert foo['name'] == 'foo'
+        assert foo_name == 'foo'
+        assert foo['field_name'] == foo_name
+        assert foo['name'] == foo_name
 
     def test_categorical_column_index(self):
         df = pd.DataFrame(
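
For illustration, here is a minimal sketch of the naming rule this
change introduces (not part of the PR; the data and names are made
up, and it assumes pyarrow 0.8.0 or later, where the fix landed):

    import pandas as pd
    import pyarrow as pa

    # Index name does not collide with any column name: the index
    # keeps 'idx' as its Arrow field name.
    df1 = pd.DataFrame({'a': [1, 2]}, index=pd.Index([3, 4], name='idx'))
    t1 = pa.Table.from_pandas(df1)
    assert [f.name for f in t1.schema] == ['a', 'idx']

    # Index name collides with column 'a': the field name falls back
    # to '__index_level_0__', but the pandas metadata still records
    # 'a' as the index name, so the roundtrip restores it.
    df2 = pd.DataFrame({'a': [1, 2]}, index=pd.Index([3, 4], name='a'))
    t2 = pa.Table.from_pandas(df2)
    assert [f.name for f in t2.schema] == ['a', '__index_level_0__']
    assert t2.to_pandas().index.name == 'a'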

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Fix buggy Parquet roundtrip when an index name is the same as a 
> column name
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-1754
>                 URL: https://issues.apache.org/jira/browse/ARROW-1754
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Phillip Cloud
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> See upstream report 
> https://stackoverflow.com/questions/47013052/issue-with-pyarrow-when-loading-parquet-file-where-index-has-redundant-column
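
For context, the reported failure boils down to a Parquet roundtrip in
which the index name duplicates a column name. A sketch of the
scenario (file path and names are illustrative; before this fix the
read side could fail on the duplicated name, after it the roundtrip is
clean):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Index named 'foo' collides with the 'foo' column.
    df = pd.DataFrame({'foo': [1, 2, 3]},
                      index=pd.Index(['a', 'b', 'c'], name='foo'))
    pq.write_table(pa.Table.from_pandas(df), 'example.parquet')

    # With the fix, both the column and the named index survive.
    restored = pq.read_table('example.parquet').to_pandas()
    assert list(restored.columns) == ['foo']
    assert restored.index.name == 'foo'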



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
