[ https://issues.apache.org/jira/browse/ARROW-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340192#comment-16340192 ]

ASF GitHub Bot commented on ARROW-1961:
---------------------------------------

xhochy closed pull request #1511: ARROW-1961: [Python] Preserve pre-existing schema metadata in Parquet files when passing flavor='spark'
URL: https://github.com/apache/arrow/pull/1511
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 151e0df8a..3a0924a27 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -215,7 +215,9 @@ def _sanitize_schema(schema, flavor):
                 sanitized_fields.append(sanitized_field)
             else:
                 sanitized_fields.append(field)
-        return pa.schema(sanitized_fields), schema_changed
+
+        new_schema = pa.schema(sanitized_fields, metadata=schema.metadata)
+        return new_schema, schema_changed
     else:
         return schema, False
 
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index c2bb31c9b..7c2edb378 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -748,6 +748,28 @@ def test_sanitized_spark_field_names():
     assert result.schema[0].name == expected_name
 
 
+def _roundtrip_pandas_dataframe(df, write_kwargs):
+    table = pa.Table.from_pandas(df)
+
+    buf = io.BytesIO()
+    _write_table(table, buf, **write_kwargs)
+
+    buf.seek(0)
+    table1 = _read_table(buf)
+    return table1.to_pandas()
+
+
+@parquet
+def test_spark_flavor_preserves_pandas_metadata():
+    df = _test_dataframe(size=100)
+    df.index = np.arange(0, 10 * len(df), 10)
+    df.index.name = 'foo'
+
+    result = _roundtrip_pandas_dataframe(df, {'version': '2.0',
+                                              'flavor': 'spark'})
+    tm.assert_frame_equal(result, df)
+
+
 @parquet
 def test_fixed_size_binary():
     t0 = pa.binary(10)
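
For context, here is a minimal sketch of the behavior the fix targets (not part of the PR; the column name and buffer handling are illustrative). With flavor='spark', the b'pandas' entry in the schema metadata should now survive a write/read round trip:

    import io

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A field name with spaces forces the spark-flavor sanitization path.
    df = pd.DataFrame({'a column with spaces': [1, 2, 3]})
    table = pa.Table.from_pandas(df)

    buf = io.BytesIO()
    pq.write_table(table, buf, flavor='spark')

    buf.seek(0)
    result = pq.read_table(buf)

    # Before the fix the sanitized schema dropped its metadata; with the fix
    # the pandas metadata is still attached after the round trip.
    assert b'pandas' in result.schema.metadata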


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Writing Parquet file with flavor='spark' loses pandas schema metadata
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-1961
>                 URL: https://issues.apache.org/jira/browse/ARROW-1961
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> You can see the issue in the {{_sanitize_schema}} method:
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L201
> See also https://github.com/apache/arrow/issues/1452
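
The issue points at the private _sanitize_schema helper; a hypothetical sketch of the symptom at that level (assuming the helper is importable from pyarrow.parquet at this version):

    import pandas as pd
    import pyarrow as pa
    from pyarrow.parquet import _sanitize_schema

    # Table.from_pandas attaches the b'pandas' metadata to the schema.
    df = pd.DataFrame({'field with spaces': [1, 2, 3]})
    schema = pa.Table.from_pandas(df).schema
    assert b'pandas' in schema.metadata

    sanitized, changed = _sanitize_schema(schema, flavor='spark')
    assert changed
    # Before the fix, the sanitized schema came back without metadata; with
    # the fix it carries over schema.metadata, including the b'pandas' entry.
    assert sanitized.metadata is not None
    assert b'pandas' in sanitized.metadata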



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
