[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392375#comment-16392375 ]

ASF GitHub Bot commented on ARROW-1940:
---------------------------------------

wesm commented on a change in pull request #1728: ARROW-1940: [Python] Extra 
metadata gets added after multiple conversions between pd.DataFrame and pa.Table
URL: https://github.com/apache/arrow/pull/1728#discussion_r173361703
 
 

 ##########
 File path: cpp/src/arrow/python/helpers.cc
 ##########
 @@ -116,7 +116,8 @@ static Status InferDecimalPrecisionAndScale(PyObject* python_decimal, int32_t* p
   DCHECK_NE(scale, NULLPTR);

   // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a DCHECK
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", ""));
+  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, const_cast<char*>("as_tuple"),
+                                        const_cast<char*>("")));
 
 Review comment:
   see also the `cpp_PyObject_CallMethod` wrapper for this issue in io.cc
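
   For context, a minimal sketch of what such a const-correct wrapper can look
   like (hypothetical; the actual `cpp_PyObject_CallMethod` helper in io.cc may
   differ in signature and error handling). Older CPython headers, notably
   Python 2.7's, declare `PyObject_CallMethod` with `char*` rather than
   `const char*` parameters, so calling it with string literals from C++
   requires the `const_cast` seen in the diff; a wrapper centralizes that cast
   in one place:

```cpp
#include <Python.h>

// Hypothetical const-correct wrapper around PyObject_CallMethod.
// Older CPython headers take char* for the method name and format
// string, so passing string literals requires a const_cast; doing the
// cast here keeps every call site clean. Any extra arguments are
// forwarded to the underlying varargs call.
template <typename... ArgTypes>
static inline PyObject* cpp_PyObject_CallMethod(PyObject* obj, const char* method_name,
                                                const char* argspec, ArgTypes... args) {
  return PyObject_CallMethod(obj, const_cast<char*>(method_name),
                             const_cast<char*>(argspec), args...);
}
```

   With a wrapper like this in scope, the patched line could instead read
   `OwnedRef as_tuple(cpp_PyObject_CallMethod(python_decimal, "as_tuple", ""));`.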

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1940
>                 URL: https://issues.apache.org/jira/browse/ARROW-1940
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Dima Ryazanov
>            Assignee: Phillip Cloud
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>         Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table a 
> second time, the metadata gains an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> Specifically, the dataframe->table->dataframe->table round trip produces a 
> different result from a single dataframe->table conversion, which I think is 
> unexpected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
