[jira] [Commented] (ARROW-1976) Handling unicode pandas columns on pq.read_table

ASF GitHub Bot (JIRA) Wed, 17 Jan 2018 09:25:15 -0800

    [ 
https://issues.apache.org/jira/browse/ARROW-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329053#comment-16329053
 ]


ASF GitHub Bot commented on ARROW-1976:
---------------------------------------

wesm commented on a change in pull request #1476: ARROW-1976: [Python] Fix Pandas data SerDe with Unicode column names in Python 2.7
URL: https://github.com/apache/arrow/pull/1476#discussion_r162120769
 
 

 ##########
 File path: python/pyarrow/pandas_compat.py
 ##########
 @@ -544,9 +547,14 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1,
 
     column_strings = [x.name for x in block_table.itercolumns()]
     if columns:
-        columns_name_dict = {
-            c.get('field_name', str(c['name'])): c['name'] for c in columns
-        }
+        columns_name_dict = {}
+        for c in columns:
+            column_name = c['name']
+            if not isinstance(column_name, six.text_type):
+                column_name = str(column_name)
 
 Review comment:
   There is also `frombytes` in `pyarrow.compat`
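
   For context, here is a minimal sketch of the kind of normalization being
   discussed, written for Python 3 so that it runs standalone. `to_text` is a
   hypothetical helper, not pyarrow's `frombytes`, and the sample `columns`
   metadata is invented for illustration. The idea is to decode bytes
   explicitly and leave text untouched, rather than calling `str` on every
   name.

```python
def to_text(name, encoding="utf-8"):
    """Return `name` as text without any implicit ASCII round trip.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    if isinstance(name, str):
        # Already text: keep it as-is, including non-ASCII characters.
        return name
    if isinstance(name, bytes):
        # Raw bytes: decode with an explicit codec instead of relying on a default.
        return name.decode(encoding)
    # Non-string labels (for example integer column names) become their repr-like text.
    return str(name)


# Invented stand-in for the pandas metadata stored with the table.
columns = [{"name": "\ufeffUNITID"}, {"name": b"OPEID"}, {"name": 0}]

# Map the normalized text name back to the original label.
columns_name_dict = {to_text(c["name"]): c["name"] for c in columns}
```

   With this shape, a name that starts with a byte order mark passes through
   unchanged instead of raising during the lookup-table build.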

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Handling unicode pandas columns on pq.read_table
> ------------------------------------------------
>
>                 Key: ARROW-1976
>                 URL: https://issues.apache.org/jira/browse/ARROW-1976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Simbarashe Nyatsanga
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Unicode column names in pandas DataFrames aren't handled correctly for some 
> datasets when reading a Parquet file back into a pandas DataFrame under 
> Python 2.7; the read fails with Python's familiar ASCII UnicodeEncodeError.
>  
> The dataset used to get the error is here: 
> https://catalog.data.gov/dataset/college-scorecard
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('college_data.csv')
> {code}
> For verification, the DataFrame's column names are indeed unicode:
> {code}
> df.columns
> > Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
>        u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
>        ...
>        u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
>        u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
>        u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
>       dtype='object', length=123)
> {code}
> The DataFrame can be saved into a parquet file
> {code}
> arrow_table = pa.Table.from_pandas(df)
> pq.write_table(arrow_table, 'college_data.parquet')
> {code}
> But trying to read the parquet file immediately afterwards results in the 
> following
> {code}
> df = pq.read_table('college_data.parquet').to_pandas()
> > ---------------------------------------------------------------------------
> UnicodeEncodeError                        Traceback (most recent call last)
> <ipython-input-29-23906ea1efe3> in <module>()
> ----> 2 df = pq.read_table('college_data.parquet').to_pandas()
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
>    1041         if nthreads is None:
>    1042             nthreads = cpu_count()
> -> 1043         mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>    1044                                              nthreads)
>    1045         return pd.DataFrame(mgr)
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
> {code}
> Looking at the stack trace, the failure appears to come from this line, which 
> calls str; in Python 2, str on a unicode object attempts ASCII encoding by default: 
> https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541
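
The character in the error, u'\ufeff', is the Unicode byte order mark. A
plausible reading, stated here as an assumption because the CSV itself is not
shown, is that the file begins with a UTF-8 BOM which ended up in the first
column name. The sketch below uses invented header bytes and runs on Python 3.
It makes the ASCII step explicit (Python 2's str performs it implicitly on
unicode objects) and shows that decoding with the standard `utf-8-sig` codec
drops the mark.

```python
import codecs

# Invented header bytes: a UTF-8 BOM followed by two column names.
raw = codecs.BOM_UTF8 + b"UNITID,OPEID\n"

# Plain UTF-8 decoding keeps the BOM as the first character of the first name.
header_plain = raw.decode("utf-8").split(",")[0]

# The utf-8-sig codec strips a leading BOM while decoding.
header_sig = raw.decode("utf-8-sig").split(",")[0]

# Encoding to ASCII is what Python 2's str() does to a unicode object.
try:
    header_plain.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

If the BOM is indeed the cause, reading the CSV with a BOM-aware encoding would
sidestep the error for this dataset, although the library should still accept
non-ASCII column names.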



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
