[ https://issues.apache.org/jira/browse/ARROW-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1976.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 1553
[https://github.com/apache/arrow/pull/1553]

> [Python] Handling unicode pandas columns on parquet.read_table
> --------------------------------------------------------------
>
>                 Key: ARROW-1976
>                 URL: https://issues.apache.org/jira/browse/ARROW-1976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Simbarashe Nyatsanga
>            Assignee: Licht Takeuchi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Unicode column names in pandas DataFrames aren't handled correctly for some
> datasets when reading a parquet file into a pandas DataFrame, leading to the
> common Python ASCII encoding error.
>
> The dataset used to get the error is here:
> https://catalog.data.gov/dataset/college-scorecard
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('college_data.csv')
> {code}
> For verification, the DataFrame's columns are indeed unicode:
> {code}
> df.columns
>
> Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
>        u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
>        ...
>        u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
>        u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
>        u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
>       dtype='object', length=123)
> {code}
> The DataFrame can be saved into a parquet file:
> {code}
> arrow_table = pa.Table.from_pandas(df)
> pq.write_table(arrow_table, 'college_data.parquet')
> {code}
> But trying to read the parquet file immediately afterwards results in the
> following:
> {code}
> df = pq.read_table('college_data.parquet').to_pandas()
>
> ---------------------------------------------------------------------------
> UnicodeEncodeError                        Traceback (most recent call last)
> <ipython-input-29-23906ea1efe3> in <module>()
> ----> 2 df = pq.read_table('college_data.parquet').to_pandas()
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
>    1041         if nthreads is None:
>    1042             nthreads = cpu_count()
> -> 1043         mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>    1044                                               nthreads)
>    1045         return pd.DataFrame(mgr)
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in <dictcomp>((c,))
>     539     if columns:
>     540         columns_name_dict = {
> --> 541             c.get('field_name', str(c['name'])): c['name'] for c in columns
>     542         }
>     543         columns_values = [
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
> position 0: ordinal not in range(128)
> {code}
> Looking at the stack trace, the cause appears to be this line, which calls
> str on the column name; in Python 2, str on a unicode value tries to encode
> it as ASCII by default:
> https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
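
The character in the error message, u'\ufeff', is the Unicode byte order mark (BOM), and position 0 suggests that it is attached to the first column name. The sketch below is one way to reproduce that situation with only the standard library; it is not the change made in the pull request. It assumes the CSV file starts with a UTF-8 BOM, which the report does not confirm. RAW_CSV, read_header and py2_style_str are hypothetical names introduced for illustration, and py2_style_str only emulates the implicit ASCII encoding that Python 2's str() applies to a unicode value.

```python
import csv
import io

# Hypothetical stand-in for the first bytes of college_data.csv:
# a UTF-8 byte order mark followed by a header row and one data row.
RAW_CSV = b"\xef\xbb\xbfUNITID,OPEID\n100654,00100200\n"


def read_header(raw, encoding):
    """Decode raw CSV bytes with the given encoding and return the header row."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return next(csv.reader(text))


def py2_style_str(name):
    """Emulate Python 2's str() on a unicode value: encode it as ASCII."""
    return name.encode("ascii")


# The plain 'utf-8' codec keeps the BOM as the first character of the text,
# while 'utf-8-sig' strips it during decoding.
header_plain = read_header(RAW_CSV, "utf-8")
header_sig = read_header(RAW_CSV, "utf-8-sig")

# Encoding the BOM-prefixed name as ASCII fails, as in the traceback.
try:
    py2_style_str(header_plain[0])
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The BOM-free name is pure ASCII and encodes without error.
ascii_name = py2_style_str(header_sig[0])
```

If the BOM assumption holds, decoding the file with the utf-8-sig codec (for example by passing encoding='utf-8-sig' to pd.read_csv) would keep the stray character out of the column names on the user's side, independently of any fix in pyarrow itself.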