Diego Argueta created ARROW-4883:
------------------------------------

             Summary: [Python] read_csv() gives mojibake if given file object in text mode
                 Key: ARROW-4883
                 URL: https://issues.apache.org/jira/browse/ARROW-4883
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.1
         Environment: Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.6 (High Sierra)
            Reporter: Diego Argueta


h1. Summary:

Python 3:

* {{read_csv}} returns mojibake if given file objects opened in text mode. It 
behaves as expected in binary mode.
* Files encoded in anything other than valid UTF-8 will cause a crash.

Python 2:

{{read_csv}} only handles ASCII files. If given a file in UTF-8 with characters 
over U+007F, it crashes.

h1. To reproduce:

1) Create a CSV file named {{test.csv}} with this content:

{code}
Header
123.45
{code}

2) Then run this code on Python 3:

{code:python}
>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string
{code}

Note that the file object is opened in text mode. Specifying an encoding 
doesn't help:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string
{code}

If I open the file in binary mode it works:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double
{code}

I tried this with a file encoded in UTF-16 and it failed with a UnicodeDecodeError:

{code}
Traceback (most recent call last):
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
    return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
{code}

Presumably this is because the code always assumes the file is in UTF-8.
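
The 0xff at position 0 is consistent with a UTF-16 byte-order mark being decoded as UTF-8. A minimal sketch using only the standard library follows; the sample content and the explicit little-endian BOM are assumptions, since the original UTF-16 file is not shown:

```python
import codecs

# Assumed sample content; the UTF-16 file from the report is not available.
text = "Header\n123.45\n"

# Build UTF-16 little-endian bytes with an explicit BOM (FF FE) so the
# result does not depend on the machine's native byte order.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def try_decode_utf8(data):
    """Return the decoded text, or the UnicodeDecodeError raised by UTF-8 decoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_decode_utf8(utf16_bytes)
print(result)
```

The first byte of the BOM is never a valid UTF-8 start byte, so any code path that unconditionally decodes as UTF-8 would fail this way.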

h2. Python 2 behavior

Python 2 behaves differently: it uses the ASCII codec by default, so when 
handed a file encoded in UTF-8, {{read_csv}} returns without an error. The 
failure only appears when accessing the table:

{code}
>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))

>>> list(t)
Traceback (most recent call last):
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
    result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)

'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
{code}
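
The same class of error can be reproduced in isolation by decoding UTF-8 bytes with the ASCII codec. This is only a sketch written for Python 3 with an explicit decode call; the sample character is an assumption, chosen because its UTF-8 encoding starts with 0xe4 like the byte in the traceback:

```python
# Assumed sample: U+4E2D, a CJK character whose UTF-8 encoding is e4 b8 ad.
utf8_bytes = "\u4e2d".encode("utf-8")

try:
    utf8_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    # Any byte >= 0x80 is outside the ASCII range, so decoding fails here.
    ascii_error = exc

print(ascii_error)
```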


h1. Expectation

We should be able to hand read_csv() a file in text mode so that the CSV file 
can be in any text encoding. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
