Thomas Buhrmann created ARROW-16682:
---------------------------------------

             Summary: [Python] CSV reader: allow parsing without encoding errors
                 Key: ARROW-16682
                 URL: https://issues.apache.org/jira/browse/ARROW-16682
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 8.0.0
            Reporter: Thomas Buhrmann


When trying to read arbitrary CSV files, it is not possible to infer/guess the 
correct encoding 100% of the time. The Arrow CSV reader will currently fail if 
any byte cannot be decoded given the specified encoding (see example below).

With pandas.read_csv(), I can often get a result that is 99.9% correct by 
passing it a text stream decoded in Python with 
[errors="replace"|https://docs.python.org/3/library/codecs.html#error-handlers] 
(or "ignore" etc.).

PyArrow's csv.read_csv(), on the other hand, accepts neither an already decoded 
text stream (TypeError: binary file expected, got text file) nor a parameter 
to configure how decoding errors are handled. As a result, the parser simply 
fails.
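
One possible workaround, given here only as a minimal sketch, is to decode the 
bytes leniently in Python and re-encode them as UTF-8 before handing them to 
the reader. The helper name transcode_to_utf8 is hypothetical, and the sketch 
assumes the whole file fits in memory (it holds a decoded copy alongside the 
raw bytes).

```python
import io


def transcode_to_utf8(raw, encoding, errors="replace"):
    """Decode `raw` bytes leniently, then return a binary stream of valid UTF-8.

    Hypothetical helper: undecodable bytes are handled by Python's codec
    error handler (`errors`), not by the CSV reader.
    """
    text = raw.decode(encoding, errors=errors)
    return io.BytesIO(text.encode("utf-8"))


raw = "col_😀_1,col2\n0,a\n1,b\n".encode("utf-8")
clean = transcode_to_utf8(raw, "ascii")

# `clean` is a binary stream containing only valid UTF-8, so it could be
# passed to pyarrow.csv.read_csv(clean) with the default encoding.
header = clean.getvalue().decode("utf-8").splitlines()[0]
```

Each of the four bytes of the emoji is replaced by U+FFFD, which matches what 
the TextIOWrapper-based pandas approach produces.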

The simplest solution would probably be to expose Python's error handling in 
pyarrow.csv.ReadOptions (e.g. encoding_errors: "strict" | "ignore" | "replace" 
...).
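
To illustrate what the proposed values would mean, the following sketch runs 
Python's own codec error handlers over the same bytes. It uses only 
bytes.decode(); encoding_errors itself is the option suggested in this issue 
and is assumed not to exist in pyarrow.csv.ReadOptions yet.

```python
raw = "col_😀_1".encode("utf-8")

results = {}
for handler in ("strict", "ignore", "replace"):
    try:
        # Same call a reader could make internally with a configurable handler.
        results[handler] = raw.decode("ascii", errors=handler)
    except UnicodeDecodeError as exc:
        # "strict" is the current behaviour: the first bad byte aborts decoding.
        results[handler] = f"error: {exc.reason}"

for handler, value in results.items():
    print(f"{handler:8} -> {value!r}")
```

Here "ignore" silently drops the undecodable bytes, "replace" substitutes 
U+FFFD for each of them, and "strict" raises.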

It would also be useful to document the behaviour of the CSV reader, e.g. that 
it only accepts binary streams, and how encoding errors are handled. In 
particular, it is unclear what "Columns that cannot decode using this encoding 
can still be read as Binary" means, since the parser currently fails if any 
bytes cannot be decoded.

Toy example:

 
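{code:python}
import io

import pyarrow as pa
import pyarrow.csv  # makes the pa.csv submodule available

# UTF-8 text containing a non-ASCII character in the header
txt = """
col_😀_1, col2
0,a
1,b
"""
buffer = io.BytesIO(txt.encode("utf-8"))
# Deliberately wrong encoding: fails on the first undecodable byte
pa.csv.read_csv(buffer, read_options=pa.csv.ReadOptions(encoding="ascii")){code}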
{noformat}
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 5: ordinal 
not in range(128){noformat}
whereas with pandas:
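{code:python}
import pandas as pd

# Reuses io and txt from the example above
buffer = io.BytesIO(txt.encode("utf-8"))
# Decode in Python, replacing undecodable bytes with U+FFFD
text = io.TextIOWrapper(buffer, encoding="ascii", errors="replace")
pd.read_csv(text){code}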
{noformat}
   col_����_1  col2
0           0     a
1           1     b
{noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
