[
https://issues.apache.org/jira/browse/ARROW-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109551#comment-16109551
]
Wes McKinney commented on ARROW-1306:
-------------------------------------
This is happening because Windows unicode file names have to be encoded to
UTF-16LE (see
https://github.com/apache/arrow/blob/master/python/pyarrow/compat.py#L133).
This definitely should be fixed; marked for 0.6.0. I'm not sure of the right
fix without taking a deeper look. Because the Arrow file APIs take
{{std::string}}, they aren't aware of how the file name was encoded when they
generate the error message. One way to handle it might be to add some kind of
auxiliary data structure that holds both the platform-encoded path and a UTF-8
path. On Linux/macOS they'll be the same, but we can use the UTF-8 version for
making error messages.
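
To make the idea concrete, here is a minimal Python sketch of such a two-encoding path holder. Everything in it is hypothetical: {{PlatformPath}}, {{from_str}}, {{open_error}} and {{misread_as_utf16le}} are illustrative names, not Arrow or pyarrow APIs, and a real fix would presumably live in the C++ layer. The last helper shows one way the garbled name in the report below could arise: decoding the ASCII bytes of the file name as UTF-16LE yields the same characters. That matches the symptom but is not a confirmed diagnosis.

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformPath:
    """Hypothetical holder for two encodings of one file name."""

    utf8: bytes    # always UTF-8; safe to embed in error messages
    native: bytes  # bytes in the encoding the platform expects

    @classmethod
    def from_str(cls, path: str, windows: bool = sys.platform == "win32"):
        utf8 = path.encode("utf-8")
        # Assumption: Windows needs UTF-16LE, other platforms take UTF-8 as-is.
        native = path.encode("utf-16-le") if windows else utf8
        return cls(utf8=utf8, native=native)

    def open_error(self) -> str:
        # Build the message from the UTF-8 form, never from `native`.
        return "Failed to open file: " + self.utf8.decode("utf-8")


def misread_as_utf16le(path: str) -> str:
    """Decode the UTF-8 bytes of `path` as if they were UTF-16LE."""
    raw = path.encode("utf-8")
    raw = raw[: len(raw) - len(raw) % 2]  # UTF-16 needs an even byte count
    return raw.decode("utf-16-le")


name = "non_existent_file.parquet"
p = PlatformPath.from_str(name, windows=True)
garbled = misread_as_utf16le(name)

print(p.open_error())
print(len(p.native), len(p.utf8))
```

With {{windows=False}} the two fields hold identical bytes, which corresponds to the Linux/macOS case described above.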
> [Python] Encoding? issue with error reporting for parquet.read_table
> --------------------------------------------------------------------
>
> Key: ARROW-1306
> URL: https://issues.apache.org/jira/browse/ARROW-1306
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.5.0
> Reporter: Chris Bartak
> Fix For: 0.6.0
>
>
> This only affects error reporting: the filename in the exception for a
> not-found file is getting garbled. Example below:
> {code}
> import pyarrow.parquet as pq
> pq.read_table('non_existent_file.parquet')
> ArrowIOError Traceback (most recent call last)
> pq.read_table('non_existent_file.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError Traceback (most recent call last)
> ----> 1 pq.read_table('non_existent_file.parquet')
> ~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pyarrow\parquet.py
> in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 709 metadata=metadata)
> 710
> --> 711 pf = ParquetFile(source, metadata=metadata)
> 712 return pf.read(columns=columns, nthreads=nthreads,
> 713 use_pandas_metadata=use_pandas_metadata)
> ~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pyarrow\parquet.py
> in __init__(self, source, metadata, common_metadata)
> 52 def __init__(self, source, metadata=None, common_metadata=None):
> 53 self.reader = ParquetReader()
> ---> 54 self.reader.open(source, metadata=metadata)
> 55 self.common_metadata = common_metadata
> 56
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> io.pxi in pyarrow.lib.get_reader()
> io.pxi in pyarrow.lib.memory_map()
> io.pxi in pyarrow.lib.MemoryMappedFile._open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: IOError: Failed to open file: 潮彮硥獩整瑮晟汩慰煲敵
> {code}
> Versions - Python 3.6, Windows x64
> {code}
> arrow-cpp: 0.5.0-np112py36_vc14_1 conda-forge [vc14]
> parquet-cpp: 1.2.0.pre-vc14_3 conda-forge [vc14]
> pyarrow: 0.5.0-np112py36_vc14_0 conda-forge [vc14]
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)