[
https://issues.apache.org/jira/browse/ARROW-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870919#comment-16870919
]
Robin Kåveland commented on ARROW-5647:
---------------------------------------
This error from Arrow is a bit misleading. I'm guessing you wrote this parquet
file with Spark?
Spark tends to put non-parquet files in a parquet folder / dataset, files
named things like {{_started}}, {{_committed}}, {{_SUCCESS}} and so on. These
aren't valid parquet files; Spark uses them for bookkeeping. I believe it has
to do with writing datasets to blob storage, which doesn't support operations
like atomic directory renames.
I'm not sure why Arrow attempts to read them as parquet files, but you can
easily work around it. Here are two things I do to get around this issue:
{code:python}
import glob

import pyarrow.parquet as pq

# Pick up only the actual parquet segments, skipping Spark's bookkeeping files
segments = glob.glob('/dbfs/mnt/aa/example.parquet/*.parquet')
pdf = pq.ParquetDataset(segments).read().to_pandas()
{code}
This won't always work, e.g. you may have a more deeply nested parquet dataset.
In that case I've found no way around this other than deleting the offending
files.
But I bet that if you were to check, you'd find that your {{example.parquet}}
is a folder containing exactly one file that ends with {{.parquet}}, so you
could just read that one directly.
> [Python] Accessing a file from Databricks using pandas read_parquet using the
> pyarrow engine fails with : Passed non-file path: /mnt/aa/example.parquet
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-5647
> URL: https://issues.apache.org/jira/browse/ARROW-5647
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Environment: Azure Databricks
> Reporter: Simon Lidberg
> Priority: Major
> Attachments: arrow_error.txt
>
>
> When trying to access a file using a mount point pointing to an Azure blob
> storage account the code fails with the following error:
> {noformat}
> OSError: Passed non-file path: /mnt/aa/example.parquet
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <command-1848295812523966> in <module>()
> ----> 1 pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
>       2 display(pddf2)
> /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     280 
>     281     impl = get_engine(engine)
> --> 282     return impl.read(path, columns=columns, **kwargs)
> /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     127         kwargs['use_pandas_metadata'] = True
>     128         result = self.api.parquet.read_table(path, columns=columns,
> --> 129                                              **kwargs).to_pandas()
>     130         if should_close:
>     131             try:
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153 
>    1154     pf = ParquetFile(source, metadata=metadata)
> /databricks/python/lib/python3.5/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     177         from pyarrow.parquet import ParquetDataset
>     178         dataset = ParquetDataset(path, schema=schema, metadata=metadata,
> --> 179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
>     181                             use_pandas_metadata=use_pandas_metadata)
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
>     933                          self.metadata_path) = _make_manifest(
>     934             path_or_paths, self.fs, metadata_nthreads=metadata_nthreads,
> --> 935             open_file_func=self._open_file_func)
>     936 
>     937         if self.common_metadata_path is not None:
> /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
>    1108         if not fs.isfile(path):
>    1109             raise IOError('Passed non-file path: {0}'
> -> 1110                           .format(path))
>    1111         piece = ParquetDatasetPiece(path, open_file_func=open_file_func)
>    1112         pieces.append(piece)
> OSError: Passed non-file path: /mnt/aa/example.parquet
> {noformat}
>
> I am using the following code from a Databricks notebook to reproduce the
> issue:
> {code:bash}
> %sh
> sudo apt-get -y install python3-pip
> /databricks/python3/bin/pip3 uninstall pandas -y
> /databricks/python3/bin/pip3 uninstall numpy -y
> /databricks/python3/bin/pip3 uninstall pyarrow -y
> {code}
>
> {code:bash}
> %sh
> /databricks/python3/bin/pip3 install numpy==1.14.0
> /databricks/python3/bin/pip3 install pandas==0.24.1
> /databricks/python3/bin/pip3 install pyarrow==0.13.0
> {code}
>
> {code:python}
> dbutils.fs.mount(
>   source = "wasbs://<mycontainer>@<mystorageaccount>.blob.core.windows.net",
>   mount_point = "/mnt/aa",
>   extra_configs = {"fs.azure.account.key.<mystorageaccount>.blob.core.windows.net": dbutils.secrets.get(scope = "storage", key = "blob_key")})
> {code}
>
> {code:python}
> pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
> display(pddf2)
> {code}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)