jorisvandenbossche commented on a change in pull request #9192:
URL: https://github.com/apache/arrow/pull/9192#discussion_r557464597
##########
File path: cpp/src/arrow/filesystem/hdfs.cc
##########
@@ -69,6 +69,14 @@ class HadoopFileSystem::Impl {
HdfsOptions options() const { return options_; }
Result<FileInfo> GetFileInfo(const std::string& path) {
+ // It has unfortunately been a frequent logic error to pass URIs down
+ // to GetFileInfo (e.g. ARROW-10264). Unlike other filesystems, HDFS
+ // silently accepts URIs but returns different results than if given the
+ // equivalent in-filesystem paths. Instead of raising cryptic errors
+ // later, notify the underlying problem immediately.
+ if (path.substr(0, 5) == "hdfs:") {
Review comment:
Should this also check for "viewfs:"?
(I am not familiar with it; I only know that some places in the Python/Cython
code check for this scheme as well.)
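To make the suggestion concrete, here is a minimal sketch of a check that
covers both schemes. It is written in Python for illustration only and is not
the C++ patch itself; the helper name `looks_like_hdfs_uri` and the scheme
tuple are assumptions, not code from this PR.

```python
# Hypothetical helper: the scheme list is an assumption based on the
# comment above, not something taken from the Arrow code base.
HDFS_URI_SCHEMES = ("hdfs", "viewfs")


def looks_like_hdfs_uri(path: str) -> bool:
    """Return True if `path` starts with one of the HDFS-style URI schemes."""
    return any(path.startswith(scheme + ":") for scheme in HDFS_URI_SCHEMES)
```

A plain in-filesystem path such as `/user/data/file.parquet` would pass
through, while anything starting with `hdfs:` or `viewfs:` would be flagged.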
##########
File path: python/pyarrow/parquet.py
##########
@@ -1493,15 +1493,16 @@ def __init__(self, path_or_paths, filesystem=None,
filters=None,
single_file = path_or_paths[0]
else:
if _is_path_like(path_or_paths):
- path = str(path_or_paths)
+ path_or_paths = str(path_or_paths)
if filesystem is None:
# path might be a URI describing the FileSystem as well
try:
- filesystem, path = FileSystem.from_uri(path)
+ filesystem, path_or_paths = FileSystem.from_uri(
+ path_or_paths)
Review comment:
Ah, good catch. So further down we were still passing the original
`path_or_paths` URI to the dataset constructor (instead of the non-URI path
returned by `from_uri`), while also passing the filesystem inferred from that
URI.
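The mismatch can be sketched roughly as follows. This uses only the standard
library: `split_uri` is a hypothetical stand-in for `FileSystem.from_uri`, and
the example URI is made up.

```python
from urllib.parse import urlsplit


def split_uri(uri):
    """Hypothetical stand-in for FileSystem.from_uri.

    Returns a (filesystem identifier, in-filesystem path) pair.
    """
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}", parts.path


uri = "hdfs://namenode:8020/data/table"  # made-up example URI

# Before the fix: the filesystem is inferred from the URI, but the
# original URI (not the returned path) is what gets passed along.
filesystem, path = split_uri(uri)
buggy_args = (filesystem, uri)

# After the fix: the same variable is rebound to the non-URI path,
# so the filesystem and the path passed along are consistent.
path_or_paths = uri
filesystem, path_or_paths = split_uri(path_or_paths)
fixed_args = (filesystem, path_or_paths)
```

In the buggy variant the second element is still a full URI, which is the
kind of input the new `GetFileInfo` check in hdfs.cc is meant to reject early.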