jorisvandenbossche commented on a change in pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#discussion_r478689723



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -539,7 +549,7 @@ cdef class FileSystemDataset(Dataset):
         ]:
             if not isinstance(arg, class_):
                 raise TypeError(
-                    "Argument '{0}' has incorrect type (expected {1}, "
+                    "Argument '{0}' wtf has incorrect type (expected {1}, "

Review comment:
       This doesn't look like an intended change? ;-)

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -467,12 +467,13 @@ cdef class FileSystemDataset(Dataset):
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, fragments, Schema schema, FileFormat format,
+    def __init__(self, filesystem, fragments, Schema schema, FileFormat format,

Review comment:
       Can you add this to the docstring?
   I would also move this keyword after `format`; I think `fragments` is the most
logical argument to come first.
   
   If we want to keep this backwards compatible, we could also make this
keyword optional and, if it is not specified, take it from the first fragment.
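
   A rough sketch of the optional-keyword idea, in plain Python rather than
Cython. `Fragment` and `resolve_filesystem` are hypothetical stand-ins used
only for illustration, not existing pyarrow names:

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class Fragment:
    """Hypothetical stand-in for a file fragment that knows its filesystem."""
    path: str
    filesystem: Any


def resolve_filesystem(fragments: Sequence[Fragment],
                       filesystem: Optional[Any] = None) -> Any:
    """Return the explicit filesystem, else the first fragment's filesystem."""
    if filesystem is not None:
        # An explicitly passed filesystem always wins.
        return filesystem
    if not fragments:
        # Nothing to infer from: make the failure explicit.
        raise ValueError("cannot infer filesystem from an empty fragment list")
    return fragments[0].filesystem
```

   With something like this, existing callers that do not pass a filesystem
would keep working, since the value is taken from the first fragment.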

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -82,25 +82,34 @@ Result<ScanTaskIterator> FileFragment::Scan(std::shared_ptr<ScanOptions> options
 FileSystemDataset::FileSystemDataset(std::shared_ptr<Schema> schema,
                                      std::shared_ptr<Expression> root_partition,
                                      std::shared_ptr<FileFormat> format,
+                                     std::shared_ptr<fs::FileSystem> filesystem,
                                      std::vector<std::shared_ptr<FileFragment>> fragments)
     : Dataset(std::move(schema), std::move(root_partition)),
       format_(std::move(format)),
+      filesystem_(std::move(filesystem)),
       fragments_(std::move(fragments)) {}
 
 Result<std::shared_ptr<FileSystemDataset>> FileSystemDataset::Make(
     std::shared_ptr<Schema> schema, std::shared_ptr<Expression> root_partition,
-    std::shared_ptr<FileFormat> format,
+    std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
     std::vector<std::shared_ptr<FileFragment>> fragments) {
-  return std::shared_ptr<FileSystemDataset>(
-      new FileSystemDataset(std::move(schema), std::move(root_partition),
-                            std::move(format), std::move(fragments)));
+  for (const auto& fragment : fragments) {
+    if ((filesystem == nullptr && fragment->source().filesystem() != nullptr) ||
+        (filesystem != nullptr &&
+         !fragment->source().filesystem()->Equals(*filesystem))) {
+      return Status::Invalid("FileSystemDataset's filesystem differed from a fragment's");
+    }
+  }

Review comment:
       This validation should not be needed in many cases (e.g. with filesystem
discovery or ParquetFactory, we know for sure that all fragments already come
from the same filesystem), so I think we should avoid it when possible.
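
   One possible shape for this is a flag that trusted callers can switch off.
A minimal Python sketch of the idea follows (the actual change would live in
the C++ `Make`). `make_dataset`, `validate`, and `Fragment` are hypothetical
names, and comparing filesystems with plain `!=` is an assumption made for the
sketch:

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence


@dataclass
class Fragment:
    """Hypothetical stand-in for a file fragment that knows its filesystem."""
    path: str
    filesystem: Any


def make_dataset(fragments: Sequence[Fragment], filesystem: Any,
                 validate: bool = True) -> Dict[str, Any]:
    """Build a dataset-like dict, optionally checking fragment filesystems."""
    if validate:
        # Only pay for the per-fragment check when the caller asks for it.
        for fragment in fragments:
            if fragment.filesystem != filesystem:
                raise ValueError(
                    "dataset's filesystem differed from a fragment's")
    return {"filesystem": filesystem, "fragments": list(fragments)}
```

   Callers that construct all fragments from a single filesystem could pass
`validate=False` and skip the loop, while user-supplied fragments would still
be checked by default.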



