[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

GitBox Thu, 18 Jun 2020 03:09:05 -0700


jorisvandenbossche commented on a change in pull request #7468:
URL: https://github.com/apache/arrow/pull/7468#discussion_r442117037




##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -407,42 +407,82 @@ cdef class UnionDataset(Dataset):
 
 
 cdef class FileSystemDataset(Dataset):
-    """A Dataset created from a set of files on a particular filesystem.
+    """A Dataset of file fragments.
+
+    A FileSystemDataset is composed of one or more FileFragment.
 
     Parameters
     ----------
-    paths_or_selector : Union[FileSelector, List[FileInfo]]
-        List of files/directories to consume.
+    fragments : list[Fragments]
+        List of fragments to consume.
     schema : Schema
-        The top-level schema of the DataDataset.
+        The top-level schema of the Dataset.
     format : FileFormat
-        File format to create fragments from, currently only
-        ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
-    filesystem : FileSystem
-        The filesystem which files are from.
-    partitions : List[Expression], optional
-        Attach additional partition information for the file paths.
+        File format of the fragments, currently only ParquetFileFormat,
+        IpcFileFormat, and CsvFileFormat are supported.
     root_partition : Expression, optional
         The top-level partition of the DataDataset.
     """
 
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, paths_or_selector, schema=None, format=None,
-                 filesystem=None, partitions=None, root_partition=None):
+    def __init__(self, fragments, Schema schema, FileFormat format,
+                 root_partition=None):
         cdef:
-            FileInfo info
-            Expression expr
             FileFragment fragment
-            vector[CFileInfo] c_file_infos
-            vector[shared_ptr[CExpression]] c_partitions
-            shared_ptr[CFileFragment] c_fragment
             vector[shared_ptr[CFileFragment]] c_fragments
             CResult[shared_ptr[CDataset]] result
 
         root_partition = root_partition or _true
+        if not isinstance(root_partition, Expression):
+            raise TypeError(
+                "Argument 'root_partition' has incorrect type (expected "
+                "Epression, got {0})".format(type(root_partition))
+            )
+
+        for fragment in fragments:
+            c_fragments.push_back(
+                static_pointer_cast[CFileFragment, CFragment](
+                    fragment.unwrap()))
+
+        result = CFileSystemDataset.Make(
+            pyarrow_unwrap_schema(schema),
+            (<Expression> root_partition).unwrap(),
+            (<FileFormat> format).unwrap(),
+            c_fragments
+        )
+        self.init(GetResultValue(result))
+
+    cdef void init(self, const shared_ptr[CDataset]& sp):
+        Dataset.init(self, sp)
+        self.filesystem_dataset = <CFileSystemDataset*> sp.get()
+
+    @classmethod
+    def from_paths(cls, paths, schema=None, format=None,

Review comment:
       Yes, it's not optional in practice: a bit below I check that the value 
is a Schema (and not None). 
   I did this here to give proper error messages when omitting one of those 
(there is some discussion about this in the original PR that added this: 
https://github.com/apache/arrow/pull/6913#discussion_r407510524), because 
cython is otherwise generating very confusing error messages. I still need to 
create a minimal example to report to cython ..
   
   But added a test for `from_paths` that covers this to ensure it is indeed 
checked.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

Reply via email to