[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

GitBox Tue, 16 Jun 2020 02:35:15 -0700


jorisvandenbossche commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r440673889




##########
File path: python/pyarrow/dataset.py
##########
@@ -445,7 +446,8 @@ def _union_dataset(children, schema=None, **kwargs):
     return UnionDataset(schema, children)
 
 
-def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None):
+def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None,
+                    partitioning=None, partition_base_dir=None):

Review comment:
       Can you also update the docstring here?

##########
File path: cpp/src/arrow/dataset/file_parquet.h
##########
@@ -215,6 +215,34 @@ class ARROW_DS_EXPORT ParquetFileFragment : public 
FileFragment {
   friend class ParquetFileFormat;
 };
 
+struct ParquetFactoryOptions {
+  // Either an explicit Partitioning or a PartitioningFactory to discover one.
+  //
+  // If a factory is provided, it will be used to infer a schema for partition 
fields
+  // based on file and directory paths then construct a Partitioning. The 
default
+  // is a Partitioning which will yield no partition information.
+  //
+  // The (explicit or discovered) partitioning will be applied to discovered 
files
+  // and the resulting partition information embedded in the Dataset.
+  PartitioningOrFactory partitioning{Partitioning::Default()};
+
+  // For the purposes of applying the partitioning, paths will be stripped
+  // of the partition_base_dir. Files not matching the partition_base_dir
+  // prefix will be skipped for partition discovery. The ignored files will 
still
+  // be part of the Dataset, but will not have partition information.
+  //
+  // Example:
+  // partition_base_dir = "/dataset";
+  //
+  // - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the 
partitioning
+  //
+  // - "/home/john/late_sales.csv" -> Will be ignored for partition discovery.
+  //
+  // This is useful for partitioning which parses directory when ordering
+  // is important, e.g. DirectoryPartitioning.
+  std::string partition_base_dir;

Review comment:
       Does this have a default? (eg I would assume the base directory of 
`_metadata` file)

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -618,15 +641,37 @@ ParquetDatasetFactory::CollectParquetFragments(
   }
 }
 
+Result<std::vector<std::shared_ptr<Schema>>> 
ParquetDatasetFactory::InspectSchemas(
+    InspectOptions options) {
+  std::shared_ptr<Schema> schema;
+  RETURN_NOT_OK(parquet::arrow::FromParquetSchema(metadata_->schema(), 
&schema));
+
+  // Gather paths found in RowGroups' ColumnChunks.
+  auto properties = MakeArrowReaderProperties(*format_, *metadata_);
+  ARROW_ASSIGN_OR_RAISE(auto paths, CollectPaths(*metadata_, properties));

Review comment:
       Should we only collect those paths here again if there is an actual 
partitioning? (when `options_.partitioning` is not the default "no 
partitioning") 
   (I don't know how costly this is, probably not much)

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1493,6 +1493,80 @@ cdef class UnionDatasetFactory(DatasetFactory):
         self.union_factory = <CUnionDatasetFactory*> sp.get()
 
 
+cdef class ParquetFactoryOptions:
+    """
+    Influences the discovery of parquet dataset.
+
+    Parameters
+    ----------
+    partition_base_dir : str, optional
+        For the purposes of applying the partitioning, paths will be
+        stripped of the partition_base_dir. Files not matching the
+        partition_base_dir prefix will be skipped for partitioning discovery.
+        The ignored files will still be part of the Dataset, but will not
+        have partition information.
+    """

Review comment:
       Can you include an entry for the `partitioning` parameter as well? 
   (for some reason it seems this is also not included in the 
FileSystemFactoryOptions, but I think it is better to include)

##########
File path: python/pyarrow/dataset.py
##########
@@ -486,8 +488,13 @@ def parquet_dataset(metadata_path, schema=None, 
filesystem=None, format=None):
         filesystem, _ = _ensure_filesystem(filesystem)
 
     metadata_path = _normalize_path(filesystem, _stringify_path(metadata_path))
+    options = ParquetFactoryOptions(
+        partition_base_dir=partition_base_dir,
+        partitioning=partitioning

Review comment:
       Can you add here a `partitioning = _ensure_partitioning(partitioning)` 
before passing it to ParquetFactoryOptions? 
   
   (that will eg make sure that `parquet_dataset(..., partitioning="hive")` 
will also work, similarly as for `dataset(..)`)

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1522,19 +1522,20 @@ def _create_parquet_dataset_partitioned(root_path):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_parquet_dataset_factory_partitioned(tempdir):
-    # TODO support for specifying partitioning scheme
-
     root_path = tempdir / "test_parquet_dataset_factory_partitioned"
     metadata_path, table = _create_parquet_dataset_partitioned(root_path)
 
-    dataset = ds.parquet_dataset(metadata_path)
-    # TODO partition column not yet included
-    # assert dataset.schema.equals(table.schema)
+    partitioning = ds.partitioning(flavor="hive")
+    dataset = ds.parquet_dataset(metadata_path,
+                                 partitioning=partitioning,
+                                 partition_base_dir=str(root_path))

Review comment:
       Related to my previous question about the default of this keyword: is it 
in this case needed to pass this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

Reply via email to