[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r441633483



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1522,19 +1522,20 @@ def _create_parquet_dataset_partitioned(root_path):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_parquet_dataset_factory_partitioned(tempdir):
-# TODO support for specifying partitioning scheme
-
 root_path = tempdir / "test_parquet_dataset_factory_partitioned"
 metadata_path, table = _create_parquet_dataset_partitioned(root_path)
 
-dataset = ds.parquet_dataset(metadata_path)
-# TODO partition column not yet included
-# assert dataset.schema.equals(table.schema)
+partitioning = ds.partitioning(flavor="hive")
+dataset = ds.parquet_dataset(metadata_path,
+ partitioning=partitioning,
+ partition_base_dir=str(root_path))

Review comment:
   It is not needed anymore and fixed :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r441621088



##
File path: cpp/src/arrow/dataset/file_parquet.h
##
@@ -215,6 +215,34 @@ class ARROW_DS_EXPORT ParquetFileFragment : public 
FileFragment {
   friend class ParquetFileFormat;
 };
 
+struct ParquetFactoryOptions {
+  // Either an explicit Partitioning or a PartitioningFactory to discover one.
+  //
+  // If a factory is provided, it will be used to infer a schema for partition 
fields
+  // based on file and directory paths then construct a Partitioning. The 
default
+  // is a Partitioning which will yield no partition information.
+  //
+  // The (explicit or discovered) partitioning will be applied to discovered 
files
+  // and the resulting partition information embedded in the Dataset.
+  PartitioningOrFactory partitioning{Partitioning::Default()};
+
+  // For the purposes of applying the partitioning, paths will be stripped
+  // of the partition_base_dir. Files not matching the partition_base_dir
+  // prefix will be skipped for partition discovery. The ignored files will 
still
+  // be part of the Dataset, but will not have partition information.
+  //
+  // Example:
+  // partition_base_dir = "/dataset";
+  //
+  // - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the 
partitioning
+  //
+  // - "/home/john/late_sales.csv" -> Will be ignored for partition discovery.
+  //
+  // This is useful for partitioning which parses directory when ordering
+  // is important, e.g. DirectoryPartitioning.
+  std::string partition_base_dir;

Review comment:
   That's a good point. I'll follow what FileSystemFactory does with the 
selector base's path.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org