jorisvandenbossche commented on a change in pull request #7515:
URL: https://github.com/apache/arrow/pull/7515#discussion_r443663674



##########
File path: python/pyarrow/parquet.py
##########
@@ -1404,27 +1403,36 @@ def __init__(self, path_or_paths, filesystem=None, filters=None,
         self._filter_expression = filters and _filters_to_expression(filters)
 
         # check for single NativeFile dataset
-        if not isinstance(path_or_paths, list):
-            if not _is_path_like(path_or_paths):
-                fragment = parquet_format.make_fragment(path_or_paths)
-                self._dataset = ds.FileSystemDataset(
-                    [fragment], schema=fragment.physical_schema,
-                    format=parquet_format
-                )
-                return
-
-        # map old filesystems to new one
-        # TODO(dataset) deal with other file systems
-        if isinstance(filesystem, LocalFileSystem):
-            filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
-        elif filesystem is None and memory_map:
-            # if memory_map is specified, assume local file system (string
-            # path can in principle be URI for any filesystem)
-            filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
-
-        self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
-                                   format=parquet_format,
-                                   partitioning=partitioning)
+        if (not isinstance(path_or_paths, list) and
+                not _is_path_like(path_or_paths)):
+            fragment = parquet_format.make_fragment(path_or_paths)
+            dataset = ds.FileSystemDataset(
+                [fragment], schema=fragment.physical_schema,
+                format=parquet_format
+            )
+        else:
+            # map old filesystems to new one
+            # TODO(dataset) deal with other file systems
+            if isinstance(filesystem, LocalFileSystem):
+                filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
+            elif filesystem is None and memory_map:
+                # if memory_map is specified, assume local file system (string
+                # path can in principle be URI for any filesystem)
+                filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
+
+            dataset = ds.dataset(path_or_paths, filesystem=filesystem,
+                                 format=parquet_format,
+                                 partitioning=partitioning)
+
+        if split_row_groups:
+            fragments = dataset.get_fragments()
+            fragments = [rg for fragment in fragments
+                         for rg in fragment.split_by_row_group()]
+            dataset = ds.FileSystemDataset(
+                fragments, dataset.schema, dataset.format,
+                dataset.partition_expression
+            )

Review comment:
       This is basically what was requested in ARROW-2801, but I am not fully 
sure it is actually worth adding here (we would be adding it to ParquetDataset, 
and it is not yet clear whether we are keeping that class in the future). And 
if we do want it, it is perhaps better added to the actual Dataset class (or 
DatasetFactory) instead.
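
       For reference, a minimal sketch of what the `split_row_groups` branch in 
the diff does, expressed directly against `pyarrow.dataset`. The path is made 
up, and the positional `FileSystemDataset(...)` constructor arguments are taken 
from the diff above (the exact signature may differ between pyarrow versions):

```python
import pyarrow.dataset as ds

# Hypothetical path to a parquet dataset directory.
dataset = ds.dataset("/path/to/parquet_dir", format="parquet")

# Split every file-level fragment into one fragment per row group,
# mirroring the `split_row_groups` handling in the diff.
fragments = [
    rg_fragment
    for fragment in dataset.get_fragments()
    for rg_fragment in fragment.split_by_row_group()
]

# Re-wrap the per-row-group fragments in a FileSystemDataset so the
# result can be consumed like any other dataset.
split_dataset = ds.FileSystemDataset(
    fragments, dataset.schema, dataset.format,
    dataset.partition_expression
)

table = split_dataset.to_table()
```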



