[GitHub] [arrow] pitrou commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

GitBox Mon, 28 Mar 2022 07:19:13 -0700


pitrou commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r836480624




##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))

Review comment:
       The variable name doesn't match what's below:
   ```suggestion
       >>> partitioning = FilenamePartitioning(
       ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
   ```

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -406,6 +445,18 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
            equal(field_ref("beta"), literal("hello"))));
 }
 
+TEST_F(TestPartitioning, FilenamePartitioningFormat) {
+  partitioning_ = std::make_shared<FilenamePartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+
+  written_schema_ = partitioning_->schema();
+
+  AssertFormat(and_(equal(field_ref("alpha"), literal(0)),
+                    equal(field_ref("beta"), literal("hello"))),
+               "", "0_hello_");

Review comment:
       What happens if instead of `literal("hello")` I pass
`literal("foo_bar")` or `literal("foo/bar")`?
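
       To make the concern concrete, here is a minimal pure-Python sketch. It is a toy model only, not Arrow's implementation: `format_segments` and `split_segments` are hypothetical helpers that assume values are joined with `_` plus a trailing `_`, as in the `"0_hello_"` expectation above. Under that assumption, a value that itself contains `_` no longer round-trips to one segment per field.

```python
# Toy model only: NOT Arrow's implementation. The helpers below are
# hypothetical and assume values are joined with "_" plus a trailing "_",
# mirroring the "0_hello_" string expected by the test above.
SEPARATOR = "_"


def format_segments(values):
    """Join partition values with '_' and append a trailing separator."""
    return "".join(str(value) + SEPARATOR for value in values)


def split_segments(prefix):
    """Split a formatted prefix back into its raw segments."""
    # Drop the trailing separator before splitting.
    if prefix.endswith(SEPARATOR):
        prefix = prefix[: -len(SEPARATOR)]
    return prefix.split(SEPARATOR)


plain = format_segments([0, "hello"])
tricky = format_segments([0, "foo_bar"])

# Two fields were written in both cases; compare the segment counts.
print(plain, split_segments(plain))
print(tricky, split_segments(tricky))
```

       In this toy model the second value splits into three segments for a two-field schema, so the partitioning would presumably need to either escape the separator or reject such values when formatting.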

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))

Review comment:
       Actually, this gives an error:
   ```
   >>> partitioning.parse("2009_11")
   Traceback (most recent call last):
     Input In [11] in <cell line: 1>
       partitioning.parse("2009_11")
     File pyarrow/_dataset.pyx:1252 in pyarrow._dataset.Partitioning.parse
       return Expression.wrap(GetResultValue(result))
     File pyarrow/error.pxi:143 in pyarrow.lib.pyarrow_internal_check_status
       return check_status(status)
     File pyarrow/error.pxi:99 in pyarrow.lib.check_status
       raise ArrowInvalid(message)
   ArrowInvalid: error parsing '' as scalar of type int8
   ```

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).

Review comment:
       Judging by the example, probably this should be `"2009_11_"`?
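
       One possible reading of why the trailing `_` matters, sketched as toy Python. This is an assumption about the behaviour, not Arrow's actual code: the hypothetical `parse_toy` treats only the text before the last `_` as the partition prefix, so `"2009_11"` leaves the `month` field with an empty segment.

```python
# Toy model only: NOT Arrow's parser. It ASSUMES that everything after the
# last "_" is treated as the rest of the file name, so only the text before
# the last "_" is split into partition segments.
FIELDS = [("year", int), ("month", int)]


def parse_toy(filename, fields):
    """Parse '_'-separated segments of `filename` into a dict of field values."""
    prefix, sep, _rest = filename.rpartition("_")
    segments = prefix.split("_") if sep else []
    # Pad with empty segments so every field gets something to parse.
    segments += [""] * (len(fields) - len(segments))
    result = {}
    for (name, convert), segment in zip(fields, segments):
        try:
            result[name] = convert(segment)
        except ValueError:
            raise ValueError(
                f"error parsing {segment!r} as value for field {name!r}"
            ) from None
    return result


# With a trailing separator, both fields are found.
print(parse_toy("2009_11_", FIELDS))

# Without it, "11" is taken as the rest of the name and "month" is empty.
try:
    parse_toy("2009_11", FIELDS)
except ValueError as exc:
    print(exc)
```

       If the real parser behaves along these lines, the docstring example would need the trailing `_` to parse both fields.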

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review comment:
       It would be nice IMHO.




