Re: [PR] GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format [arrow]

via GitHub Wed, 04 Sep 2024 19:21:40 -0700


amoeba commented on code in PR #43740:
URL: https://github.com/apache/arrow/pull/43740#discussion_r1744706643



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2505,6 +2505,43 @@ cdef class Partitioning(_Weakrefable):
         result = self.partitioning.Parse(tobytes(path))
         return Expression.wrap(GetResultValue(result))
 
+    def format(self, expr):
+        """
+        Convert a filter expression into a tuple of (directory, filename) using
+        the current partitioning scheme
+
+        Parameters
+        ----------
+        expr : pyarrow.dataset.Expression
+
+        Returns
+        -------
+        tuple[str, str]
+
+        Examples
+        --------
+
+        Specify the Schema for paths like "/2009/June":
+
+        >>> import pyarrow as pa
+        >>> import pyarrow.dataset as ds
+        >>> import pyarrow.compute as pc
+        >>> part = ds.partitioning(pa.schema([("year", pa.int16()),
+        ...                                   ("month", pa.string())]))
+        >>> part.format(
+        ...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
+        ... )
+        """
+        cdef:
+            CResult[CPartitionPathFormat] result
+            CPartitionPathFormat result_value
+        result = self.partitioning.Format(
+            Expression.unwrap(expr)
+        )
+        result_value = GetResultValue(result)
+
+        return frombytes(result_value.directory), frombytes(result_value.filename)

Review Comment:
   ```suggestion
           cdef:
               CPartitionPathFormat result
   
           result = GetResultValue(self.partitioning.Format(
               Expression.unwrap(expr)
           ))
   
           return frombytes(result.directory), frombytes(result.filename)
   ```
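
For readers wondering what the `(directory, filename)` tuple looks like, here is a minimal pure-Python sketch. It assumes a directory-style scheme that joins the partition values in schema order with "/" and leaves the filename empty. `format_directory_partition` is a hypothetical helper for illustration only: it is not part of pyarrow and does not call Arrow's C++ `Partitioning::Format`.

```python
from typing import Mapping, Sequence, Tuple


def format_directory_partition(
    field_names: Sequence[str], values: Mapping[str, object]
) -> Tuple[str, str]:
    """Hypothetical model of directory-style partition formatting.

    Assumption: every field in ``field_names`` has a value in ``values``;
    a missing field raises KeyError. The real implementation may treat
    missing or null keys differently.
    """
    # One path segment per partition field, in schema order.
    segments = [str(values[name]) for name in field_names]
    # Assumption: a directory-style scheme puts nothing in the filename.
    return "/".join(segments), ""


directory, filename = format_directory_partition(
    ["year", "month"], {"year": 1862, "month": "Jan"}
)
print((directory, filename))  # ('1862/Jan', '')
```

The sketch takes the equality values as a plain mapping, whereas the method under review takes an `Expression` and extracts the keys itself.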



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1216,7 +1222,6 @@ def test_read_table_duplicate_column_selection(tempdir):
 
 
 def test_dataset_partitioning(tempdir):
-    import pyarrow.dataset as ds

Review Comment:
   I don't think this and a few of the other changes here are necessary. Can you revert just the ones that are formatting-related?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

