jorisvandenbossche commented on code in PR #13155:
URL: https://github.com/apache/arrow/pull/13155#discussion_r875076518
##########
python/pyarrow/_dataset.pyx:
##########
@@ -405,6 +405,27 @@ cdef class Dataset(_Weakrefable):
use_threads=use_threads,
coalesce_keys=coalesce_keys,
output_type=InMemoryDataset)
+ def filter(self, expr):
+ """
+ Select rows from the Dataset.
+
+ The Dataset can be filtered based on a boolean :class:`Expression`
filter.
+
+ Parameters
+ ----------
+ expr : Expression
+ The boolean :class:`Expression` to filter the table with.
+
+ Returns
+ -------
+ filtered : InMemoryDataset
+ An InMemoryDataset of the same schema, with only the rows selected
+ by applied filtering
+
+ """
+ return _pc()._exec_plan._filter_table(self, expr,
Review Comment:
For the `Dataset` method, I think we should instead add this filter to the
Scanner, which already supports filtering (i.e. treat this as a different
way to express `dataset.scanner/to_table/..(filter=...)`)?
Wouldn't that avoid materializing the full table before wrapping it
again in an InMemoryDataset?
##########
python/pyarrow/table.pxi:
##########
@@ -2882,24 +2882,27 @@ cdef class Table(_PandasConvertible):
return pyarrow_wrap_table(result)
- def filter(self, mask, object null_selection_behavior="drop"):
+ def filter(self, mask_or_expr, object null_selection_behavior="drop"):
"""
Select rows from the table.
- See :func:`pyarrow.compute.filter` for full usage.
+ The Table can be filtered based on a mask, which will be passed to
+ :func:`pyarrow.compute.filter` to perform the filtering, or it can
+ be filtered through a boolean :class:`.Expression`
Parameters
----------
- mask : Array or array-like
- The boolean mask to filter the table with.
+ mask_or_expr : Array or array-like or .Expression
+ The boolean mask or the :class:`.Expression` to filter the table
with.
null_selection_behavior
- How nulls in the mask should be handled.
+ How nulls in the mask should be handled, does nothing if
+ an :class:`.Expression` is used.
Review Comment:
Is it not possible to pass this through to the filter node?
##########
python/pyarrow/table.pxi:
##########
@@ -2882,24 +2882,27 @@ cdef class Table(_PandasConvertible):
return pyarrow_wrap_table(result)
- def filter(self, mask, object null_selection_behavior="drop"):
+ def filter(self, mask_or_expr, object null_selection_behavior="drop"):
"""
Select rows from the table.
- See :func:`pyarrow.compute.filter` for full usage.
+ The Table can be filtered based on a mask, which will be passed to
+ :func:`pyarrow.compute.filter` to perform the filtering, or it can
+ be filtered through a boolean :class:`.Expression`
Parameters
----------
- mask : Array or array-like
- The boolean mask to filter the table with.
+ mask_or_expr : Array or array-like or .Expression
Review Comment:
Strictly speaking, renaming the keyword can break existing code that passes
`mask=` by name. We could also leave it as `mask` and only update the
documentation (the expression still _represents_ a mask anyway, so I would
say it is not a wrong name).
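A minimal, library-free sketch of the breakage in question. The two functions below are hypothetical stand-ins for the old and renamed signatures, not pyarrow code.

```python
# Hypothetical stand-in for the current signature.
def filter_old(mask, null_selection_behavior="drop"):
    return mask, null_selection_behavior


# Hypothetical stand-in for the signature after the rename.
def filter_renamed(mask_or_expr, null_selection_behavior="drop"):
    return mask_or_expr, null_selection_behavior


# Positional callers are unaffected by the rename.
positional_result = filter_renamed([True, False])

# Keyword callers work today...
keyword_result = filter_old(mask=[True, False])

# ...but fail once the parameter is renamed.
try:
    filter_renamed(mask=[True, False])
    error = None
except TypeError as exc:
    error = str(exc)
```

Keeping the name `mask` avoids this failure mode without needing a deprecation cycle.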