jorisvandenbossche commented on code in PR #13409:
URL: https://github.com/apache/arrow/pull/13409#discussion_r1034448378
##########
docs/source/python/compute.rst:
##########
@@ -368,8 +368,19 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
nums: [[6,8,10]]
chars: [["f","h","l"]]
-:class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table`
method
-passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset
documentation.
+:class:`.Dataset` can similarly be filtered with the :meth:`.Dataset.filter`
method.
+The method will return an instance of :class:`.FilteredDataset` which will
lazily
Review Comment:
```suggestion
The method will return an instance of :class:`.Dataset` which will lazily
```
##########
python/pyarrow/_dataset.pyx:
##########
@@ -433,7 +477,6 @@ cdef class Dataset(_Weakrefable):
use_threads=use_threads,
coalesce_keys=coalesce_keys,
output_type=InMemoryDataset)
-
Review Comment:
Small nit, but you can keep this blank line (PEP8 recommends 2 blank lines
between class definitions)
##########
python/pyarrow/includes/libarrow.pxd:
##########
@@ -2545,6 +2545,12 @@ cdef extern from "arrow/compute/exec/options.h"
namespace "arrow::compute" nogil
cdef cppclass CExecNodeOptions "arrow::compute::ExecNodeOptions":
pass
+ cdef cppclass CSourceNodeOptions
"arrow::compute::SourceNodeOptions"(CExecNodeOptions):
+ @staticmethod
+ CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader(
Review Comment:
Is this still used in the current version of the PR?
##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -89,35 +93,42 @@ cdef execplan(inputs, output_type, vector[CDeclaration]
plan, c_bool use_threads
# Create source nodes for each input
for ipt in inputs:
if isinstance(ipt, Table):
- node_factory = "table_source"
c_in_table = pyarrow_unwrap_table(ipt)
c_tablesourceopts = make_shared[CTableSourceNodeOptions](
c_in_table)
c_input_node_opts = static_pointer_cast[CExecNodeOptions,
CTableSourceNodeOptions](
c_tablesourceopts)
+
+ current_decl = CDeclaration(
+ tobytes("table_source"), no_c_inputs, c_input_node_opts)
elif isinstance(ipt, Dataset):
- node_factory = "scan"
c_in_dataset = (<Dataset>ipt).unwrap()
c_scanopts = make_shared[CScanNodeOptions](
- c_in_dataset, make_shared[CScanOptions]())
- deref(deref(c_scanopts).scan_options).use_threads = use_threads
+ c_in_dataset, Scanner._make_scan_options(ipt, {"use_threads":
use_threads}))
c_input_node_opts = static_pointer_cast[CExecNodeOptions,
CScanNodeOptions](
c_scanopts)
+
+ # Filters applied in CScanNodeOptions are "best effort" for the
scan node itself,
+ # so we always need to inject an additional Filter node to apply
them for real.
+ current_decl = CDeclaration(tobytes("filter"), no_c_inputs,
+ static_pointer_cast[CExecNodeOptions,
CFilterNodeOptions](
+ make_shared[CFilterNodeOptions](
+ deref(deref(c_scanopts).scan_options).filter)
+ )
+ )
+ current_decl.inputs.push_back(
+ CDeclaration.Input(CDeclaration(
+ tobytes("scan"), no_c_inputs, c_input_node_opts))
Review Comment:
I would naively expect the scan node to go first, followed by the filter
node. Or the order doesn't matter?
##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -89,35 +93,42 @@ cdef execplan(inputs, output_type, vector[CDeclaration]
plan, c_bool use_threads
# Create source nodes for each input
for ipt in inputs:
if isinstance(ipt, Table):
- node_factory = "table_source"
c_in_table = pyarrow_unwrap_table(ipt)
c_tablesourceopts = make_shared[CTableSourceNodeOptions](
c_in_table)
c_input_node_opts = static_pointer_cast[CExecNodeOptions,
CTableSourceNodeOptions](
c_tablesourceopts)
+
+ current_decl = CDeclaration(
+ tobytes("table_source"), no_c_inputs, c_input_node_opts)
elif isinstance(ipt, Dataset):
- node_factory = "scan"
c_in_dataset = (<Dataset>ipt).unwrap()
c_scanopts = make_shared[CScanNodeOptions](
- c_in_dataset, make_shared[CScanOptions]())
- deref(deref(c_scanopts).scan_options).use_threads = use_threads
+ c_in_dataset, Scanner._make_scan_options(ipt, {"use_threads":
use_threads}))
c_input_node_opts = static_pointer_cast[CExecNodeOptions,
CScanNodeOptions](
c_scanopts)
+
+ # Filters applied in CScanNodeOptions are "best effort" for the
scan node itself,
+ # so we always need to inject an additional Filter node to apply
them for real.
+ current_decl = CDeclaration(tobytes("filter"), no_c_inputs,
+ static_pointer_cast[CExecNodeOptions,
CFilterNodeOptions](
+ make_shared[CFilterNodeOptions](
+ deref(deref(c_scanopts).scan_options).filter)
+ )
+ )
Review Comment:
This is very hard to read in the current formatting. Could you see if you
can reformat this a bit (with more logical indentation), or otherwise maybe
assign the FilterNodeOptions first to a temp variable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]