[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13409: ARROW-16616: [Python] Lazy datasets filtering

GitBox Tue, 29 Nov 2022 01:28:57 -0800


jorisvandenbossche commented on code in PR #13409:
URL: https://github.com/apache/arrow/pull/13409#discussion_r1034448378



##########
docs/source/python/compute.rst:
##########
@@ -368,8 +368,19 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
    nums: [[6,8,10]]
    chars: [["f","h","l"]]
 
-:class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` 
method
-passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset 
documentation.
+:class:`.Dataset` can similarly be filtered with the :meth:`.Dataset.filter` 
method.
+The method will return an instance of :class:`.FilteredDataset` which will 
lazily

Review Comment:
   ```suggestion
   The method will return an instance of :class:`.Dataset` which will lazily
   ```



##########
python/pyarrow/_dataset.pyx:
##########
@@ -433,7 +477,6 @@ cdef class Dataset(_Weakrefable):
                                               use_threads=use_threads, 
coalesce_keys=coalesce_keys,
                                               output_type=InMemoryDataset)
 
-

Review Comment:
   Small nit, but you can keep this blank line (PEP8 recommends 2 blank lines 
between class definitions)



##########
python/pyarrow/includes/libarrow.pxd:
##########
@@ -2545,6 +2545,12 @@ cdef extern from "arrow/compute/exec/options.h" 
namespace "arrow::compute" nogil
     cdef cppclass CExecNodeOptions "arrow::compute::ExecNodeOptions":
         pass
 
+    cdef cppclass CSourceNodeOptions 
"arrow::compute::SourceNodeOptions"(CExecNodeOptions):
+        @staticmethod
+        CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader(

Review Comment:
   Is this still used in the current version of the PR?



##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -89,35 +93,42 @@ cdef execplan(inputs, output_type, vector[CDeclaration] 
plan, c_bool use_threads
     # Create source nodes for each input
     for ipt in inputs:
         if isinstance(ipt, Table):
-            node_factory = "table_source"
             c_in_table = pyarrow_unwrap_table(ipt)
             c_tablesourceopts = make_shared[CTableSourceNodeOptions](
                 c_in_table)
             c_input_node_opts = static_pointer_cast[CExecNodeOptions, 
CTableSourceNodeOptions](
                 c_tablesourceopts)
+
+            current_decl = CDeclaration(
+                tobytes("table_source"), no_c_inputs, c_input_node_opts)
         elif isinstance(ipt, Dataset):
-            node_factory = "scan"
             c_in_dataset = (<Dataset>ipt).unwrap()
             c_scanopts = make_shared[CScanNodeOptions](
-                c_in_dataset, make_shared[CScanOptions]())
-            deref(deref(c_scanopts).scan_options).use_threads = use_threads
+                c_in_dataset, Scanner._make_scan_options(ipt, {"use_threads": 
use_threads}))
             c_input_node_opts = static_pointer_cast[CExecNodeOptions, 
CScanNodeOptions](
                 c_scanopts)
+
+            # Filters applied in CScanNodeOptions are "best effort" for the 
scan node itself,
+            # so we always need to inject an additional Filter node to apply 
them for real.
+            current_decl = CDeclaration(tobytes("filter"), no_c_inputs,
+                                        static_pointer_cast[CExecNodeOptions, 
CFilterNodeOptions](
+                make_shared[CFilterNodeOptions](
+                    deref(deref(c_scanopts).scan_options).filter)
+            )
+            )
+            current_decl.inputs.push_back(
+                CDeclaration.Input(CDeclaration(
+                    tobytes("scan"), no_c_inputs, c_input_node_opts))

Review Comment:
   I would naively expect the scan node to go first, followed by the filter 
node. Or the order doesn't matter?



##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -89,35 +93,42 @@ cdef execplan(inputs, output_type, vector[CDeclaration] 
plan, c_bool use_threads
     # Create source nodes for each input
     for ipt in inputs:
         if isinstance(ipt, Table):
-            node_factory = "table_source"
             c_in_table = pyarrow_unwrap_table(ipt)
             c_tablesourceopts = make_shared[CTableSourceNodeOptions](
                 c_in_table)
             c_input_node_opts = static_pointer_cast[CExecNodeOptions, 
CTableSourceNodeOptions](
                 c_tablesourceopts)
+
+            current_decl = CDeclaration(
+                tobytes("table_source"), no_c_inputs, c_input_node_opts)
         elif isinstance(ipt, Dataset):
-            node_factory = "scan"
             c_in_dataset = (<Dataset>ipt).unwrap()
             c_scanopts = make_shared[CScanNodeOptions](
-                c_in_dataset, make_shared[CScanOptions]())
-            deref(deref(c_scanopts).scan_options).use_threads = use_threads
+                c_in_dataset, Scanner._make_scan_options(ipt, {"use_threads": 
use_threads}))
             c_input_node_opts = static_pointer_cast[CExecNodeOptions, 
CScanNodeOptions](
                 c_scanopts)
+
+            # Filters applied in CScanNodeOptions are "best effort" for the 
scan node itself,
+            # so we always need to inject an additional Filter node to apply 
them for real.
+            current_decl = CDeclaration(tobytes("filter"), no_c_inputs,
+                                        static_pointer_cast[CExecNodeOptions, 
CFilterNodeOptions](
+                make_shared[CFilterNodeOptions](
+                    deref(deref(c_scanopts).scan_options).filter)
+            )
+            )

Review Comment:
   This is very hard to read in the current formatting. Could you see if you 
can reformat this a bit (with more logical indentation), or otherwise maybe 
assign the FilterNodeOptions first to a temp variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13409: ARROW-16616: [Python] Lazy datasets filtering

Reply via email to