westonpace commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1029893649


##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -243,10 +250,17 @@ struct ARROW_DS_EXPORT ScanV2Options : public 
compute::ExecNodeOptions {
   /// one fragment at a time.
   int32_t fragment_readahead = kDefaultFragmentReadahead;
   /// \brief Options specific to the file format
-  FragmentScanOptions* format_options;
+  const FragmentScanOptions* format_options = NULLPTR;
 
   /// \brief Utility method to get a selection representing all columns in a 
dataset
-  static std::vector<FieldPath> AllColumns(const Dataset& dataset);
+  static std::vector<FieldPath> AllColumns(const Schema& dataset_schema);
+
+  /// \brief Utility method to add fields needed for the current filter
+  ///
+  /// This method adds any fields that are needed by `filter` which are not 
already
+  /// included in the list of columns.  Any new fields added will be added to 
the end
+  /// in no particular order.
+  static Status AddFieldsNeededForFilter(ScanV2Options* options);

Review Comment:
   There is a (potentially rare) possibility that a format could fully satisfy 
the best effort filter.  In that case, if the field is only needed for 
filtering, we shouldn't load it.
   
   Although, in that case, @jacques-n has convinced me that we might want two 
filter fields, "best-effort filter" which does not have to be satisfied (and 
thus should always be a selected column) and `filter` which MUST be satisfied 
(and the format should error if it cannot, which doesn't have to be in the 
columns list.
   
   Either way, the main reason I'm not automagically including this is because 
I am trying to move away from the concept of a scan node automagically adding 
more columns than the user asked for.  I will be doing something similar in 
ARROW-16072 so that the __filename, etc. fields are only included if asked for.
   
   The goal is that, a scan request with X columns will emit batches with X 
columns.
   
   > Why isn't this just a regular method?
   
   I could be convinced otherwise.  In my head all exec node options objects 
are PODs and I never know whether a POD should have methods or not.  I think, 
however, it would be fine to make an immutable 
   
   ```
   Result<ScanV2Options> AddFIeldsNeededForFilter() const
   ```
   
   Would that be preferred?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to