westonpace commented on code in PR #14663: URL: https://github.com/apache/arrow/pull/14663#discussion_r1029893649
########## cpp/src/arrow/dataset/scanner.h: ########## @@ -243,10 +250,17 @@ struct ARROW_DS_EXPORT ScanV2Options : public compute::ExecNodeOptions { /// one fragment at a time. int32_t fragment_readahead = kDefaultFragmentReadahead; /// \brief Options specific to the file format - FragmentScanOptions* format_options; + const FragmentScanOptions* format_options = NULLPTR; /// \brief Utility method to get a selection representing all columns in a dataset - static std::vector<FieldPath> AllColumns(const Dataset& dataset); + static std::vector<FieldPath> AllColumns(const Schema& dataset_schema); + + /// \brief Utility method to add fields needed for the current filter + /// + /// This method adds any fields that are needed by `filter` which are not already + /// included in the list of columns. Any new fields added will be added to the end + /// in no particular order. + static Status AddFieldsNeededForFilter(ScanV2Options* options); Review Comment: There is a (potentially rare) possibility that a format could fully satisfy the best effort filter. In that case, if the field is only needed for filtering, we shouldn't load it. Although, in that case, @jacques-n has convinced me that we might want two filter fields, "best-effort filter" which does not have to be satisfied (and thus should always be a selected column) and `filter` which MUST be satisfied (and the format should error if it cannot, which doesn't have to be in the columns list. Either way, the main reason I'm not automagically including this is because I am trying to move away from the concept of a scan node automagically adding more columns than the user asked for. I will be doing something similar in ARROW-16072 so that the __filename, etc. fields are only included if asked for. The goal is that, a scan request with X columns will emit batches with X columns. > Why isn't this just a regular method? I could be convinced otherwise. In my head all exec node options objects are PODs and I never know whether a POD should have methods or not. I think, however, it would be fine to make an immutable ``` Result<ScanV2Options> AddFIeldsNeededForFilter() const ``` Would that be preferred? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org