lidavidm commented on a change in pull request #11704:
URL: https://github.com/apache/arrow/pull/11704#discussion_r750280185
##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -111,9 +111,26 @@ static inline Result<csv::ConvertOptions>
GetConvertOptions(
if (!scan_options) return convert_options;
- auto materialized = scan_options->MaterializedFields();
- std::unordered_set<std::string> materialized_fields(materialized.begin(),
- materialized.end());
+ auto field_refs = scan_options->MaterializedFields();
+ std::unordered_set<std::string> materialized_fields;
+ materialized_fields.reserve(field_refs.size());
+ // Preprocess field refs. We try to avoid FieldRef::GetFoo here since that's
+ // quadratic (and this is significant overhead with 1000+ columns)
+ for (const auto& ref : field_refs) {
+ if (const auto* name = ref.name()) {
+ // Common case
+ materialized_fields.emplace(*name);
+ continue;
+ }
+ // CSV doesn't really support nested types so do our best
Review comment:
This is the same behavior as IPC/ORC: if a nested field is selected,
load the entire top-level field and let the projection take care of selecting
the child field later. By "do our best" I just mean that CSV doesn't support
reading any nested types in the first place so there's nothing to test/this
branch should never really get hit, but I'll clarify what's going on here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]