lidavidm commented on a change in pull request #11704:
URL: https://github.com/apache/arrow/pull/11704#discussion_r750280185



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -111,9 +111,26 @@ static inline Result<csv::ConvertOptions> 
GetConvertOptions(
 
   if (!scan_options) return convert_options;
 
-  auto materialized = scan_options->MaterializedFields();
-  std::unordered_set<std::string> materialized_fields(materialized.begin(),
-                                                      materialized.end());
+  auto field_refs = scan_options->MaterializedFields();
+  std::unordered_set<std::string> materialized_fields;
+  materialized_fields.reserve(field_refs.size());
+  // Preprocess field refs. We try to avoid FieldRef::GetFoo here since that's
+  // quadratic (and this is significant overhead with 1000+ columns)
+  for (const auto& ref : field_refs) {
+    if (const auto* name = ref.name()) {
+      // Common case
+      materialized_fields.emplace(*name);
+      continue;
+    }
+    // CSV doesn't really support nested types so do our best

Review comment:
       This is the same behavior as IPC/ORC: if a nested field is selected, 
load the entire top-level field and let the projection take care of selecting 
the child field later. By "do our best" I just mean that CSV doesn't support 
reading any nested types in the first place so there's nothing to test/this 
branch should never really get hit, but I'll clarify what's going on here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to