nealrichardson commented on code in PR #33770:
URL: https://github.com/apache/arrow/pull/33770#discussion_r1081612013


##########
cpp/src/arrow/dataset/scanner.cc:
##########
@@ -135,20 +136,19 @@ Result<std::shared_ptr<Schema>> GetProjectedSchemaFromExpression(
     const std::shared_ptr<Schema>& dataset_schema) {
   // process resultant dataset_schema after projection
   FieldVector project_fields;
+  std::set<std::string> field_names;

Review Comment:
   I used a set here because my R test failed: the projected schema contained 
duplicated fields, because the projection expression included the nested 
field in two different places. Maybe `->arguments` does deduplication, so this 
wasn't a problem with non-nested refs. But I don't know if this is the right 
choice, whether order gets lost and someone cares about that, or if there's a 
better way. What do you think, @westonpace? (I haven't run the C++ tests yet, 
so there may be order-dependent tests that fail.)
   
   Also, this function seems like a natural place to use `FieldsInExpression` 
(from expression.cc). Is there a reason it wasn't used here? It wouldn't solve 
the duplication issue, because you could still have two nested field refs 
pointing to different fields within the same top-level struct, but it would let 
you assume that everything you're iterating over is a FieldRef.
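
To make the ordering concern concrete, here is a minimal standalone sketch, not the code in this PR. It assumes the set is used only as a "seen" filter next to an ordered vector, so first-seen order survives deduplication. `FieldPath` and `TopLevelFieldNames` are hypothetical names for illustration, and a field ref is modeled as a plain path of names rather than an Arrow `FieldRef`.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a (possibly nested) field reference:
// {"s", "x"} means field "x" inside the top-level struct "s".
using FieldPath = std::vector<std::string>;

// Returns the top-level field names touched by `refs`, keeping the order in
// which each name is first seen. Two nested refs into the same struct yield
// that struct's name only once.
std::vector<std::string> TopLevelFieldNames(const std::vector<FieldPath>& refs) {
  std::vector<std::string> ordered_names;
  std::set<std::string> seen;  // membership filter only; never iterated
  for (const FieldPath& ref : refs) {
    if (ref.empty()) continue;  // nothing to project for an empty path
    // insert().second is true only if the name was not already in the set.
    if (seen.insert(ref.front()).second) {
      ordered_names.push_back(ref.front());
    }
  }
  return ordered_names;
}
```

With this shape, order would only be lost if the output were built by iterating the set itself, since `std::set` iterates in sorted order.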



