nealrichardson commented on code in PR #33770:
URL: https://github.com/apache/arrow/pull/33770#discussion_r1081612013
##########
cpp/src/arrow/dataset/scanner.cc:
##########
@@ -135,20 +136,19 @@ Result<std::shared_ptr<Schema>>
GetProjectedSchemaFromExpression(
const std::shared_ptr<Schema>& dataset_schema) {
// process resultant dataset_schema after projection
FieldVector project_fields;
+ std::set<std::string> field_names;
Review Comment:
I used a set here because my R test was failing: it generated duplicated
fields in the schema, since the projection expression included the nested
field in two different places. Maybe `->arguments` does deduplication, so this
wasn't a problem with non-nested refs. But I don't know if a set is the right
choice: if someone cares about field order, that gets lost. Is there a better
way? What do you think @westonpace? (I haven't run the C++ tests yet, so there
may be order-dependent tests that fail.)
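
For the ordering concern, here is a minimal sketch of deduplicating names while
keeping first-seen order. It uses only the standard library;
`DedupPreservingOrder` is a hypothetical helper name, not something in Arrow,
and it assumes fields can be identified by name alone:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of Arrow): drop repeated names while keeping
// each name at the position of its first occurrence.
std::vector<std::string> DedupPreservingOrder(
    const std::vector<std::string>& names) {
  std::unordered_set<std::string> seen;  // used for membership checks only
  std::vector<std::string> out;          // carries the output order
  out.reserve(names.size());
  for (const auto& name : names) {
    // insert().second is true only the first time a given name is inserted
    if (seen.insert(name).second) {
      out.push_back(name);
    }
  }
  return out;
}
```

The set never determines the output order here; it only answers "have I seen
this name already?", so the result follows the order of the input.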
Also, this function seems like a natural place to use `FieldsInExpression`
(from expression.cc). Is there a reason it wasn't used here? It wouldn't solve
the duplication issue, because you could still have two nested field refs
pointing to different fields within the same top-level struct, but it would let
you assume that everything you're iterating over is a FieldRef.
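
To illustrate why collecting refs alone would not remove the duplicates, here
is a rough standalone sketch. `NamePath` and `TopLevelNames` are hypothetical
stand-ins (a nested ref modeled as a list of names from the top-level column
down to the leaf); this is not Arrow's `FieldRef` API:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a nested field reference: the names from the
// top-level column down to the leaf, e.g. {"point", "x"}.
using NamePath = std::vector<std::string>;

// Return the top-level column each path starts at, once each, in first-seen
// order. Two distinct nested refs into the same struct share one entry.
std::vector<std::string> TopLevelNames(const std::vector<NamePath>& refs) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> out;
  for (const auto& path : refs) {
    if (path.empty()) continue;  // ignore malformed empty paths
    if (seen.insert(path.front()).second) {
      out.push_back(path.front());
    }
  }
  return out;
}
```

With refs `{"point", "x"}`, `{"id"}`, and `{"point", "y"}`, the two `point`
refs are different refs but resolve to the same top-level field, so some
deduplication step is still needed after collecting them.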
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]