LuciferYang commented on code in PR #37843:
URL: https://github.com/apache/spark/pull/37843#discussion_r968216755


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Expression.java:
##########
@@ -44,7 +44,16 @@ public interface Expression {
    * List of fields or columns that are referenced by this expression.
    */
   default NamedReference[] references() {
-    return Arrays.stream(children()).map(e -> e.references())
-      .flatMap(Arrays::stream).distinct().toArray(NamedReference[]::new);
+    List<NamedReference> list = new ArrayList<>();
+    Set<NamedReference> uniqueValues = new HashSet<>();
+    for (Expression e : children()) {
+      NamedReference[] references = e.references();
+      for (NamedReference reference : references) {
+        if (uniqueValues.add(reference)) {
+          list.add(reference);
+        }
+      }
+    }
+    return list.toArray(new NamedReference[0]);

Review Comment:
   The GitHub Actions (GA) test results are as follows:
   
   - Java 8: `ArrayList` + `HashSet` is 5 ~ 10% faster than using a `HashSet` alone
   - Java 11: `ArrayList` + `HashSet` is 5 ~ 10% slower than using a `HashSet` alone
   - Java 17: `ArrayList` + `HashSet` is 5 ~ 10% slower than using a `HashSet` alone
   
   However, [2 tests failed](https://github.com/LuciferYang/spark/runs/8300340727?check_suite_focus=true) in `V2PredicateSuite`. The failures are related to the use of `uniqueValues.toArray()`:
   
   
https://github.com/apache/spark/blob/443eea97578c41870c343cdb88cf69bfdf27033a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2PredicateSuite.scala#L266
   
   
https://github.com/apache/spark/blob/443eea97578c41870c343cdb88cf69bfdf27033a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2PredicateSuite.scala#L290
   
   Judging from the intent of these assertions, the order of the returned references matters, and a `HashSet` does not guarantee iteration order. We may need more people to check this if we want to switch to `uniqueValues.toArray()`.
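
To illustrate the ordering concern, here is a minimal sketch. It uses plain strings instead of `NamedReference`, and the class and method names are hypothetical, not part of the PR. The first method mirrors the `ArrayList` + `HashSet` loop in the diff; the second shows `LinkedHashSet` as one possible alternative that also keeps first-seen order; the third is the `HashSet`-only variant, whose element order is unspecified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: strings stand in for NamedReference values.
class OrderedDedupSketch {

  // Same shape as the loop in the diff: the HashSet only answers
  // "seen before?", while the ArrayList records first-seen order.
  static String[] dedupWithListAndSet(String[][] groups) {
    List<String> list = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String[] group : groups) {
      for (String s : group) {
        if (seen.add(s)) {
          list.add(s);
        }
      }
    }
    return list.toArray(new String[0]);
  }

  // Alternative: LinkedHashSet iterates in insertion order.
  static String[] dedupWithLinkedHashSet(String[][] groups) {
    Set<String> unique = new LinkedHashSet<>();
    for (String[] group : groups) {
      unique.addAll(Arrays.asList(group));
    }
    return unique.toArray(new String[0]);
  }

  // HashSet only: duplicates are removed, but the order of the
  // resulting array is unspecified and must not be asserted on.
  static String[] dedupWithHashSetOnly(String[][] groups) {
    Set<String> unique = new HashSet<>();
    for (String[] group : groups) {
      unique.addAll(Arrays.asList(group));
    }
    return unique.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[][] groups = {{"c", "a"}, {"b", "a"}, {"c", "d"}};
    // Both ordered variants print [c, a, b, d].
    System.out.println(Arrays.toString(dedupWithListAndSet(groups)));
    System.out.println(Arrays.toString(dedupWithLinkedHashSet(groups)));
    // Same four elements, order not guaranteed.
    System.out.println(Arrays.toString(dedupWithHashSetOnly(groups)));
  }
}
```

If the assertions in `V2PredicateSuite` really do depend on order, only the first two variants would be safe; whether `LinkedHashSet` performs better or worse than the current loop has not been measured here.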



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

