lcspinter commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r618093086
##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() {
     return stats;
   }

+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Set<String> filterColumns = Sets.newHashSet();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        // While Iceberg could handle multiple columns, the current pruning is only able to handle filters for a
+        // single column. We keep the logic below to handle multiple columns so that if pruning becomes available
+        // on the executor side we can easily adapt to it as well.
+        if (filterColumns.size() > 1) {
Review comment:
We collect every column name into the `filterColumns` set via the `columns()` method. That method traverses every node recursively, so it might be time-consuming. Only afterwards is the size of the set checked, and if it's greater than 1 we return false.
Could we introduce some logic to fail fast, without having to traverse every node? I'm just thinking aloud; I don't know whether it's feasible.
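As a minimal sketch of what such a fail-fast check could look like (the class and method names below are hypothetical; only `ExprNodeDesc` and `ExprNodeColumnDesc` are the real Hive types already referenced in the diff), the walk could carry the first column name it sees and abort as soon as a second distinct one shows up, instead of collecting the full set first:

```java
import java.util.List;
import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

public final class FailFastColumnCheck {

  private FailFastColumnCheck() {
  }

  /**
   * Hypothetical fail-fast alternative to columns(): returns the single column
   * name referenced by the predicate, or null if the predicate references no
   * column or more than one distinct column.
   */
  public static String singleFilterColumn(ExprNodeDesc predicate) {
    String[] seen = new String[] {null};
    return walk(predicate, seen) ? seen[0] : null;
  }

  // Depth-first walk; returns false as soon as a second distinct column is
  // found, which short-circuits the traversal of all remaining subtrees.
  private static boolean walk(ExprNodeDesc node, String[] seen) {
    if (node instanceof ExprNodeColumnDesc) {
      String column = ((ExprNodeColumnDesc) node).getColumn();
      if (seen[0] == null) {
        seen[0] = column;
      } else if (!seen[0].equals(column)) {
        return false; // second distinct column: fail fast
      }
    }
    List<ExprNodeDesc> children = node.getChildren(); // may be null for leaves
    if (children != null) {
      for (ExprNodeDesc child : children) {
        if (!walk(child, seen)) {
          return false;
        }
      }
    }
    return true;
  }
}
```

Note this only saves work for predicates that actually reference multiple columns; in the single-column case it still visits every node, so whether it is a win depends on how common multi-column synthetic filters are in practice.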
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]