rdettai commented on a change in pull request #965:
URL: https://github.com/apache/arrow-datafusion/pull/965#discussion_r706865564
##########
File path: datafusion/src/physical_plan/limit.rs
##########
@@ -213,6 +234,30 @@ impl ExecutionPlan for LocalLimitExec {
}
}
}
+
+ fn statistics(&self) -> Statistics {
Review comment:
This is a "funny" case because the code is the same but not for the same
reasons. In particular, in the `LocalLimitExec` there is something very
unconfortable: If we know that the number of rows is greater than the limit, we
know that the limit operator might kick in on one of the partitions and thus
change the number of rows in the output. But the total number of rows might
still be greater than the limit as this is only applied locally at the
partition level. The right thing to do in that case would be to say that the
output statistics is inexact. But if we do so, we loose too much information,
and the `GlobalLimitExec` that comes right after it will not be able to
conclude what it should: that the total number of row is exceeding the limit
and thus its output will be *exactly* the limit value.
This is one case where having an `interval` kind of inexact statistics would
be very valuable (see
https://github.com/apache/arrow-datafusion/pull/965#issuecomment-917665454).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]