alamb commented on code in PR #6134:
URL: https://github.com/apache/arrow-datafusion/pull/6134#discussion_r1180601182
##########
benchmarks/compare.py:
##########
@@ -64,14 +61,9 @@ def load_from(cls, data: Dict[str, Any]) -> QueryRun:
def execution_time(self) -> float:
assert len(self.iterations) >= 1
- # If we don't have enough samples, median() is probably
- # going to be a worse measure than just an average.
- if len(self.iterations) < MEAN_THRESHOLD:
- method = statistics.mean
- else:
- method = statistics.median
-
- return method(iteration.elapsed for iteration in self.iterations)
+ # Use minimum execution time to account for variations / other
+ # things the system was doing
+ return min(iteration.elapsed for iteration in self.iterations)
Review Comment:
I agree it is misleading -- in terms of measuring a change between
datafusion versions, I think `min` will give us the least variance between runs
and represent the best-case performance.
However, it doesn't really give a sense of how much variation there is across
runs 🤔
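
As an illustration of reporting both numbers, here is a minimal sketch (not part of this PR; the `spread` property and the sample timings are hypothetical) of keeping `min` as the headline figure while still surfacing run-to-run variation:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Iteration:
    elapsed: float  # wall-clock seconds for one run of the query


@dataclass
class QueryRun:
    iterations: List[Iteration]

    @property
    def execution_time(self) -> float:
        assert len(self.iterations) >= 1
        # min: lowest variance between benchmark runs, best-case performance
        return min(i.elapsed for i in self.iterations)

    @property
    def spread(self) -> float:
        # Relative spread (slowest / fastest - 1) as a rough indicator of
        # run-to-run variation; 0.0 means every iteration took the same time
        times = [i.elapsed for i in self.iterations]
        return max(times) / min(times) - 1.0


# Example: three iterations with slightly different timings
run = QueryRun(iterations=[Iteration(1.02), Iteration(0.98), Iteration(1.10)])
print(f"{run.execution_time:.2f}s (+{run.spread:.0%} spread)")  # 0.98s (+12% spread)
```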