andygrove commented on PR #4591: URL: https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4855144317
I ran `CometInMemoryCacheBenchmark` from this PR locally to see the numbers. Release build, Apple M3 Ultra, JDK 17, default Spark profile (3.5), 5M-row cached table.

**Repeated full scan** (`SELECT sum(id), sum(k), sum(v)`)

| Case | Best (ms) | Avg (ms) | Rate (M/s) | Relative |
|---|---|---|---|---|
| Comet cache disabled | 180 | 201 | 27.7 | 1.0X |
| Comet cache enabled | 121 | 128 | 41.3 | **1.5X** |

**Selective filter** (`WHERE id >= 4500000 AND id < 4750000`)

| Case | Best (ms) | Avg (ms) | Rate (M/s) | Relative |
|---|---|---|---|---|
| Comet cache disabled | 46 | 53 | 108.2 | 1.0X |
| Comet cache enabled | 42 | 48 | 117.9 | **1.1X** |

The full repeated scan is about 1.5x faster, which is the case this targets directly since it drops the `CometSparkColumnarToColumnar` conversion on every read.

The selective filter is only about 1.1x. That is the workload I'd expect to gain the most from the new stats-based pruning, so the small gap is a bit surprising. My guess is the filtered query spends most of its time in the aggregate rather than the scan, so the pruning win gets diluted. It might be worth a variant that isolates the scan (wider projection, more selective predicate, or larger row count) to show the pruning benefit more clearly.

Could you add these numbers (or your own run) to the PR description?
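As a rough illustration of the dilution guess above, here is a back-of-the-envelope sketch using Amdahl's law. It is not taken from the PR or the benchmark. The two rates come from the selective-filter table; the scan-only speedups are hypothetical inputs (the 20x case assumes the predicate keeps 250,000 of 5,000,000 rows and that pruning could skip every non-matching batch, which may not hold). The function names are made up for this example.

```python
# Amdahl-style estimate: if only the scan gets faster, how large a share of
# the baseline query time would the scan need to be to explain the observed
# end-to-end speedup? All scan speedups below are assumptions, not measurements.

def overall_speedup(scan_fraction: float, scan_speedup: float) -> float:
    """End-to-end speedup when only the scan portion is accelerated."""
    return 1.0 / ((1.0 - scan_fraction) + scan_fraction / scan_speedup)


def implied_scan_fraction(observed_speedup: float, scan_speedup: float) -> float:
    """Invert Amdahl's law: 1/S = (1 - f) + f/s  =>  f = (1 - 1/S) / (1 - 1/s)."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / scan_speedup)


# Rates (M/s) from the selective-filter table: cache enabled vs. disabled.
observed = 117.9 / 108.2

# Hypothetical scan-only speedups from pruning; 20x is an idealized upper
# bound for a predicate that keeps 5% of the rows.
for assumed_scan_speedup in (2.0, 5.0, 20.0):
    f = implied_scan_fraction(observed, assumed_scan_speedup)
    print(
        f"scan speedup {assumed_scan_speedup:>4.0f}x -> "
        f"scan share of baseline time ~ {f:.1%}"
    )
```

If the implied scan share comes out small for every plausible scan speedup, that is consistent with the aggregate dominating the filtered query, and it suggests a scan-isolating variant would show the pruning effect more directly. A profile of the actual run would be needed to confirm this.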
