pchintar commented on PR #4591: URL: https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4811863625
Thanks @andygrove for your suggestions. I've updated the PR with all the changes we discussed except the microbenchmarks:

* The cache serializer now computes per-batch column statistics (lower bound, upper bound, null count, and row count) in the format expected by Spark's `SimpleMetricsCachedBatchSerializer`, so Spark's existing `buildFilter` can prune cached batches before they are decoded.
* Added regression tests that verify the statistics are generated correctly, that `buildFilter` prunes matching batches as expected, and that filtered queries continue to use the native Comet in-memory cache scan path.
* Updated the plugin so it only installs the Comet cache serializer when the in-memory cache feature is enabled, while leaving any user-configured cache serializer unchanged.

I've also rerun the full test suite for these changes, and everything is passing.

Regarding the microbenchmarks: I haven't put together Spark microbenchmarks before, so I wanted to ask a couple of questions before I start:

* Would you prefer extending an existing benchmark in the repo, or adding a new benchmark specifically for the in-memory cache path?
* Which workload would be the most useful to measure here? For example, repeated filtered scans over a cached table, or something else?
* Are there any particular metrics you'd like to compare (execution time, throughput, cache scan time, etc.)?

I can work on those next once I know what would be most useful for evaluating this PR.

Thanks again!

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
