alamb opened a new pull request, #9017: URL: https://github.com/apache/arrow-datafusion/pull/9017
## Which issue does this PR close?

Closes https://github.com/apache/arrow-datafusion/issues/8791

## Rationale for this change

Several DataFusion users (such as IOx) store observability data, which is often characterized by low-cardinality string data encoded as dictionaries. While the current parquet_filter pushdown benchmarks (TODO LINK) cover this example, we don't have an end-to-end test that does. This has caused problems when we have made changes such as https://github.com/apache/arrow-datafusion/issues/7647 that should improve the performance of these queries: we had no reproducible way to measure the impact, and couldn't evaluate whether the change was beneficial enough to warrant the additional code complexity.

In systems such as IOx, the data is very often sorted, and the sort order is quite important for performance. However, DataFusion's existing benchmark coverage does not include any pre-sorted data.

## What changes are included in this PR?

1. Add a DataFusion-specific data set to model common patterns in timeseries data, specifically HTTP access logs / metrics and tracing data. This uses the same generator as used in several other parts of DataFusion
2. Add a XXX benchmark to dfbench, runnable by `bench.sh`, along with several queries

## Are these changes tested?

All tests

## Are there any user-facing changes?

No

## TODO

- [ ] add ticket / extend to model logging data as well
