alamb opened a new pull request, #9017:
URL: https://github.com/apache/arrow-datafusion/pull/9017

   ## Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/8791
   
   
   ## Rationale for this change
   
   Several DataFusion users (like IOx) store observability data, which is 
often characterized by low cardinality string data encoded as dictionaries. 
While the current parquet_filter pushdown benchmarks (TODO LINK) cover this 
case, we don't have an end-to-end test that does. 
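   
   For readers unfamiliar with the encoding, here is a minimal, std-only 
sketch of what "low cardinality string data encoded as dictionaries" means. 
It is an illustration only: the function name `dictionary_encode` and the 
sample `service` values are hypothetical, and this is not how Arrow's 
`DictionaryArray` or the data generator in this PR is implemented.

```rust
use std::collections::HashMap;

/// Dictionary-encode a string column: each distinct value is stored once,
/// and every row holds a small integer key pointing into that list.
/// (Hypothetical helper for illustration, not a DataFusion API.)
fn dictionary_encode(column: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, u32> = HashMap::new();
    let mut keys: Vec<u32> = Vec::with_capacity(column.len());

    for &s in column {
        let key = match index.get(s) {
            // Value seen before: reuse its key
            Some(&k) => k,
            // First occurrence: append to the dictionary and remember the key
            None => {
                let k = values.len() as u32;
                values.push(s.to_string());
                index.insert(s.to_string(), k);
                k
            }
        };
        keys.push(key);
    }
    (values, keys)
}

fn main() {
    // Made-up "service" column, as might appear in access logs
    let column = ["frontend", "frontend", "backend", "frontend", "backend"];
    let (values, keys) = dictionary_encode(&column);
    println!("values = {:?}", values);
    println!("keys   = {:?}", keys);
}
```

   With only a handful of distinct values across many rows, the dictionary 
stays tiny and the per-row cost is one integer key, which is why this shape of 
data is worth benchmarking on its own.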
   
   This has caused problems when we have made changes such as 
https://github.com/apache/arrow-datafusion/issues/7647 that should improve the 
performance of these queries: we had no reproducible way to measure the 
impact, and so couldn't evaluate whether the change was beneficial enough to 
warrant the additional code complexity.
   
   In addition, in systems such as IOx the data is very often sorted, and the 
sort order is quite important for performance. However, DataFusion's existing 
benchmark coverage does not include any pre-sorted data.
   
   ## What changes are included in this PR?
   1. Add a DataFusion-specific data set to model common patterns in 
timeseries data -- HTTP access logs / metrics and tracing data specifically. 
It uses the same generator that is used in several other parts of DataFusion.
   2. Add a XXX benchmark, along with several queries, to dfbench, runnable 
via `bench.sh`.
   
   
   ## Are these changes tested?
   
   All tests
   
   ## Are there any user-facing changes?
   No
   
   ## TODO
   - [ ] add ticket / extend to model logging data as well
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
