jtuglu1 opened a new issue, #19039: URL: https://github.com/apache/druid/issues/19039
### Motivation

I'm keeping a running list of Druid modernization ideas that other engines in the space have adopted or are in the process of adopting. Please feel free to comment or add your own ideas.

1. Query Latency/Speed

Without hardware-native implementations of its physical operators, Druid lags behind engines whose query processing code paths use SIMD instructions, pipelining, and other native-code speed-ups. Another factor in this conversation is avoiding garbage collection in high-allocation and spilling scenarios. Given JDK 22's support for FFI, I wonder whether it makes sense to implement these accelerations in Rust and plug them into the existing Druid query processing path. The realtime streaming path could also benefit, since things like GC spikes can hurt p99 ingest throughput and increase query latencies. This kind of split between parsing/planning and processing/execution is already being adopted by projects like https://github.com/facebookincubator/velox and https://github.com/StarRocks/starrocks.

2. Data ETL in/out of Druid

The Druid segment format, while hyper-optimized for workloads within Druid, offers little value to external ETL services (Spark, etc.) and to the data manipulation libraries that practitioners are familiar with (pandas, Arrow, etc.). I think it would be a good idea to add Apache Arrow reader/writer support for Druid segments, which would allow any third-party system that speaks Arrow to integrate with Druid. It would also open a path to switching the internal data transfer path (peon/historical -> broker -> router -> client) from JSON to Arrow, which could potentially speed up queries significantly.

3. CBO for MSQE

Currently, query execution is split between the native engine and MSQE. As MSQE matures, I believe the plan is to deprecate the native engine. To be competitive with engines like StarRocks that have cost-based optimization (CBO) and statistics-based query planning, I think it would be a good idea to add this to MSQE. This would involve tracking things like query-level and datasource-level statistics.
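
To make the statistics-based planning idea in item 3 concrete, here is a minimal sketch of how a cost-based optimizer might use per-datasource row counts to pick a join order. It is purely illustrative and is not how Druid or MSQE plans queries today: `TableStats`, `join_cost`, `cheapest_order`, the uniform join selectivity, and the example row counts are all hypothetical assumptions made for this sketch.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-datasource statistics a planner might track."""
    name: str
    row_count: int


def join_cost(order, selectivity):
    """Estimate the cost of a left-deep join order.

    Cost is modeled as the sum of estimated intermediate result sizes,
    assuming (unrealistically) one uniform selectivity for every join.
    """
    rows = order[0].row_count
    cost = 0.0
    for right in order[1:]:
        rows = rows * right.row_count * selectivity
        cost += rows
    return cost


def cheapest_order(tables, selectivity=0.001):
    """Exhaustively try every join order and return the cheapest one.

    Exhaustive search is only viable for a handful of tables; real
    optimizers use dynamic programming or heuristics instead.
    """
    return min(permutations(tables), key=lambda o: join_cost(o, selectivity))


if __name__ == "__main__":
    # Made-up row counts, purely for illustration.
    tables = [
        TableStats("events", 1_000_000),
        TableStats("users", 10_000),
        TableStats("countries", 200),
    ]
    best = cheapest_order(tables)
    print(" JOIN ".join(t.name for t in best))
```

Under this toy cost model the planner joins the two smallest datasources first and leaves the largest for last, which is the kind of decision that is not possible without tracked statistics.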
