jtuglu1 opened a new issue, #19039:
URL: https://github.com/apache/druid/issues/19039

   ### Motivation
   
   I'm keeping a running list of Druid modernization ideas that other 
engines in the space have adopted or are adopting. Please feel free to 
comment or add your own ideas.
   
   1. Query Latency/Speed
   
   Without hardware-native implementations of physical operators, Druid lags 
behind other engines that implement query-processing code paths using SIMD 
instructions, pipelining, and other native-code speed-ups. Another factor in 
this conversation is avoiding garbage collection in high-allocation/spilling 
scenarios. Given JDK 22's support for FFI (the Foreign Function & Memory API, 
finalized in JEP 454), I wonder if it makes sense to consider implementing 
these accelerations in Rust and plugging them into the existing Druid query 
processing path. The realtime streaming path could also benefit from these 
changes, since GC spikes can hurt tail (p99) ingest throughput and increase 
query latencies.
   
   This kind of split between parsing/planning on one side and native 
processing/execution on the other is already being adopted by projects like 
https://github.com/facebookincubator/velox and 
https://github.com/StarRocks/starrocks.
   
   2. Data ETL in/out of Druid
   
   The Druid segment format, while hyper-optimized for workloads within Druid, 
offers little value to external ETL services (Spark, etc.) and to the data 
manipulation libraries that practitioners are familiar with (pandas, Arrow, 
etc.). I think it would be a good idea to add Apache Arrow reader/writer 
support for Druid segments, which would allow any third-party system that 
speaks Arrow to integrate with Druid. It would also open up a path for 
switching the internal data transfer path (peon/historical -> broker -> router 
-> client) from JSON to Arrow, which could potentially speed up queries 
significantly.
   
   3. CBO for MSQE
   
   Currently the querying setup is split between the native engine and MSQE. 
As MSQE matures, I believe the plan is to deprecate the native engine. To be 
competitive with other engines such as StarRocks that have cost-based 
optimization (CBO) driven by statistics, I think it would be a good idea to 
add this to MSQE. This would involve tracking things like query-level and 
datasource-level statistics.
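
   As a rough illustration of what statistics-based planning could decide, 
here is a minimal Python sketch that picks a join algorithm from estimated 
input sizes. Everything in it is hypothetical and not an existing Druid API: 
the `TableStats` record, the `choose_join_algorithm` function, and the 
`broadcast_limit_bytes` threshold are made up for this example. Only the two 
algorithm names are borrowed from the values of the MSQ `sqlJoinAlgorithm` 
context parameter.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-datasource statistics a CBO might track."""
    name: str
    row_count: int
    avg_row_bytes: int

    @property
    def est_bytes(self) -> int:
        # Crude size estimate: rows times average row width.
        return self.row_count * self.avg_row_bytes


def choose_join_algorithm(
    left: TableStats,
    right: TableStats,
    broadcast_limit_bytes: int,
) -> Tuple[str, Optional[str]]:
    """Return (algorithm, name of the input to broadcast, or None).

    Assumption: if the smaller input fits under the limit, broadcasting it
    is cheaper than shuffling and sorting both inputs.
    """
    smaller = left if left.est_bytes <= right.est_bytes else right
    if smaller.est_bytes <= broadcast_limit_bytes:
        return ("broadcast", smaller.name)
    return ("sortMerge", None)


facts = TableStats("facts", row_count=10_000_000, avg_row_bytes=200)
dim = TableStats("dim", row_count=1_000, avg_row_bytes=100)

print(choose_join_algorithm(facts, dim, broadcast_limit_bytes=1_000_000))
# -> ('broadcast', 'dim')
print(choose_join_algorithm(facts, facts, broadcast_limit_bytes=1_000_000))
# -> ('sortMerge', None)
```

   A real optimizer would also need column-level statistics (cardinality, 
null counts, value distributions) and a cost model covering shuffle and 
spill, so treat this only as a sketch of the decision shape.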


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

