Re: [PR] Improve documentation (and ASCII art) about streaming execution, and thread pools [datafusion]

via GitHub Thu, 14 Nov 2024 22:08:07 -0800


jonahgao commented on code in PR #13423:
URL: https://github.com/apache/datafusion/pull/13423#discussion_r1843129720



##########
datafusion/core/src/lib.rs:
##########
@@ -406,19 +406,172 @@
 //! [DataFusion paper submitted SIGMOD]: 
https://github.com/apache/datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf
 //! [implementors of `ExecutionPlan`]: 
https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#implementors
 //!
-//! ## Thread Scheduling
+//! ## Streaming Execution
 //!
-//! DataFusion incrementally computes output from a 
[`SendableRecordBatchStream`]
-//! with `target_partitions` threads. Parallelism is implementing using 
multiple
-//! [Tokio] [`task`]s, which are executed by threads managed by a tokio 
Runtime.
-//! While tokio is most commonly used
-//! for asynchronous network I/O, its combination of an efficient, 
work-stealing
-//! scheduler, first class compiler support for automatic continuation 
generation,
-//! and exceptional performance makes it a compelling choice for CPU intensive
-//! applications as well. This is explained in more detail in [Using 
Rustlang’s Async Tokio
-//! Runtime for CPU-Bound Tasks].
+//! DataFusion is a "streaming" query engine which means `ExecutionPlan`s 
incrementally
+//! read from their input(s) and compute output one [`RecordBatch`] at a time
+//! by continually polling [`SendableRecordBatchStream`]s. Output and
+//! intermediate `RecordBatch`s each have approximately `batch_size` rows,
+//! which amortizes per-batch overhead of execution.
+//!
+//! Note that certain operations, sometimes called "pipeline breakers",
+//! (for example full sorts or hash aggregations) are fundamentally non 
streaming and
+//! must read their input fully before producing **any** output. As much as 
possible,
+//! other operators read a single [`RecordBatch`] from their input to produce a
+//! single `RecordBatch` as output.
+//!
+//! For example, given this SQL query:
+//!
+//! ```sql
+//! SELECT date_trunc('month', time) FROM data WHERE id IN (10,20,30);
+//! ```
+//!
+//! The diagram below shows the call sequence when a consumer calls [`next()`] 
to
+//! get the next `RecordBatch` of output. While it is possible that some
+//! steps run a different threads, typically tokio will use the same thread

Review Comment:
   ```suggestion
   //! steps run on different threads, typically tokio will use the same thread
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Improve documentation (and ASCII art) about streaming execution, and thread pools [datafusion]

Reply via email to