sergiimk commented on code in PR #10033:
URL:
https://github.com/apache/arrow-datafusion/pull/10033#discussion_r1569902481
##########
datafusion/core/src/execution/context/mod.rs:
##########
@@ -471,24 +471,37 @@ impl SessionContext {
/// [`SQLOptions::verify_plan`].
pub async fn execute_logical_plan(&self, plan: LogicalPlan) ->
Result<DataFrame> {
match plan {
- LogicalPlan::Ddl(ddl) => match ddl {
- DdlStatement::CreateExternalTable(cmd) => {
- self.create_external_table(&cmd).await
- }
- DdlStatement::CreateMemoryTable(cmd) => {
- self.create_memory_table(cmd).await
- }
- DdlStatement::CreateView(cmd) => self.create_view(cmd).await,
- DdlStatement::CreateCatalogSchema(cmd) => {
- self.create_catalog_schema(cmd).await
+ LogicalPlan::Ddl(ddl) => {
+ // Box::pin avoids allocating the stack space within this
function's frame
+ // for every one of these individual async functions,
decreasing the risk of
+ // stack overflows.
+ match ddl {
+ DdlStatement::CreateExternalTable(cmd) => {
+ Box::pin(async move {
self.create_external_table(&cmd).await })
+ as std::pin::Pin<Box<dyn futures::Future<Output =
_> + Send>>
Review Comment:
To be honest I didn't have time to generate disassembly to confirm exactly
what Rust+LLVM is doing, so take my conclusions with a grain of salt.
With `large_futures` lint enabled you unfortunately get warnings for
**top-level** `await`s, not inner futures where the problem starts. So if you
try the lint before this fix you'll see thousands of warnings...
I started navigating the calls up the chain to see the similarities between
them, and this lead me to this function. As soon as I `Box`ed
`execute_logical_plan` call - all warnings disappeared.
Coming from languages like C/C++ my understanding is that a function will
typically increment stack pointer to fit all of its local variables (although
afaik this is not strictly specified and left up to compiler). My intuition
thus was that every `await` in this function creates a local variable to store
Future's state and this makes stack frame really big.
I started `Box`ing individual calls and saw the size of futures in lint
errors progressively go down - this seemed consistent with my theory.
After that I though that instead of boxing and awaiting individual futures I
could make different branches return a `Box<dyn Future>` and await for in in
one place. Less verbose code at the cost of dynamic dispatch.
`Box::pin` is of course necessary because tokio cannot tolerate Future state
to be moved between awaits.
And `async move { self.create_external_table(&cmd).await }` is an
unfortunate ugliness to make `cmd` live as long as the future. This is not
necessary anywhere else because other calls take elements of the plan by value,
only `create_external_table` takes a reference (and I didn't want to modify
public API).
I did second-guess this change a lot ... as I expected Rust to allocate
enough stack space not for all futures, but for **the biggest** future out of
all, as only one of them will actually be called. But then I don't know why
would memory go down progressively with every future that I boxed.
So **based purely on clippy stats** this change did help a lot... but I'd be
glad to be proven incorrect if my conclusions were wrong.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]