Hi Yong,

Nice topic, thank you for bringing it up. To answer your first question:
> My question is whether we have plan to support parallel mode in future, and
> when if it is in the feature list?

We had some discussions [1] about introducing Velox parallel execution into the Gluten Velox backend, though no progress has been made yet. The main reason it didn't go forward is that adopting Velox parallel execution could take a huge effort, while the benefit of doing so is uncertain.

As we know, query execution in vanilla Spark is already parallelized, through the in-thread iterator execution model plus shuffle (re-parallelization). So this is essentially about moving from one parallel execution strategy to another. Nowadays there are also plenty of debates around the pull model vs. the push model in the database area, which is similar to the serial vs. parallel comparison we are talking about here. From my limited perspective, these debates haven't led to a clear conclusion either. With Velox's parallel execution, the query plan can be broken into even smaller pipelines, so based on push-model theory there might be a chance that query execution achieves a better resource utilization rate. But integrating that model with Spark is another story, because it would start with removing a bunch of non-trivial engineering blockers.

A better reason for switching to the parallel model could be that it is more Velox-native than the serial model we are currently using, since Meta developed Velox to replace Presto's parallel executor from the very beginning. However, over time the serial model in Velox has also gained serious usage, including Gluten's own use case.

Hence, I think proposals like yours are not invalid, but the community doesn't have a specific plan so far. Research or PoCs are definitely welcome if anyone is interested. Moreover, IIUC, some folks in the community made attempts around a similar topic for the CH backend; they may also be able to give some input here.
Hongze

[1] https://github.com/apache/incubator-gluten/issues/7810

On Tue, May 13, 2025 at 1:49 AM YONG <[email protected]> wrote:
>
> Sorry. Correct my typo issue below. We use Task::next() now, but not
> Task::start().
>
>
> At 2025-05-13 08:43:45, "YONG" <[email protected]> wrote:
>
> Hi all,
>
> Happy to be here! I am a newbie in Spark and Gluten, and have two questions
> about Gluten to ask.
>
> The first question is about the task's execution mode in Gluten.
>
> From Velox's source code, it seems that Velox can support two execution
> modes [velox/exec/Task.h enum class ExecutionMode]:
>   Serial Execution Mode: uses a single thread to process the task, and
>   the API is Task::next()
>   Parallel Execution Mode: uses multiple threads to process the task,
>   and the API is Task::start()
>
> In Gluten's code [WholeStageResultIterator::next()], we only use Velox's
> serial execution mode [Task::next()] now.
> I guess maybe Velox was developed by Meta to replace Presto's engine at
> first, and a Presto task can be run in multiple threads. But in Spark, the
> task should be run in a single thread, corresponding to one core in one
> executor. I am not sure about the effort to implement Velox's parallel
> mode in Gluten.
> My question is whether we have plan to support parallel mode in future, and
> when if it is in the feature list?
>
> The second question is about profiling tools.
>
> I want to collect the C++ code's hotspots & flame graph in one query, and
> to see which function in Velox is on the critical path in my case. I just
> found the memo about the ClickHouse backend
> (incubator-gluten/docs/developers/UsingGperftoolsInCH.md at main ·
> apache/incubator-gluten · GitHub). Is there any memo which I can follow
> for the Velox backend?
>
> Thanks a lot
>
> Best Regards
> Pan Yong

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
