Tokio has a function `spawn_blocking`
<https://docs.rs/tokio/0.3.2/tokio/task/fn.spawn_blocking.html> that runs
synchronous / blocking code on a dedicated pool of blocking threads and
returns a handle you can await from async code. You can finagle pretty much
any combination of sync / async using `spawn_blocking` and channels, though
the resulting code may not be the most beautiful.
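
To make the channel half of that concrete, here is a minimal std-only
sketch of one way sync code could call into async code: a bridge thread
owns the executor, and sync callers talk to it over channels, so nothing
ever blocks on a runtime from within a runtime. Everything named here is
hypothetical and for illustration only: `fetch_range` stands in for an
async S3 call, the hand-rolled `block_on` stands in for a real runtime's
(it is not Tokio's), and `Request`, `spawn_bridge` and `read_blocking` are
made-up names, not DataFusion, Parquet or Rusoto APIs.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::mpsc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the bridge thread when a pending future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Tiny single-future executor (assumption: stand-in for a real runtime).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async fetch, standing in for an async S3 client call.
async fn fetch_range(start: usize, len: usize) -> Vec<u8> {
    (start..start + len).map(|i| (i % 256) as u8).collect()
}

/// A request from sync code: a byte range plus a channel for the reply.
struct Request {
    start: usize,
    len: usize,
    reply: mpsc::Sender<Vec<u8>>,
}

/// Spawn the bridge thread; it alone drives the async code.
fn spawn_bridge() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // Loop ends when every sender has been dropped.
        for req in rx {
            let bytes = block_on(fetch_range(req.start, req.len));
            // Ignore send errors: the caller may have stopped waiting.
            let _ = req.reply.send(bytes);
        }
    });
    tx
}

/// Blocking read that plain sync code can call.
fn read_blocking(bridge: &mpsc::Sender<Request>, start: usize, len: usize) -> Vec<u8> {
    let (reply, response) = mpsc::channel();
    bridge
        .send(Request { start, len, reply })
        .expect("bridge thread stopped");
    response.recv().expect("bridge dropped the request")
}

fn main() {
    let bridge = spawn_bridge();
    let bytes = read_blocking(&bridge, 10, 4);
    println!("{:?}", bytes);
}
```

Note that `read_blocking` parks the calling thread until the reply
arrives, so if the sync layer is itself invoked from async code, that
outer call would still want to go through `spawn_blocking` rather than run
directly on an async worker thread.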

Once you introduce `async` into a project or use an `async` library like
Rusoto, it feels to me like Rust leads you towards pushing async all the way
down. Indeed, given the use case you describe, async all the way down would
be the easiest thing for you.

I personally think having an async implementation of parquet would be very
valuable, as more and more Rust uses tokio / async IO. Maybe we could
implement an optional async interface on top of the blocking implementation.

Likewise, having a sync API and an async API for DataFusion also seems
valuable to me.

In my opinion, the biggest benefit from having DataFusion use tokio/async
is a single unified thread pool and execution model for both CPU and IO
work. Prior to being async-ized with the tokio thread pool, DataFusion
spawned / managed threads on its own; adding additional parallelism without
oversubscribing the CPU was likely going to be a significant effort. There
is a thread
<https://lists.apache.org/thread.html/rbc4535613cb9af3467255234b49222bb8d3e57ef91790ebeff66aa74%40%3Cdev.arrow.apache.org%3E>
on this mailing list about a similar challenge in the C++ implementation,
to give you a sense of the kinds of issues we are hoping to avoid in
DataFusion by using async.

Andrew


On Fri, Oct 30, 2020 at 4:28 AM Rémi Dettai <rdet...@gmail.com> wrote:

> Hi everyone!
>
> If you are reading this, it means that you fell into the trap of my catchy
> (but meaningless) title!
>
> This discussion somewhat relates to [1].
>
> DataFusion has recently made its top level "actions" (collect, write...)
> async. The problem is that most of the codebase is not async (in particular
> Parquet [2]), which means that you have to make an async context work
> together with a sync one.
>
> This works okay... until it doesn't!
>
> I am trying to read into DataFusion from S3, using the AWS Rust SDK Rusoto.
> The problem is that this SDK is itself async. This means that you end up
> with the following layers:
> DataFusion (async) -> Parquet (sync) -> Rusoto (async)
> As you might know, Tokio does not support blocking on a runtime from within
> a runtime.
>
> This triggers a set of questions:
> - Does anybody know a way to make such a setup work?
> - Making Parquet async is extremely difficult and breaking, should we try
> to do it [2] ?
> - Is the benefit of having DataFusion async really big? Should we maybe
> have both a sync and an async API ?
>
> Thanks everybody and have a wonderful day.
>
> Regards,
>
> Remi
>
> [1] https://issues.apache.org/jira/browse/ARROW-9464
> [2] https://issues.apache.org/jira/browse/ARROW-10307
>
