Hi everyone!

If you are reading this, it means that you fell into the trap of my catchy
(but meaningless) title!

This discussion somewhat relates to [1].

DataFusion has recently made its top-level "actions" (collect, write...)
async. The problem is that most of the codebase is not async (in particular
Parquet [2]), which means that you have to make an async context work
together with a sync one.

This works okay... until it doesn't!

I am trying to read into DataFusion from S3, using Rusoto, an AWS SDK for
Rust. The problem is that this SDK is itself async. This means that you end
up with the following layers:

  DataFusion (async) -> Parquet (sync) -> Rusoto (async)

As you might know, Tokio does not support blocking on a runtime from within
a runtime.
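
To make the layering concrete, here is a minimal sketch of one possible
workaround: the sync layer hands the async call to a separate thread and
blocks on a channel instead of on the runtime. Everything here is an
assumption for illustration. `fetch_range` is a hypothetical stand-in for a
Rusoto request, `read_range_sync` is a hypothetical facade that a sync reader
could call, and the tiny `block_on` replaces a real runtime so that the
example only needs the standard library. With Tokio, the spawned thread would
own its own runtime instead.

```rust
use std::future::Future;
use std::sync::{mpsc, Arc};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is parked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Tiny single-future executor, a stand-in for a real runtime's `block_on`.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async fetch, standing in for an S3 range request.
async fn fetch_range(start: u64, len: u64) -> Vec<u8> {
    (start..start + len).map(|b| b as u8).collect()
}

/// Hypothetical sync facade: the async call is driven on a separate thread,
/// and the caller blocks only on a channel, never on a runtime.
fn read_range_sync(start: u64, len: u64) -> Vec<u8> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let bytes = block_on(fetch_range(start, len));
        // The receiver may already be gone; ignore the send error then.
        let _ = tx.send(bytes);
    });
    rx.recv().expect("fetch thread terminated without a result")
}
```

Calling `read_range_sync(0, 4)` from sync code returns the four bytes
produced by the async function. Note the cost: the calling thread is still
blocked while it waits, so if it is itself an executor worker thread, that
worker is tied up for the duration of the request. This avoids nesting
runtimes but does not make the Parquet layer async.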

This triggers a set of questions:
- Does anybody know a way to make such a setup work?
- Making Parquet async is extremely difficult and would be a breaking
change. Should we try to do it [2]?
- Is the benefit of having DataFusion async really big? Should we maybe
have both a sync and an async API?

Thanks everybody and have a wonderful day.

Regards,

Remi

[1] https://issues.apache.org/jira/browse/ARROW-9464
[2] https://issues.apache.org/jira/browse/ARROW-10307
