milenkovicm opened a new issue, #1142:
URL: https://github.com/apache/datafusion-ballista/issues/1142
First of all, I'm not expert in rust-python (pyo3) integration, if I've
done/said something stupid,
my apologies.
Current implementation of (py)ballista has limitation when it comes to
`DataFrame` operations.
following code will result with an error:
```python
from pyballista import BallistaBuilder
from datafusion import SessionContext
from datafusion import functions as f
# %%
ctx: SessionContext = BallistaBuilder()\
.config("ballista.job.name", "example ballista")\
.config("ballista.shuffle.partitions", "16")\
.standalone()
df = ctx.sql("SELECT 1 as r").aggregate(
[f.col("r")], [f.count_star()]
)
df.show()
```
it will throw exception (similar to):
```text
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File /Users/user/git/arrow-ballista/python/examples/example.py:3
1 # %%
2 # Select 1 to verify its working
----> 3 df = ctx.sql("SELECT 1 as r").aggregate(
4 [f.col("r")], [f.count_star()]
5 )
6 df.show()
TypeError: argument 'group_by': 'Expr' object cannot be converted to 'Expr'
```
Actually previous implementation had the same problem, the same error will
be thrown (`git checkout 2f223db21557c15080bf865ac692d276b8f0b770`)
```python
# %%
from pyballista import SessionContext
from datafusion import functions as f
ctx = SessionContext("localhost", 50050)
df = ctx.sql("SELECT 1 as r").aggregate(
[f.col("r")], [f.count_star()]
)
df.show()
```
The similar issue is there if `SessionConfig` is used:
```python
from ballista import Ballista, RuntimeConfig, SessionConfig
from datafusion import RuntimeConfig, SessionConfig, SessionContext
runtime =
RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(10000000)
config = (
SessionConfig()
.with_create_default_catalog_and_schema(True)
.with_default_catalog_and_schema("foo", "bar")
.with_target_partitions(8)
.with_information_schema(True)
.with_repartition_joins(False)
.with_repartition_aggregations(False)
.with_repartition_windows(False)
.with_parquet_pruning(False)
.set("datafusion.execution.parquet.pushdown_filters", "true")
)
# %%
ctx: SessionContext = Ballista.builder\
.with_runtime(runtime)\ # it will panic at this point, complaining that
`RuntimeConfig` object cannot be converted to `RuntimeConfig`
.with_config(config)\
.standalone()
ctx.sql("SELECT 1").show()
```
problem with `RuntimeConfig`, `SessionConfig` could be solved if they are
re-exported in ballista:
```python
from ballista import Ballista, RuntimeConfig, SessionConfig
from datafusion import SessionContext
```
but the first problem with `DataFrame` would still remain.
My guess is that there is FFI issue as ballista and datafusion is different
package, I'm not sure what the problem is nor how to resolve this issue.
@timsaucer comment
<https://github.com/apache/datafusion-ballista/issues/1091#issuecomment-2436167064>
make more sense to me now.
### Possible Solution (I)
One obvious way would be to move ballista context creation to
datafusion-python. We need one line context creation:
```rust
let ctx =
datafusion::prelude::SessionContext::remote("df://localhost:50050").await?;
```
As ballista context is the `SessionContext` it would be trivial to
integrate, and, I believe, it would avoid previous issues.
We could only provide "remote context" (no standalone), making it optional
feature for which users python datafusion users could to opt in. This would
somewhat limit number of libraries ballista would bring to datafusion-python
(we could split core to core and client-core to further reduce deps)
This proposal would mean that we would have to bring optional dependency to
datafusion-python, and additional complexity in (datafusion-python) release
process.
(py)ballista would stay, it could expose scheduler and executor control as
proposed in <https://github.com/apache/datafusion-ballista/issues/1107>
Big risk for of this proposal is that ballista could block datafusion python
release in case it goes back to unmaintained mode.
### Possible Solution (II)
Another possible solution is to re-export all classes from datafusion-python
in ballista. I'm not sure how complex or practical this is going to be.
I'm not sure if datafusion python applications would need some kind of
re-writing to be able to run on ballista.
This would put additional responsibility to ballista maintainers (not too
many of them).
### Any Other Solution?
I'm not sure, open to suggestions
## Proposal
Short term proposal:
- Remove test used to test (py)ballista code on ballista commit
<https://github.com/apache/datafusion-ballista/blob/81cfa632f94ef794cc9f35c81676cf9a010c1dbe/.github/workflows/rust.yml#L121-L122>
.
- update datafusion to latest
<https://github.com/apache/datafusion-ballista/pull/1125> .
- focus on ballista (rust) release
<https://github.com/apache/datafusion-ballista/issues/974> .
We should release (py)ballista once we figure out the best approach to fix
it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]