[I] Ballista Python Issue(s) [datafusion-ballista]

via GitHub Fri, 29 Nov 2024 01:36:24 -0800


milenkovicm opened a new issue, #1142:
URL: https://github.com/apache/datafusion-ballista/issues/1142


   First of all, I'm not an expert in Rust-Python (pyo3) integration, so my
   apologies if I've done or said something stupid.
   
   The current implementation of (py)ballista has a limitation when it comes
   to `DataFrame` operations.
   
   The following code results in an error:
   
   ```python
   from pyballista import BallistaBuilder
   from datafusion import SessionContext
   from datafusion import functions as f
   
   # %%
   ctx: SessionContext = BallistaBuilder()\
       .config("ballista.job.name", "example ballista")\
       .config("ballista.shuffle.partitions", "16")\
       .standalone()
       
   df = ctx.sql("SELECT 1 as r").aggregate(
       [f.col("r")], [f.count_star()]
   )
   df.show()
   ```
   
   It throws an exception similar to:
   
   ```text
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   File /Users/user/git/arrow-ballista/python/examples/example.py:3
         1 # %% 
         2 # Select 1 to verify its working
   ----> 3 df = ctx.sql("SELECT 1 as r").aggregate(
         4     [f.col("r")], [f.count_star()]
         5 )
         6 df.show()
   
   TypeError: argument 'group_by': 'Expr' object cannot be converted to 'Expr'
   ```
   
   Actually, the previous implementation had the same problem; the same error
   is thrown (`git checkout 2f223db21557c15080bf865ac692d276b8f0b770`):
   
   ```python
   # %%
   from pyballista import SessionContext
   from datafusion import functions as f
   
   ctx = SessionContext("localhost", 50050)
   
   df = ctx.sql("SELECT 1 as r").aggregate(
       [f.col("r")], [f.count_star()]
   )
   df.show()
   ```
   
   A similar issue occurs if `SessionConfig` is used:
   
   ```python
   # The config classes come from datafusion here (see the next import).
   from ballista import Ballista
   from datafusion import RuntimeConfig, SessionConfig, SessionContext
   
   runtime = RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(10000000)
   config = (
       SessionConfig()
       .with_create_default_catalog_and_schema(True)
       .with_default_catalog_and_schema("foo", "bar")
       .with_target_partitions(8)
       .with_information_schema(True)
       .with_repartition_joins(False)
       .with_repartition_aggregations(False)
       .with_repartition_windows(False)
       .with_parquet_pruning(False)
       .set("datafusion.execution.parquet.pushdown_filters", "true")
   )
   
   # %%
   # `.with_runtime(runtime)` panics, complaining that a `RuntimeConfig`
   # object cannot be converted to `RuntimeConfig`.
   ctx: SessionContext = (
       Ballista.builder
       .with_runtime(runtime)
       .with_config(config)
       .standalone()
   )
   
   ctx.sql("SELECT 1").show()
   ```
   
   The problem with `RuntimeConfig` and `SessionConfig` could be solved if
   they were re-exported in ballista:
   
   ```python
   from ballista import Ballista, RuntimeConfig, SessionConfig
   from datafusion import SessionContext
   ```
   
   but the first problem with `DataFrame` would still remain.
   
   My guess is that there is an FFI issue because ballista and datafusion are
   different packages. I'm not sure exactly what the problem is or how to
   resolve it.
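
One possible reading of the `'Expr' object cannot be converted to 'Expr'` message: if ballista and datafusion each ship their own compiled copy of the class, Python sees two distinct types that merely share a name, and the argument check compares type identity rather than the name. The pure-Python sketch below imitates that situation. The module names `datafusion_copy` and `ballista_copy` and the `aggregate` function are hypothetical stand-ins, not the real bindings, and this illustrates the guess above rather than a confirmed diagnosis.

```python
def make_expr_class(module_name: str) -> type:
    """Create a new class named `Expr`, standing in for one compiled copy."""
    return type("Expr", (), {"__module__": module_name})


# Two same-named classes, as if registered by two separate extension modules.
DatafusionExpr = make_expr_class("datafusion_copy")  # hypothetical module name
BallistaExpr = make_expr_class("ballista_copy")      # hypothetical module name


def aggregate(group_by: object) -> str:
    """Mimic a binding that only accepts its own copy of `Expr`."""
    if not isinstance(group_by, BallistaExpr):
        raise TypeError(
            f"argument 'group_by': '{type(group_by).__name__}' object "
            f"cannot be converted to '{BallistaExpr.__name__}'"
        )
    return "accepted"


def try_aggregate(value: object) -> str:
    """Return the result, or the error message if the type check fails."""
    try:
        return aggregate(value)
    except TypeError as err:
        return str(err)
```

Calling `try_aggregate(BallistaExpr())` passes the check, while `try_aggregate(DatafusionExpr())` returns an error message with the same shape as the one in the traceback, even though both classes are called `Expr`.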
   
   @timsaucer's comment
   <https://github.com/apache/datafusion-ballista/issues/1091#issuecomment-2436167064>
   makes more sense to me now.
   
   ### Possible Solution (I)
   
   One obvious way would be to move ballista context creation to
   datafusion-python. We need one-line context creation:
   
   ```rust
   let ctx = datafusion::prelude::SessionContext::remote("df://localhost:50050").await?;
   ```
   
   As the ballista context is a `SessionContext`, it would be trivial to
   integrate and, I believe, it would avoid the previous issues.
   
   We could provide only a "remote context" (no standalone), making it an
   optional feature that Python datafusion users could opt in to. This would
   somewhat limit the number of libraries ballista would bring to
   datafusion-python (we could split core into core and client-core to
   further reduce deps).
   
   This proposal would mean bringing an optional dependency to
   datafusion-python, and additional complexity in the (datafusion-python)
   release process.
   
   (py)ballista would stay; it could expose scheduler and executor control as
   proposed in <https://github.com/apache/datafusion-ballista/issues/1107>.
   
   A big risk of this proposal is that ballista could block datafusion-python
   releases if it goes back to unmaintained mode.
   
   ### Possible Solution (II)
   
   Another possible solution is to re-export all classes from
   datafusion-python in ballista. I'm not sure how complex or practical this
   would be. I'm also not sure whether datafusion-python applications would
   need some kind of rewriting to be able to run on ballista.
   
   This would put additional responsibility on ballista maintainers (of whom
   there are not many).
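
To gauge how much re-exporting this would involve, one could compare the public names of the two packages. The sketch below uses stand-in modules built with `types.ModuleType` so it runs without either package installed. The helper `missing_reexports` and the stand-in contents are hypothetical; with the real packages one would pass the imported `datafusion` and `ballista` modules instead.

```python
import types


def public_names(module: types.ModuleType) -> set[str]:
    """Names a module exposes publicly (no leading underscore)."""
    return {name for name in vars(module) if not name.startswith("_")}


def missing_reexports(source: types.ModuleType, target: types.ModuleType) -> list[str]:
    """Public names in `source` that `target` does not expose."""
    return sorted(public_names(source) - public_names(target))


# Stand-in modules; the real packages expose many more names.
fake_datafusion = types.ModuleType("fake_datafusion")
for name in ("SessionContext", "SessionConfig", "RuntimeConfig", "DataFrame"):
    setattr(fake_datafusion, name, type(name, (), {}))

fake_ballista = types.ModuleType("fake_ballista")
for name in ("Ballista", "SessionConfig", "RuntimeConfig"):
    setattr(fake_ballista, name, type(name, (), {}))

print(missing_reexports(fake_datafusion, fake_ballista))
```

A name being present only shows that it can be imported from ballista. Submodules such as `datafusion.functions`, used in the first example, would need the same treatment.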
   
   ### Any Other Solution?
   
   I'm not sure; I'm open to suggestions.
   
   ## Proposal
   
   Short-term proposal:
   
   - Remove the test used to test (py)ballista code on ballista commits
     <https://github.com/apache/datafusion-ballista/blob/81cfa632f94ef794cc9f35c81676cf9a010c1dbe/.github/workflows/rust.yml#L121-L122>.
   - Update datafusion to the latest version
     <https://github.com/apache/datafusion-ballista/pull/1125>.
   - Focus on the ballista (Rust) release
     <https://github.com/apache/datafusion-ballista/issues/974>.
   
   We should release (py)ballista once we figure out the best approach to fix
   it.
   


