andygrove opened a new issue, #1398:
URL: https://github.com/apache/datafusion-ballista/issues/1398
## Summary
Improve the Jupyter notebook experience for Ballista by adding SQL magic
commands, example notebooks, and notebook-specific features. While basic
notebook support already works via `_repr_html_`, there's an opportunity to
provide a richer, more integrated experience.
## Current State
PyBallista already supports basic Jupyter usage:
```python
from ballista import BallistaSessionContext
ctx = BallistaSessionContext("df://localhost:50050")
df = ctx.sql("SELECT * FROM my_table LIMIT 10")
df # Renders as HTML table via _repr_html_()
```
**What works today:**
- `_repr_html_()` - DataFrames render as styled HTML tables
- `to_pandas()` / `to_arrow_table()` / `to_polars()` - Data conversion
- `show()` - Terminal-style output
- Example `.py` files with `# %%` cell markers
## Proposed Improvements
### Phase 1: Documentation & Examples (Low Effort)
1. **Add example Jupyter notebooks** to `python/examples/`:
- `getting_started.ipynb` - Basic connection and queries
- `dataframe_api.ipynb` - DataFrame transformations
- `distributed_queries.ipynb` - Multi-stage query examples
2. **Document notebook support** in `python/README.md`
### Phase 2: SQL Magic Commands (Medium Effort)
Add IPython magic commands for a more interactive SQL experience:
```python
%load_ext ballista.jupyter
# Connect to cluster
%ballista connect df://localhost:50050
# Line magic for simple queries
%sql SELECT COUNT(*) FROM orders
# Cell magic for complex queries
%%sql
SELECT
customer_id,
SUM(amount) as total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10
```
**Implementation sketch:**
```python
# ballista/jupyter.py
from IPython.core.magic import Magics, magics_class, line_magic, cell_magic
@magics_class
class BallistaMagics(Magics):
def __init__(self, shell):
super().__init__(shell)
self.ctx = None
@line_magic
def ballista(self, line):
"""Ballista commands: connect, status, disconnect"""
cmd, *args = line.split()
if cmd == "connect":
self.ctx = BallistaSessionContext(args[0])
return f"Connected to {args[0]}"
elif cmd == "status":
# Show cluster status
pass
@cell_magic
def sql(self, line, cell):
"""Execute SQL query"""
if self.ctx is None:
raise ValueError("Not connected. Use: %ballista connect
df://host:port")
return self.ctx.sql(cell)
def load_ipython_extension(ipython):
ipython.register_magics(BallistaMagics)
```
**Alternative**: Integrate with
[JupySQL](https://github.com/ploomber/jupysql) which provides a mature `%%sql`
magic with features like:
- Query composition
- Result caching
- Plotting integration
- Multiple connection management
### Phase 3: Enhanced Notebook Features (Medium Effort)
1. **Query plan visualization**
```python
df.explain_visual() # Render SVG of execution plan in notebook
```
Leverage existing `/api/job/{job_id}/dot_svg` endpoint.
2. **Progress indicators for long queries**
```python
# Show progress bar during distributed query execution
from ipywidgets import FloatProgress
```
3. **Result size warnings**
```python
# Warn before collecting large results
df.collect() # Warning: Query will return ~1M rows. Use .limit() or
proceed? [y/N]
```
4. **Schema exploration**
```python
%ballista tables # List registered tables
%ballista schema orders # Show schema for table
```
## Benefits
1. **Lower barrier to entry** - SQL magic is familiar to data scientists
2. **Interactive exploration** - Faster iteration in notebooks
3. **Discoverability** - Example notebooks show what's possible
4. **Ecosystem alignment** - Follows patterns from ipython-sql, JupySQL,
DuckDB
## Prior Art
- [JupySQL](https://github.com/ploomber/jupysql) - Modern SQL magic for
Jupyter
- [ipython-sql](https://github.com/catherinedevlin/ipython-sql) - Original
`%%sql` magic
- [DuckDB Jupyter](https://duckdb.org/docs/guides/python/jupyter.html) -
DuckDB's notebook integration
- [Spark magic](https://github.com/baifengbai/sql_magic) - SQL magic for
Spark
## Implementation Checklist
- [ ] Add example `.ipynb` notebooks to `python/examples/`
- [ ] Document notebook support in Python README
- [ ] Create `ballista.jupyter` module with magic commands
- [ ] Add `%ballista connect/status/tables/schema` line magics
- [ ] Add `%%sql` cell magic
- [ ] Add `explain_visual()` method for query plan rendering
- [ ] Consider JupySQL integration as alternative/complement
- [ ] Add progress indicator support for long-running queries
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]