simicd opened a new issue, #158:
URL: https://github.com/apache/arrow-datafusion-python/issues/158
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
When using the Python bindings, the string representations of DataFusion
objects could be more useful. Currently they fall back to Python's default
object representation, e.g.:
```python
>>> df
<datafusion.DataFrame object at 0x0000018AA0220ED0>
>>> literal(3.14)
<datafusion.Expression object at 0x0000018AA026A3F0>
>>> f.ceil(column("age"))
<datafusion.Expression object at 0x0000018AA026A4E0>
>>> accum(column("a"))
<datafusion.Expression object at 0x000001FB6C450C60>
>>> Config()
<datafusion.Config object at 0x000001C73AD87230>
>>> ctx = SessionContext()
>>> ctx
<datafusion.SessionContext object at 0x0000020C86742170>
>>> ctx.catalog()
<datafusion.Catalog object at 0x0000020C867A65A0>
>>> ctx.catalog().database()
<datafusion.Database object at 0x0000020C867A66F0>
>>> ctx.catalog().database().table("t")
<datafusion.Table object at 0x0000020C867A67E0>
```
Other packages such as pandas or polars provide more informative output:
```python
>>> pandas_df = pd.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
>>> pandas_df
   a      b
0  1  Hello
1  2  World
2  3      !
>>> polars_df = pl.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
>>> polars_df
shape: (3, 2)
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ Hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ World │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ !     │
└─────┴───────┘
```
<br/><br/>
**Describe the solution you'd like**
Ideally, customize the information displayed in the debugger/console. I raised
PR #..., which implements Python's `__repr__` method for the datafusion classes
listed below. This method is called to populate the debugger in VS Code,
among others.
<details><summary>Debugging in VS Code - Before</summary>
<img
src="https://user-images.githubusercontent.com/10134699/215352233-db3a600b-44f9-4db1-89cf-bf0a7b41a6d3.png"
width="600">
</details>
<details><summary>Debugging in VS Code - After</summary>
<img
src="https://user-images.githubusercontent.com/10134699/215352522-8548ed78-b1e6-47f8-b4af-55339df4f76b.png"
width="600">
</details>
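
To illustrate why `__repr__` is the relevant hook, here is a minimal pure-Python sketch. The classes `Plain` and `Described` are hypothetical and are not part of datafusion; the actual bindings may implement the method differently.

```python
class Plain:
    """Hypothetical class that keeps Python's default representation."""


class Described:
    """Hypothetical class that overrides __repr__."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id

    def __repr__(self) -> str:
        # repr(), the interactive prompt, and most debuggers call this method.
        return f"{type(self).__name__}(session_id={self.session_id})"


plain_text = repr(Plain())
described_text = repr(Described("abc-123"))

print(plain_text)      # default form: <module.Plain object at 0x...>
print(described_text)  # Described(session_id=abc-123)
```

Without the override, the output only identifies the type and memory address, which is what the "Before" screenshot shows.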
Below is an overview of the proposed outputs. I'm very curious to hear your
thoughts/feedback:
### Dataframe
Print up to ten rows of the DataFrame:
```python
>>> df
DataFrame()
+---+-------+
| a | b     |
+---+-------+
| 1 | Hello |
| 2 | World |
| 3 | !     |
+---+-------+
```
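
The row cap could work roughly as in the sketch below. This is a pure-Python illustration under stated assumptions: `TinyFrame`, `format_table`, and `MAX_ROWS` are hypothetical names, and the real bindings would presumably render Arrow record batches rather than Python tuples.

```python
MAX_ROWS = 10  # the proposal prints at most ten rows


def format_table(columns, rows, max_rows=MAX_ROWS):
    """Render column names and the first `max_rows` rows as an ASCII table."""
    shown = [[str(value) for value in row] for row in rows[:max_rows]]
    # Each column is as wide as its longest cell, header included.
    widths = [
        max([len(name)] + [len(row[i]) for row in shown])
        for i, name in enumerate(columns)
    ]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(cells):
        padded = (cell.ljust(w) for cell, w in zip(cells, widths))
        return "| " + " | ".join(padded) + " |"

    return "\n".join([border, line(columns), border, *map(line, shown), border])


class TinyFrame:
    """Hypothetical stand-in for a DataFrame, holding rows in a list."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = list(rows)

    def __repr__(self):
        return "DataFrame()\n" + format_table(self.columns, self.rows)


small = TinyFrame(["a", "b"], [(1, "Hello"), (2, "World"), (3, "!")])
large = TinyFrame(["n"], [(i,) for i in range(25)])
print(small)
print(large)  # only the first ten of the 25 rows are rendered
```

Capping the output keeps `repr()` cheap and readable even for large results, which matters because debuggers may call it on every step.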
### Expressions
```python
>>> literal(3.14)
Expr(Float64(3.14))
>>> f.ceil(column("age"))
Expr(ceil(age))
>>> accum(column("a"))
Expr(MissingMethods(a))
```
### Context
Use available identifiers or properties. It would be nice if Catalog,
Database and Table had a unique identifier or name, but I didn't find
such properties (in the example below, the Table object doesn't seem to know
it's labeled "t"; only the Database object seems to store that information).
```python
>>> ctx = SessionContext()
>>> ctx
SessionContext(session_id=7f2a7ddc-aa43-4900-a0e5-d22493c947e6)
>>> ctx.catalog()
Catalog(schema_names=[public])  # Ideally `Catalog(name=datafusion, schema_names=[public])`
>>> ctx.catalog().database()
Database(table_names=[t])  # Ideally `Database(name=public, table_names=[t])`
>>> ctx.catalog().database().table("t")
Table(kind=physical) # Ideally `Table(name=t, kind=physical)`
```
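
One way to produce both the current and the "ideal" forms from the same code is to omit fields that are not available. The helper below is a hypothetical sketch (`describe` is not a datafusion function) that skips any field whose value is `None`:

```python
def describe(type_name, **fields):
    """Build `Type(key=value, ...)`, leaving out fields whose value is None."""
    parts = []
    for key, value in fields.items():
        if value is None:
            continue  # property not available on this object
        if isinstance(value, (list, tuple)):
            # Match the bracketed, unquoted style used in the proposal.
            value = "[" + ", ".join(map(str, value)) + "]"
        parts.append(f"{key}={value}")
    return f"{type_name}({', '.join(parts)})"


# Name unknown: matches the currently proposed output.
without_name = describe("Catalog", name=None, schema_names=["public"])
# Name known: matches the "ideal" output.
with_name = describe("Catalog", name="datafusion", schema_names=["public"])
table = describe("Table", name="t", kind="physical")

print(without_name)
print(with_name)
print(table)
```

With this approach, the representations would gain the `name=` field automatically once the underlying objects expose it.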
### Configuration
```python
>>> config = Config()
>>> config
Config({'datafusion.catalog.create_default_catalog_and_schema': 'true',
'datafusion.catalog.default_catalog': 'datafusion',
'datafusion.catalog.default_schema': 'public',
'datafusion.catalog.information_schema': 'false',
'datafusion.catalog.location': None, 'datafusion.catalog.format': None,
'datafusion.catalog.has_header': 'false', 'datafusion.execution.batch_size':
'8192', 'datafusion.execution.coalesce_batches': 'true',
'datafusion.execution.collect_statistics': 'false',
'datafusion.execution.target_partitions': '20',
'datafusion.execution.time_zone': '+00:00',
'datafusion.execution.parquet.enable_page_index': 'false',
'datafusion.execution.parquet.pruning': 'true',
'datafusion.execution.parquet.skip_metadata': 'true',
'datafusion.execution.parquet.metadata_size_hint': None,
'datafusion.execution.parquet.pushdown_filters': 'false',
'datafusion.execution.parquet.reorder_filters': 'false',
'datafusion.optimizer.enable_round_robin_repartition': 'true',
'datafusion.optimizer.filter_null_join_keys': 'false',
'datafusion.optimizer.repartition_aggregations':
'true', 'datafusion.optimizer.repartition_joins': 'true',
'datafusion.optimizer.repartition_windows': 'true',
'datafusion.optimizer.skip_failed_rules': 'true',
'datafusion.optimizer.max_passes': '3',
'datafusion.optimizer.top_down_join_key_reordering': 'true',
'datafusion.optimizer.prefer_hash_join': 'true',
'datafusion.optimizer.hash_join_single_partition_threshold': '1048576',
'datafusion.explain.logical_plan_only': 'false',
'datafusion.explain.physical_plan_only': 'false'})
```
**Describe alternatives you've considered**
n/a
**Additional context**
n/a