[GitHub] [arrow-datafusion-python] simicd opened a new issue, #158: Improve string representation of datafusion classes (dataframe, context, expression, ...)

via GitHub Sun, 29 Jan 2023 12:00:49 -0800


simicd opened a new issue, #158:
URL: https://github.com/apache/arrow-datafusion-python/issues/158


   
   
   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   When using the Python language bindings, the string representations of 
datafusion objects could be more useful. Currently they show Python's default 
string representation, e.g.:
   
   ```python
   >>> df
   <datafusion.DataFrame object at 0x0000018AA0220ED0>
   
   >>> literal(3.14)
   <datafusion.Expression object at 0x0000018AA026A3F0>
   
   >>> f.ceil(column("age"))
   <datafusion.Expression object at 0x0000018AA026A4E0>
   
   >>> accum(column("a"))
   <datafusion.Expression object at 0x000001FB6C450C60>
   
   >>> Config()
   <datafusion.Config object at 0x000001C73AD87230>
   
   >>> ctx = SessionContext()
   >>> ctx
   <datafusion.SessionContext object at 0x0000020C86742170>
   
   >>> ctx.catalog()
   <datafusion.Catalog object at 0x0000020C867A65A0>
   
   >>> ctx.catalog().database()
   <datafusion.Database object at 0x0000020C867A66F0>
   
   >>> ctx.catalog().database().table("t")
   <datafusion.Table object at 0x0000020C867A67E0>
   ```
   
   
   Other packages such as pandas or polars provide more specific outputs:
   
   ```python
   >>> pandas_df = pd.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
   >>> pandas_df
      a      b
   0  1  Hello
   1  2  World
   2  3      !
   
   
   >>> polars_df = pl.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
   >>> polars_df
   shape: (3, 2)
   ┌─────┬───────┐
   │ a   ┆ b     │
   │ --- ┆ ---   │
   │ i64 ┆ str   │
   ╞═════╪═══════╡
   │ 1   ┆ Hello │
   ├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
   │ 2   ┆ World │
   ├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
   │ 3   ┆ !     │
   └─────┴───────┘
   ``` 
   
   
   
   **Describe the solution you'd like**
   Ideally, customize the information displayed in the debugger/console. I raised PR 
#... which implements Python's `__repr__` method for the datafusion classes 
listed below. This method is called to populate the debugger in VS Code, 
among other tools.
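
To illustrate the mechanism, here is a minimal pure-Python sketch of how defining `__repr__` changes what the console and debugger display. The classes `PlainTable` and `DescribedTable` are hypothetical stand-ins, not datafusion classes; the actual bindings may implement this differently.

```python
class PlainTable:
    """Hypothetical class without __repr__; Python uses its default form."""

    def __init__(self, kind):
        self.kind = kind


class DescribedTable(PlainTable):
    """Same hypothetical class, now with a custom __repr__."""

    def __repr__(self):
        # Called by repr(), the interactive console, and most debuggers.
        return f"Table(kind={self.kind})"


print(repr(PlainTable("physical")))      # default form: <... object at 0x...>
print(repr(DescribedTable("physical")))  # Table(kind=physical)
```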
   
   <details><summary>Debugging in VS Code - Before</summary>
   <img src="https://user-images.githubusercontent.com/10134699/215352233-db3a600b-44f9-4db1-89cf-bf0a7b41a6d3.png" width="600">
   </details>
   <details><summary>Debugging in VS Code - After</summary>
   <img src="https://user-images.githubusercontent.com/10134699/215352522-8548ed78-b1e6-47f8-b4af-55339df4f76b.png" width="600">
   </details>
   
   Below is an overview of the proposed outputs. I am very curious to hear your 
thoughts/feedback:
   
   ### DataFrame
   Print up to ten rows of the dataframe.
   ```python
   >>> df
   DataFrame()
   +---+-------+
   | a | b     |
   +---+-------+
   | 1 | Hello |
   | 2 | World |
   | 3 | !     |
   +---+-------+
   ```
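
As a rough sketch of the "up to ten rows" idea, the following pure-Python function renders column names and rows in the same bordered layout as above and truncates after `max_rows`. The function name `format_table` and its parameters are assumptions made for illustration; they are not part of the datafusion API.

```python
def format_table(columns, rows, max_rows=10):
    """Render at most max_rows rows as a bordered text table (illustrative only)."""
    shown = [[str(value) for value in row] for row in rows[:max_rows]]
    # Each column is as wide as its longest cell, including the header.
    widths = [
        max([len(str(name))] + [len(row[i]) for row in shown])
        for i, name in enumerate(columns)
    ]
    border = "+" + "+".join("-" * (width + 2) for width in widths) + "+"

    def line(cells):
        padded = (f" {str(cell).ljust(width)} " for cell, width in zip(cells, widths))
        return "|" + "|".join(padded) + "|"

    body = [line(row) for row in shown]
    return "\n".join([border, line(columns), border, *body, border])


print(format_table(["a", "b"], [(1, "Hello"), (2, "World"), (3, "!")]))
```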
   
   ### Expressions
   ```python
   >>> literal(3.14)
   Expr(Float64(3.14))
   
   >>> f.ceil(column("age"))
   Expr(ceil(age))
   
   >>> accum(column("a"))
   Expr(MissingMethods(a))
   ```
   
   ### Context
   Use the available identifiers or properties. It would be nice if Catalog, 
Database and Table each had a unique identifier or name, but I didn't find 
such properties (in the example below, the Table object doesn't seem to know 
it is labeled "t"; only the Database object seems to store that information).
   
   
   ```python
   >>> ctx = SessionContext()
   >>> ctx
   SessionContext(session_id=7f2a7ddc-aa43-4900-a0e5-d22493c947e6)
   
   >>> ctx.catalog()
   Catalog(schema_names=[public])   # Ideally `Catalog(name=datafusion, schema_names=[public])`
   
   >>> ctx.catalog().database()
   Database(table_names=[t])        # Ideally `Database(name=public, table_names=[t])`
   
   >>> ctx.catalog().database().table("t")
   Table(kind=physical)             # Ideally `Table(name=t, kind=physical)`
   ```
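
One way to handle the missing names is to build the string from whatever properties are available and skip the rest. The sketch below shows that idea in plain Python; the helper `describe` is hypothetical and not part of datafusion. It produces the current output when `name` is unknown and the "ideal" output once a name is available.

```python
def describe(class_name, **fields):
    """Build 'Class(key=value, ...)' from the fields that are not None (illustrative)."""

    def show(value):
        # Render lists without quotes, e.g. [public], to match the outputs above.
        if isinstance(value, (list, tuple)):
            return "[" + ", ".join(str(item) for item in value) + "]"
        return str(value)

    parts = [f"{key}={show(value)}" for key, value in fields.items() if value is not None]
    return f"{class_name}({', '.join(parts)})"


print(describe("Catalog", name=None, schema_names=["public"]))
print(describe("Catalog", name="datafusion", schema_names=["public"]))
print(describe("Table", name="t", kind="physical"))
```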
   
   ### Configuration
   ```python
   >>> config = Config()
   >>> config
   Config({'datafusion.catalog.create_default_catalog_and_schema': 'true', 
'datafusion.catalog.default_catalog': 'datafusion', 
'datafusion.catalog.default_schema': 'public', 
'datafusion.catalog.information_schema': 'false', 
'datafusion.catalog.location': None, 'datafusion.catalog.format': None, 
'datafusion.catalog.has_header': 'false', 'datafusion.execution.batch_size': 
'8192', 'datafusion.execution.coalesce_batches': 'true', 
'datafusion.execution.collect_statistics': 'false', 
'datafusion.execution.target_partitions': '20', 
'datafusion.execution.time_zone': '+00:00', 
'datafusion.execution.parquet.enable_page_index': 'false', 
'datafusion.execution.parquet.pruning': 'true', 
'datafusion.execution.parquet.skip_metadata': 'true', 
'datafusion.execution.parquet.metadata_size_hint': None, 
'datafusion.execution.parquet.pushdown_filters': 'false', 
'datafusion.execution.parquet.reorder_filters': 'false', 
'datafusion.optimizer.enable_round_robin_repartition': 'true', 
'datafusion.optimizer.filter_null_join_keys': 'false', 
'datafusion.optimizer.repartition_aggregations': 
'true', 'datafusion.optimizer.repartition_joins': 'true', 
'datafusion.optimizer.repartition_windows': 'true', 
'datafusion.optimizer.skip_failed_rules': 'true', 
'datafusion.optimizer.max_passes': '3', 
'datafusion.optimizer.top_down_join_key_reordering': 'true', 
'datafusion.optimizer.prefer_hash_join': 'true', 
'datafusion.optimizer.hash_join_single_partition_threshold': '1048576', 
'datafusion.explain.logical_plan_only': 'false', 
'datafusion.explain.physical_plan_only': 'false'})
   ```
   
   
   
   
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   n/a
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
