[I] [DISCUSSION] Lowering the barrier to new users (Lessons from-799 CMU Optimizer Class) [datafusion]

via GitHub Thu, 30 Jan 2025 08:40:12 -0800


alamb opened a new issue, #14373:
URL: https://github.com/apache/datafusion/issues/14373


   # TLDR
   * This is feedback from CMU on how to make DataFusion more compelling for 
Academic work.
   * Does anyone have suggestions about additional DataFusion projects to 
suggest (see below)?
   
   # Background
   
   CMU's DB group is running [CMU-799 OSpecial Topics in Databases: Query 
Optimization](https://15799.courses.cs.cmu.edu/spring2025/) this Spring 2025.
   
   > This course ... covers the classical and state-of-the-art methods and 
algorithms for converting SQL statements into physical query plans. Additional 
topics include cost models, feedback mechanisms, and adaptive query 
optimization. All class projects will be in the context of an open-source query 
optimizer service using real-world queries. The course is appropriate for 
graduate students in software systems and advanced undergraduates with nasty 
programming skills that are pursuing a database-centric lifestyle.
   
   I would very much like to encourage a virtuous cycle of research/academic 
work on DataFusion --> contributing back and thus want to make DataFusion a 
compelling target for this type of class. I think the more use of DataFusion 
the more contributions we will attract and thus the better it will get for 
everyone.
   
   Andy Pavlo connected me with @lmwnshn, who is helping organize the class, 
and provided the following feedback
   
   # Class Projects
   
   @lmwnshn noted that the class includes a project (see 
[Syllabus](https://15799.courses.cs.cmu.edu/spring2025/syllabus.html) here):
   
   > The main component of this course will be the group programming project. 
Students will organize into groups and to implement a large system / prototype. 
The projects are designed to be (1) relevant to the materials discussed in 
class and (2) require a significant programming effort from all team members.
   
   * Release Date: Feb 17, 2025
   * Due Date: May 01, 2025
   
   They were interested in working on some projects in DataFusion. Here were 
some ideas I had that I think might be good but would love to hear what other 
people think is important. This is also a great opportunity for others to get 
exposed to advanced techniques and learn how to work on large open source 
projects like DataFusion (obviously also how great our community is):
   
   - Dynamic Filter Pushdown: https://github.com/apache/datafusion/issues/7955
   - LATERAL JOINs: https://github.com/apache/datafusion/issues/10048
   - Range Joins / ASOF joins: https://github.com/apache/datafusion/issues/318
   
   **Ask**: Does anyone else know of other interesting potential projects we 
can suggest??
   
   # Improving the Ease of Student Onboarding
   
   @lmwnshn mentioned a few reasons he chose DuckDB over DataFusion for the 
[first 
project(link)](https://15799.courses.cs.cmu.edu/spring2025/project1.html), that 
were related to easier student onboarding
   
   ## "Speed to running TPCH"
   
   TPCH is an important benchmark for analytic systems. The ease of getting to 
the point of running TPCH is important
   
   The **DuckDB Experience**:
   
   1. Download the duckdb binary
   2. Run commands `INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1);`
   
   The **DataFusion Experience**:
   
   1. install Rust,
   2. install datafusion-cli,
   3. give them background knowledge about TPC-H
   4. point them to [bench.sh (~750 
LOC)](https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh)
   
   Ideas to improve the DataFusion experience:
   1. Add built in `dbgen` function to `datafusion-cli` (I will file a ticket)
   2. Add more Pre-built binaries for `datafusion-cli` (in addition to homebrew)
   3. Improve getting started instructions to make the process easier / more 
copy/paste
   
   ## Easier / Better Optimizer Configuration
   
   DuckDB has a dedicated optimizer section, easy to selectively disable 
optimization, see [the duckdb_optimizers() 
function](https://duckdb.org/docs/configuration/pragmas#query-optimization)
   
   DataFusion has many prefixed (`datafusion.optimizer` 
settings)[https://datafusion.apache.org/user-guide/configs.html but it is not 
clear how these can enable/disable optimizer passes.
   
   Ideas to improve the DataFusion experience:
   1. Make it easier to disable just the optimizers from datafusion-cli (I 
think this would require making a better distinction between passes required to 
run like `EnforceDistribution` and optimizations like `PushDownProjection). I 
will file a ticket / find one
   
   
   ## Better explain plan
   
   EXPLAIN visualization. I think DuckDB's is easier for students to look at.
   
   Compare: DuckDB
   ```
   
   ┌─────────────────────────────┐
   │┌───────────────────────────┐│
   ││       Physical Plan       ││
   │└───────────────────────────┘│
   └─────────────────────────────┘
   ...
   ┌─────────────┴─────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │             k             │
   │      length(Referer)      │
   │          Referer          │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │           FILTER          │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │      (Referer != '')      │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │        EC: 99997497       │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │       PARQUET_SCAN        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          Referer          │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │        EC: 99997497       │
   └───────────────────────────┘
   ```
   
   
   DataFusion
   ```sql
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                           |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Limit: skip=0, fetch=25                                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                           |
   |               |   Sort: l DESC NULLS FIRST, fetch=25                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                           |
   |               |     Projection: 
regexp_replace(hits.parquet.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))
 AS k, AVG(character_length(hits.parquet.Referer)) AS l, COUNT(*) AS c, 
MIN(hits.parquet.Referer)                                                       
                                                                                
                                                                                
                                                                                
                                       |
   |               |       Filter: COUNT(*) > Int64(100000)                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                           |
   |               |         Aggregate: 
groupBy=[[regexp_replace(hits.parquet.Referer, 
Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], 
aggr=[[AVG(CAST(character_length(hits.parquet.Referer) AS Float64)), 
COUNT(UInt8(1)) AS COUNT(*), MIN(hits.parquet.Referer)]]                        
                                                                                
                                                                                
                                                                                
                       |
   |               |           Filter: hits.parquet.Referer != Utf8("")         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                           |
   |               |             TableScan: hits.parquet projection=[Referer], 
partial_filters=[hits.parquet.Referer != Utf8("")]                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                            |
   ...
   ```
   
   We have discussed this previously:
   - https://github.com/apache/datafusion/issues/9371
   
   I also left a note with a suggested way to get started implementing this: 
https://github.com/apache/datafusion/issues/9371#issuecomment-2624551049
   
   
   
   ###  `enable_ident_normalization` doesn't work / paper cut
   
   There appears to (still) be some non trivial confusion about identifier 
normalization
   
   For example, 
https://github.com/apache/datafusion/discussions/9399#discussioncomment-8642217
   
   I have corrected the issue 
https://github.com/apache/datafusion/discussions/9399#discussioncomment-12007694
 but clearly there is more confusion
   
   Ideas to improve DataFusion:
   1. Consolidate examples
   2. Improve documentation and
   3. Examples in `datafusion-python`
   
   # Substait
   
   @lmwnshn reported that for their [first 
project(link)](https://15799.courses.cs.cmu.edu/spring2025/project1.html), they 
tried running substrait plans created by Apache Calcite using DataFusion. The 
idea was to
   * Use Calcite as frontend SQL --> substrait
   * Take the optimized queries --> run on the engine
   * turn on / off optimzer passes
   
   The good news is that DataFusion works well compared to the other system 
they tried, DuckDB. DuckDB could run only 1 query and DataFusion could run 15 
of 21 queries (:clappy: for @vbarua, @BlizzaraB and @Lordworms!). However, 6 
queries still failed so they reverted to a SQL based approach using DuckDB (see 
above)
   
   He also pointed out the the official substrait consumer-testing repo simply 
checks that that Isthmus -> DataFusion throws an exception.:
   - 
https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/snapshots/results/integration/tpch
   
   
   ### Learnings
   Apparently DuckDb can read subtrait json, this was easier for students to 
understand than the protobuf format / BLOB format.
   
   I think DataFusion can read this  format too (thanks @Lordworms!)
   - For example: 
https://github.com/apache/datafusion/blob/0edc3d99ad63399696135d3bb3fc387b38803d1f/datafusion/substrait/tests/testdata/tpch_substrait_plans/query_01_plan.json
 (read by 
https://github.com/apache/datafusion/blob/0edc3d99ad63399696135d3bb3fc387b38803d1f/datafusion/substrait/tests/cases/consumer_integration.rs#L35)
   
   However, it is not super clear how to run such plans with `datafusion-cli`
   
   Ideas to improve DataFusion:
   * Consider adding some way to use substrait plans via `datafusion-cli` 
(follow the [duckdb syntax](https://duckdb.org/docs/extensions/substrait.html))
   * It might make this more apparent if `datafusion-cli` allowed running 
substrait plans via the command line (perhaps a table function?)
   * Implement / update the `substrait-consumer` repo to ensure / show 
DataFusion substrait working
   
   
   *********
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [DISCUSSION] Lowering the barrier to new users (Lessons from-799 CMU Optimizer Class) [datafusion]

Reply via email to