alamb commented on code in PR #21105:
URL: https://github.com/apache/datafusion/pull/21105#discussion_r3002603461


##########
dev/wiki/apache-datafusion.wikitext:
##########
@@ -0,0 +1,113 @@
+<!--
+Draft Wikipedia article.
+-->
+
+{{Short description|Open-source query engine}}
+{{Draft topics|technology|software}}
+{{Infobox software
+| name = Apache DataFusion
+| developer = [[Apache Software Foundation]]
+| programming language = [[Rust (programming language)|Rust]]
+| genre = Query engine
+| license = [[Apache License]]
+| website = {{URL|https://datafusion.apache.org/}}
+}}
+
+'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable 
analytical query engine written in [[Rust (programming language)|Rust]], built 
on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite 
journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres 
|first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet 
Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow 
DataFusion: A Fast, Embeddable, Modular Analytic Query Engine 
|journal=Proceedings of the 2024 International Conference on Management of Data 
|year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite 
web |title=Introduction 
|url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache 
DataFusion |publisher=Apache Software Foundation 
|access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces 
for analytical query execution and is designed to be used as a library by 
developers building databases, query engines, and analytical tools, rather than as a 
standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> 
The project originated in 2017, was donated to the [[Apache Arrow]] project in 
2019, and became a top-level project of the [[Apache Software Foundation]] in 
2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native 
Query Engine for Apache Arrow 
|url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ 
|website=Apache DataFusion Blog |publisher=Apache Software Foundation 
|date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web 
|title=Apache Software Foundation Announces New Top-Level Project Apache 
DataFusion 
|url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
 |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 
|access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one 
million monthly downloads on crates.io.<ref name="crates-io">{{cite web 
|title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io 
|access-date=2026-03-26}}</ref>
+
+== History ==
+
+DataFusion was originally authored by Andy Grove, starting in 2017. It was donated 
to the Apache Arrow Project in February 2019.<ref name="donation-post" /> In 
2024, a paper describing DataFusion was accepted to the industry track of the 
[[ACM SIGMOD]] conference.<ref name="sigmod-accepted">{{cite web |title=SIGMOD 
2024 Industrial Track: Accepted Papers 
|url=https://2024.sigmod.org/industrial-list.shtml |website=SIGMOD 2024 
|access-date=2026-03-22}}</ref><ref name="sigmod-paper" /> In April 2024, the 
project graduated from Apache Arrow and became a top-level Apache project.<ref 
name="asf-tlp" />
+
+== Features ==
+
+DataFusion is a fast, extensible query engine for building data systems. It 
provides a SQL interface and a DataFrame API for constructing queries 
programmatically, a [[query plan|query planner]] and rule-based [[query 
optimization|optimizer]], and a multithreaded vectorized execution engine that 
processes data in columnar batches rather than row by row.<ref 
name="sigmod-paper" /><ref name="intro-docs" />
+
+The engine reads common analytical file formats natively, including [[Apache 
Parquet]], [[comma-separated values|CSV]], [[JSON]], [[Apache Avro|Avro]], and 
Arrow IPC, and uses [[Apache Arrow]]'s columnar memory format throughout 
execution, avoiding [[serialization]] overhead between stages.<ref 
name="sigmod-paper" />
+
+DataFusion is designed for in-process embedding: it runs within the host 
application's process rather than as a separate server, using threads for 
parallel query execution. Its extension points allow downstream systems to add 
[[user-defined function|user-defined functions]], custom data sources, custom 
query languages, and new optimizer rules, enabling developers to build 
specialized database systems on top of DataFusion's planning and execution 
components without reimplementing them.<ref name="sigmod-paper" /><ref 
name="intro-docs" />
+
+== Comparison with related systems ==
+
+DataFusion is frequently compared with other columnar analytical systems 
including [[DuckDB]], [[Polars (software)|Polars]], and Velox, but these 
systems differ significantly in scope and intended use.<ref 
name="composable-dbms">{{cite journal |last1=Pedreira |first1=Pedro 
|last2=Erling |first2=Orri |last3=Mühleisen |first3=Hannes |last4=Muñoz 
|first4=Ruben |last5=Khaled |first5=Wael |last6=Dürsch |first6=Peter |title=The 
Composable Data Management System Manifesto |journal=Proceedings of the VLDB 
Endowment |volume=16 |issue=10 |year=2023 |doi=10.14778/3603581.3603604}}</ref>
+
+=== [[DuckDB]] ===
+
+[[DuckDB]] is an in-process [[online analytical processing|OLAP]] database for 
direct use by end users, with its own storage format and catalog.<ref 
name="duckdb-official">{{cite web |title=DuckDB |url=https://duckdb.org/ 
|website=DuckDB |access-date=2026-03-22}}</ref> DataFusion is a library for 
building such systems, providing query planning and execution components that 
other software can embed without a bundled persistent storage format.<ref 
name="bauplan">{{cite web |title=Duck Hunt: Moving Bauplan from DuckDB to 
DataFusion 
|url=https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion
 |website=Bauplan |date=2025-11-05 |access-date=2026-03-22}}</ref>
+
+=== [[Polars (software)|Polars]] ===
+
+[[Polars (software)|Polars]] is also written in [[Rust (programming 
language)|Rust]] and uses the [[Apache Arrow]] memory model, but is designed as 
a self-contained DataFrame library for data manipulation rather than an 
embeddable query engine for building other systems.<ref 
name="polars-official">{{cite web |title=Polars |url=https://pola.rs/ 
|website=Polars |access-date=2026-03-22}}</ref><ref name="faq">{{cite web 
|title=Frequently Asked Questions 
|url=https://datafusion.apache.org/user-guide/faq.html |website=Apache 
DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref>
+
+=== [[Apache Spark]] ===
+
+[[Apache Spark]] is a distributed analytics framework for processing data at 
cluster scale.<ref name="spark-sql">{{cite web |title=Spark SQL & DataFrames 
|url=https://spark.apache.org/sql/ |website=Apache Spark 
|access-date=2026-03-22}}</ref> DataFusion executes queries within a single 
process and is aimed at building embedded analytics systems rather than 
distributed workloads.<ref name="sigmod-paper" /> Apache projects that use 
DataFusion to accelerate Spark include Apache DataFusion Comet, a native 
execution plugin for Spark's [[Java virtual machine|JVM]]-based SQL execution 
engine,<ref name="comet-donation">{{cite web |title=Announcing Apache Arrow 
DataFusion Comet |url=https://arrow.apache.org/blog/2024/03/06/comet-donation/ 
|website=Apache Arrow Blog |publisher=Apache Software Foundation 
|date=2024-03-06 |access-date=2026-03-22}}</ref> and [https://auron.apache.org/ 
Apache Auron], a Spark accelerator that combines the Apache Arrow-DataFusion 
library with the Spark distributed 
computing framework.<ref name="auron-intro">{{cite web |title=Introduction 
|url=https://auron.apache.org/introduction.html |website=Apache Auron 
|publisher=Apache Software Foundation |access-date=2026-03-23}}</ref>
+
+=== Velox ===
+
+[https://velox-lib.io/ Velox] is an execution engine library developed at 
[[Meta Platforms|Meta]].<ref name="velox-vldb">{{cite journal |last1=Pedreira 
|first1=Pedro |last2=Tan |first2=Wei |last3=Narayanan |first3=Deepak 
|last4=Chattopadhyay |first4=Bikramjit |last5=Erling |first5=Orri |last6=Melnik 
|first6=Sergey |last7=Bhagwan |first7=Ranjita |last8=Dumoulin |first8=Franck 
|title=Velox: Meta's Unified Execution Engine |journal=Proceedings of the VLDB 
Endowment |volume=15 |issue=12 |year=2022 |doi=10.14778/3554821.3554829}}</ref> 
Unlike DataFusion, Velox does not include a SQL frontend or query planning 
framework; it takes an already-optimized query plan as input and handles only 
execution.<ref name="velox-docs">{{cite web |title=Velox in 10 Minutes 
|url=https://facebookincubator.github.io/velox/velox-in-10-min.html 
|website=Velox |access-date=2026-03-22}}</ref>
+
+== Adoption and reception ==
+
+DataFusion has been adopted across a range of analytics and database products. 
[[Cloudflare]] used DataFusion in its Log Explorer product to execute SQL 
queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web 
|title=Cloudflare Log Explorer is now GA, providing native observability and 
forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The 
Cloudflare Blog |publisher=Cloudflare |date=2025-06-18 
|access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight 
Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web 
|title=Announcements: July 2025 
|url=https://www.palantir.com/docs/foundry/announcements/2025-07 
|website=Palantir Foundry Documentation |publisher=Palantir Technologies 
|date=2025-07-29 |access-date=2026-03-22}}</ref><ref 
name="palantir-2024">{{cite web |title=Announcements: February 2024 
|url=https://www.palantir.com/docs/foundry/announcements/2024-02 
|website=Palantir Foundry Documentation |publisher=Palantir 
Technologies |date=February 2024 |access-date=2026-03-22}}</ref> 
[[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Flight, 
DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web 
|title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to 
build InfluxDB 3.0 
|url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
 |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other 
users described in public sources include EDB Postgres AI,<ref 
name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI 
features into PostgreSQL 
|url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/
 |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> 
Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's 
semantic layer 
|url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer 
|website=Cube |date=2024-06-03 
|access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web 
|title=How we use Apache DataFusion at Spice AI 
|url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai 
|website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic 
Logfire,<ref name="logfire">{{cite web |title=We're changing database 
|url=https://github.com/pydantic/logfire/issues/408 |website=GitHub 
|date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref 
name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for 
connecting BI tools 
|url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data 
|date=2023-09-26 |access-date=2026-03-22}}</ref>

Review Comment:
   > Will work on that.
   
   that is the ideal answer!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

