alamb opened a new issue, #21076:
URL: https://github.com/apache/datafusion/issues/21076
### Is your feature request related to a problem or challenge?
A Wikipedia article would be useful for Apache DataFusion to make the
project easier to discover, easier to explain, and easier to cite from a
neutral source.
The main benefit is not “marketing copy”; it is legitimacy and
referenceability.
This matters even more now that Wikipedia is a core training corpus for
LLMs and a common source for search engine results.
- It gives newcomers a neutral landing page distinct from
https://datafusion.apache.org.
- It makes the project easier for journalists, analysts, conference
organizers, students, and procurement people to cite quickly.
- It strengthens search visibility and entity recognition. In practice,
Wikipedia pages often feed search summaries, knowledge panels, mirrors, and LLM
retrieval.
- It signals that the project is notable beyond its own community because
the article must be supported by independent reliable sources.
- It gives a durable place to document ecosystem facts like history,
governance, and adoption that do not fit cleanly into product docs.
### Describe the solution you'd like
I would like a neutral Wikipedia page for Apache DataFusion.
Here are some similar pages:
- https://en.wikipedia.org/wiki/DuckDB
- https://en.wikipedia.org/wiki/Apache_Spark
- https://en.wikipedia.org/wiki/Polars_(software)
DuckDB’s page shows the pattern clearly: a short neutral definition,
history, architecture, language bindings, commercial use, and
foundation/governance in one place, with references to papers and third-party
coverage.
### Describe alternatives you've considered
I think a strong article will include many citations. Here are several I
found with the help of Codex.
Some third-party citations that are probably useful for this article:
- DataFusion has been a standalone Apache top-level project since April 16,
2024, as announced publicly by the Apache Arrow PMC and the ASF (Apache Arrow blog
(https://arrow.apache.org/blog/2024/05/07/datafusion-tlp/), ASF announcement
(https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion)).
SIGMOD 2024 technical paper
- It appears in the SIGMOD 2024 program as an accepted industry-track
paper: SIGMOD accepted papers
(https://2024.sigmod.org/industrial-list.shtml), SIGMOD session listing
(https://2024.sigmod.org/program_sigmod.shtml).
- The DOI is 10.1145/3626246.3653368
(https://dl.acm.org/doi/10.1145/3626246.3653368).
Citations for technical importance
- crates.io: 17,668,287 all-time downloads
(https://crates.io/crates/datafusion)
- CRN: “The 10 Coolest Open-Source Software Tools Of 2024”
(https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3)
It explicitly includes Apache DataFusion and describes it as a fast
extensible query engine, notes
its Rust/Arrow basis, and mentions its 2024 top-level-project milestone.
This is one of the strongest sources for general
notability.
- Datanami: “How the FDAP Stack Gives InfluxDB 3.0 Real-Time Speed,
Efficiency”
(https://www.datanami.com/2024/03/15/how-the-fdap-stack-gives-influxdb-3-0-real-time-speed-efficiency/)
This quotes Paul Dix saying DataFusion had matured substantially and had
best-in-class performance on a number of queries versus other
columnar query engines. It is not a ranking article, but it is
meaningful third-party validation of technical importance.
Third-party citations for usage in products
- SiliconANGLE: “Enterprise DB begins rolling AI features into PostgreSQL”
(https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/)
Independent coverage stating EDB combined Apache DataFusion, Arrow, and
Delta Lake in its analytics/lakehouse capability.
- Spice AI: “How we use Apache DataFusion at Spice AI”
(https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai)
This says Spice uses DataFusion as its SQL query engine and extends it
with custom TableProviders, optimizer rules, and UDFs for
federated SQL workloads.
- Cloudflare Log Explorer GA announcement
(https://blog.cloudflare.com/logexplorer-ga/) from June 10, 2025.
Queriers fetch matching files from R2 and “process SQL queries using
Apache DataFusion.”
- InfluxData: “Flight, DataFusion, Arrow, and Parquet: Using the FDAP
Architecture to build InfluxDB 3.0”
(https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/)
Clearly states InfluxDB 3.0 chose DataFusion as its query engine
foundation and explains why.
- Pydantic Logfire issue: “We’re changing database”
(https://github.com/pydantic/logfire/issues/408)
Usable as a primary source for adoption only. It says Logfire is moving
from Timescale to a custom database built on DataFusion and
gives reasons.
- Palantir Foundry announcements for July 2025
(https://www.palantir.com/docs/foundry/announcements/2025-07)
This says lightweight pipelines are “powered by DataFusion.”
- Cube: “Query pushdown in Cube’s semantic layer”
(https://cube.dev/blog/query-push-down-in-cubes-semantic-layer)
Good third-party primary source for “used in production by Cube” and for
describing how Cube uses DataFusion internally.
- Kamu: “100X faster ingestion, and FlightSQL support for connecting BI
tools” (https://www.kamu.dev/blog/2023-09-datafusion-flightsql/)
Good third-party primary source for ecosystem adoption. It explicitly
says Kamu added support for Apache DataFusion and reports
performance claims in its own product.
- LanceDB: “Columnar File Readers in Depth: APIs and Fusion”
(https://lancedb.com/blog/columnar-file-readers-in-depth-apis-and-fusion/)
Usable for ecosystem context. It says Lance uses DataFusion extensively
and demonstrates integration with it.
- Bauplan Labs: “Duck Hunt: moving Bauplan from DuckDB to DataFusion”
(https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion)
Bauplan explains the migration as driven by DataFusion’s Arrow-first
architecture, extensibility, and community-driven development.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.