milenkovicm commented on code in PR #6: URL: https://github.com/apache/datafusion-site/pull/6#discussion_r1670716169
########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,248 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the major +improvements since we [released DataFusion 34.0.0] (spoiler alert there are many) +and a preview of where the community plans to focus in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. While [DataFusion’s primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://arrow.apache.org/datafusion/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + + +DataFusion is very much a community endeavor. Our core thesis is that as a +community we can build much more advanced technology than any of us as +individuals or companies could alone. In the last 6 months between `26.0.0` and +`34.0.0`, community growth has been strong. We accepted and reviewed over a +thousand PRs from 124 different committers, created over 650 issues and closed 517 +of them. +You can find a list of all changes in the detailed [CHANGELOG]. + +<!-- +$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l + 1009 + +$ git shortlog -sn 26.0.0..34.0.0 . | wc -l + 124 + +https://crates.io/crates/datafusion/26.0.0 +DataFusion 26 released June 7, 2023 + +https://crates.io/crates/datafusion/34.0.0 +DataFusion 34 released Dec 17, 2023 + +Issues created in this time: 214 open, 437 closed +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17 + +Issues closes: 517 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+ + +PRs merged in this time 908 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17 +--> + + + +# Community Growth 📈 + +* DataFusion is now a new top level project (TODO LINK) +* DataFusion Comet has been donated and is nearing its first release (TODO LINK) +* We published a paper in SIGMOD [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine](https://dl.acm.org/doi/10.1145/3626246.3653368) +* Many interesting posts: TODO find some +* DataFusion mentioned as a candidate to be the POSIX of databases in ["What Goes Around Comes Around... And Around...](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf) +* ["Why you should keep an eye on Apache DataFusion and its community."](https://www.cpard.xyz/posts/datafusion/) + + +Meetups: +Austin https://github.com/apache/datafusion/discussions/8522 +San Francisco +Hangzhou +NYC https://github.com/apache/datafusion/discussions/11213 +Belgrade + +[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md + +# Improved Performance 🚀 + +Performance is a key feature of DataFusion, DataFusion is + +TODO: Planning speed improvements: + +Implement specialized group values for single Uft8/LargeUtf8/Binary/LargeBinary column #8827 + + +https://github.com/apache/datafusion/pull/8827 + +## In progress +StringView (TODO get links) + + + +# Improved quality + +TODO mention the SQLancer + + +# New Features ✨ + +Improvements: unnest, null handling improvements for lead/lag + +## + +## TreeNode API + + +## Other Features + +Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462 + +# Building Systems is Easier with DataFusion 🛠️ + +Large scale "extract scalar functions from the core": all scalar functions are now exactly the same API as built in https://github.com/apache/datafusion/issues/9285 + +We are close to completing the same treatment for aggregates and then window functions + +SQL to String (both exprs and plans): https://github.com/apache/datafusion/issues/9494 + +WASM builds: https://github.com/apache/datafusion/discussions/9834 + +## Documentation +It is easier than ever to get started using DataFusion with the +new [Library Users Guide] as well as significantly improved the [API documentation]. + +[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html +[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html + +## User Defined SQL Extensiions + +## Support for `CREATE FUNCTION` +https://github.com/apache/datafusion/pull/9333 + +Let's you build systems that support user defined functions + +Huge thanks to [@milenkovicm](https://github.com/milenkovicm) + +```sql +CREATE FUNCTION my_func(DOUBLE, DOUBLE) + RETURNS DOUBLE + RETURN $1 $3 +"#; +``` + +And + +```sql +CREATE FUNCTION iris(FLOAT[]) +RETURNS FLOAT[] +LANGUAGE TORCH +AS 'models:/iris@champion' +``` + +```sql +CREATE FUNCTION func(FLOAT[]) +RETURNS FLOAT[] +LANGUAGE WASM +AS 'func.wasm' +``` + +BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 ) Review Comment: What would you need for this @alamb ? I have examples for both on my GitHub, would them suit ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org