This is an automated email from the ASF dual-hosted git repository. xudong963 pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:

     new 40015a8345 Docs: Update the crate configuration / build settings page (#17038)
40015a8345 is described below

commit 40015a8345c6c20f439cb3829897d20ebce70cc9
Author: Andrew Lamb <and...@nerdnetworks.org>
AuthorDate: Tue Aug 5 08:06:13 2025 -0400

    Docs: Update the crate configuration / build settings page (#17038)

    * Docs: Update the crate configuration / build settings page

    * Update docs/source/user-guide/crate-configuration.md

    Co-authored-by: Oleks V <comph...@users.noreply.github.com>

    ---------

    Co-authored-by: Oleks V <comph...@users.noreply.github.com>
---
 docs/source/user-guide/crate-configuration.md | 78 ++++++++++++++++++---------
 1 file changed, 54 insertions(+), 24 deletions(-)

diff --git a/docs/source/user-guide/crate-configuration.md b/docs/source/user-guide/crate-configuration.md
index f4a1910f5f..eecf7f5bde 100644
--- a/docs/source/user-guide/crate-configuration.md
+++ b/docs/source/user-guide/crate-configuration.md
@@ -19,18 +19,19 @@
 
 # Crate Configuration
 
-This section contains information on how to configure DataFusion in your Rust
-project. See the [Configuration Settings] section for a list of options that
-control DataFusion's behavior.
+This section contains information on how to configure builds of DataFusion in
+your Rust project. The [Configuration Settings] section lists options that
+control additional aspects of DataFusion's runtime behavior.
 
 [configuration settings]: configs.md
 
-## Add latest non published DataFusion dependency
+## Using the nightly DataFusion builds
 
 DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process)
 
-If you would like to test out DataFusion changes which are merged but not yet
-published, Cargo supports adding dependency directly to GitHub branch:
+If you would like to use or test versions of the DataFusion code which are
+merged but not yet published, you can use Cargo's [support for adding
+dependencies] directly to a GitHub branch:
 
 ```toml
 datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
@@ -50,22 +51,58 @@ datafusion = { git = "https://github.com/apache/datafusion", branch = "main", de
 
 More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies)
 
-## Optimized Configuration
+## Optimizing Builds
 
-For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
-worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
+Here are several suggestions to get the Rust compiler to produce faster code when
+compiling DataFusion. Note that these changes may increase compile time and
+binary size.
 
-```toml
-[dependencies]
-datafusion = { version = "22.0" }
-tokio = { version = "^1.0", features = ["rt-multi-thread"] }
-snmalloc-rs = "0.3"
+### Generate Code with CPU Specific Instructions
+
+By default, the Rust compiler produces code that runs on a wide range of CPUs,
+but may not take advantage of all the features of your specific CPU (such as
+certain [SIMD instructions]). This is especially true for x86_64 CPUs, where the
+default target is `x86_64-unknown-linux-gnu`, which only guarantees support for
+the `SSE2` instruction set. DataFusion can benefit from the more advanced
+instructions in the `AVX2` and `AVX512` instruction sets to speed up operations
+like filtering, aggregation, and joins. To tell the Rust compiler to use these
+instructions, set the `RUSTFLAGS` environment variable to specify a more specific target CPU.
 
+We recommend setting `target-cpu` to at least `avx2`, or preferably to
+`native` (whatever the current CPU is). For example, to build and run DataFusion
+with optimizations for your current CPU:
+
+```shell
+RUSTFLAGS='-C target-cpu=native' cargo run --release
+```
+
+[simd instructions]: https://en.wikipedia.org/wiki/SIMD
+
+### Enable Link Time Optimization / Single Codegen Unit
+
+You can potentially improve your performance by compiling DataFusion into a
+single codegen unit which gives the Rust compiler more opportunity to optimize
+across crate boundaries. To do so, modify your project's `Cargo.toml` to include
+`lto = true` and `codegen-units = 1` as shown below. Beware that using a single
+codegen unit _significantly_ increases `--release` build times.
+
+```toml
 [profile.release]
 lto = true
 codegen-units = 1
 ```
 
+### Alternate Allocator: `snmalloc`
+
+You can also use the [snmalloc-rs](https://crates.io/crates/snmalloc-rs) crate as
+the memory allocator for DataFusion to improve performance. To do so, add the
+dependency to your `Cargo.toml` as shown below.
+
+```toml
+[dependencies]
+snmalloc-rs = "0.3"
+```
+
 Then, in `main.rs.` update the memory allocator with the below after your imports:
 
 <!-- Note can't include snmalloc-rs in a runnable example, because it takes over the global allocator -->
@@ -82,17 +119,10 @@ async fn main() -> datafusion::error::Result<()> {
 }
 ```
 
-Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
-with `native` or at least `avx2`.
-
-```shell
-RUSTFLAGS='-C target-cpu=native' cargo run --release
-```
-
-## Enable backtraces
+## Enable Backtraces
 
-By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error,
-like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file:
+By default, DataFusion returns errors as a plain text message. You can enable more verbose details about the error,
+such as backtraces, by enabling the `backtrace` feature in your `Cargo.toml` file like this:
 
 ```toml
 datafusion = { version = "31.0.0", features = ["backtrace"]}
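
To check whether a `target-cpu` setting like the one in the diff above took effect, a small program can compare the CPU features the compiler assumed at build time with those the running CPU reports. The sketch below is not part of the commit. It uses only the Rust standard library (`cfg!(target_feature = ...)` and `std::is_x86_feature_detected!`), and the function names `compile_time_features` and `runtime_features` are hypothetical. Note that `avx2` is the name of a target feature: rustc enables it with `-C target-feature=+avx2`, while `target-cpu` takes CPU names such as `native`.

```rust
/// Features the compiler was allowed to assume when this binary was built.
/// (Hypothetical helper, for illustration only.)
fn compile_time_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "avx512f") {
        enabled.push("avx512f");
    }
    enabled
}

/// Features reported by the CPU this binary is currently running on.
#[cfg(target_arch = "x86_64")]
fn runtime_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if std::is_x86_feature_detected!("sse2") {
        found.push("sse2");
    }
    if std::is_x86_feature_detected!("avx2") {
        found.push("avx2");
    }
    if std::is_x86_feature_detected!("avx512f") {
        found.push("avx512f");
    }
    found
}

/// On other architectures this sketch reports nothing.
#[cfg(not(target_arch = "x86_64"))]
fn runtime_features() -> Vec<&'static str> {
    Vec::new()
}

fn main() {
    println!("compiled with: {:?}", compile_time_features());
    println!("CPU supports:  {:?}", runtime_features());
}
```

Built with a plain `cargo run --release` on x86_64, the first list would typically contain only `sse2`; with `RUSTFLAGS='-C target-cpu=native'` it should grow to match what the build machine's CPU supports.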