Re: [PR] chore: add docs for how to use TrackConsumersPool [datafusion]

via GitHub Fri, 15 Aug 2025 22:38:33 -0700


wiedld commented on code in PR #17213:
URL: https://github.com/apache/datafusion/pull/17213#discussion_r2280255815



##########
datafusion/execution/src/memory_pool/pool.rs:
##########
@@ -295,9 +295,228 @@ impl TrackedConsumer {
 ///
 /// By tracking memory reservations more carefully this pool
 /// can provide better error messages on the largest memory users
+/// when memory allocation fails.
 ///
 /// Tracking is per hashed [`MemoryConsumer`], not per [`MemoryReservation`].
 /// The same consumer can have multiple reservations.
+///
+/// # Automatic Usage with RuntimeEnvBuilder
+///
+/// The easiest way to use `TrackConsumersPool` is through 
[`RuntimeEnvBuilder::with_memory_limit()`](crate::runtime_env::RuntimeEnvBuilder::with_memory_limit),
+/// which automatically creates a `TrackConsumersPool` wrapping a 
[`GreedyMemoryPool`] with tracking
+/// for the top 5 memory consumers:
+///
+/// ```no_run
+/// # use datafusion_execution::runtime_env::RuntimeEnvBuilder;
+/// # use datafusion_execution::config::SessionConfig;
+/// # use std::sync::Arc;
+/// # async fn example() -> datafusion_common::Result<()> {
+/// // Create a runtime with 20MB memory limit and consumer tracking
+/// let runtime = RuntimeEnvBuilder::new()
+///     .with_memory_limit(20_000_000, 1.0)  // 20MB, 100% utilization
+///     .build_arc()?;
+///
+/// let config = SessionConfig::new();
+///
+/// // Note: In real usage, you would use datafusion::prelude::SessionContext
+/// // let ctx = SessionContext::new_with_config_rt(config, runtime);
+///
+/// // Register your table
+/// // ctx.register_table("t", table)?;
+///
+/// // Run a memory-intensive query
+/// let query = "
+///     COPY (select * from t)
+///     TO '/tmp/output.parquet'
+///     STORED AS PARQUET OPTIONS (compression 'uncompressed')";
+///
+/// // If memory is exhausted, you'll get detailed error messages like:
+/// // "Additional allocation failed with top memory consumers (across 
reservations) as:
+/// //   ParquetSink(ArrowColumnWriter)#123(can spill: false) consumed 15.2 MB,
+/// //   HashJoin#456(can spill: false) consumed 3.1 MB,
+/// //   Sort#789(can spill: true) consumed 1.8 MB,
+/// //   Aggregation#101(can spill: false) consumed 892.0 KB,
+/// //   Filter#202(can spill: false) consumed 156.0 KB.
+/// // Error: Failed to allocate additional 2.5 MB..."
+///
+/// // let result = ctx.sql(query).await?.collect().await;
+/// # Ok(())
+/// # }
+/// ```
+///
+/// # How to use in ExecutionPlan implementations
+///
+/// When implementing custom ExecutionPlans that need to use significant 
memory, you should
+/// use the memory pool to track and limit memory usage:

Review Comment:
   I wasn't sure about this section. It's explaining how the 
`MemoryReservation` should be used; which applies to any `MemoryPool`.
   
   It's just that for our specific memory pool implementation, 
`TrackConsumersPool`, it will be aggregating the reservations per memory 
consumer. So it felt like added this part helped draw the pieces together for 
other contributors, maybe?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: add docs for how to use TrackConsumersPool [datafusion]

Reply via email to