alamb commented on issue #5108: URL: https://github.com/apache/arrow-datafusion/issues/5108#issuecomment-1414320116
My measurements actually suggest that DataFusion 17.0.0 is better in this regard than DataFusion 16.0.0.

Using this input file:

```shell
curl -L 'https://drive.google.com/uc?export=download&id=18gv0Yd_a-Zc7CSolol8qeYVAAzSthnSN&confirm=t' > lineitem.parquet
```

Using this program:

```rust
use std::sync::Arc;

use datafusion::{
    error::Result,
    execution::{
        disk_manager::DiskManagerConfig,
        memory_pool::{FairSpillPool, GreedyMemoryPool},
        runtime_env::{RuntimeConfig, RuntimeEnv},
    },
    prelude::{SessionConfig, SessionContext},
};

#[tokio::main(flavor = "multi_thread", worker_threads = 10)]
async fn main() -> Result<()> {
    // 1 GiB memory pool; swap the commented line in to test GreedyMemoryPool
    let runtime_config = RuntimeConfig::new()
        //.with_memory_pool(Arc::new(GreedyMemoryPool::new(1024*1024*1024)))
        .with_memory_pool(Arc::new(FairSpillPool::new(1024*1024*1024)))
        .with_disk_manager(DiskManagerConfig::new_specified(vec!["/tmp/".into()]));

    let runtime = Arc::new(RuntimeEnv::new(runtime_config).unwrap());
    let ctx = SessionContext::with_config_rt(SessionConfig::new(), runtime);

    ctx.register_parquet("lineitem", "/Users/alamb/Downloads/lineitem.parquet", Default::default())
        .await.unwrap();

    let df = ctx.sql("select * from lineitem order by l_shipdate").await.unwrap();

    df.write_parquet("/Users/alamb/Downloads/lineitem_Datafusion.parquet", None)
        .await
        .unwrap();

    Ok(())
}
```

I tested with both DataFusion `16.0.0` / `17.0.0` and FairSpillPool / GreedyMemoryPool

```toml
datafusion = { version = "16.0.0" }
```

or

```toml
datafusion = { version = "17.0.0" }
```

And this:

```rust
.with_memory_pool(Arc::new(FairSpillPool::new(1024*1024*1024)))
```

Or

```rust
.with_memory_pool(Arc::new(GreedyMemoryPool::new(1024*1024*1024)))
```

## DataFusion 16.0.0 with FairSpillPool:

```
     Running `/Users/alamb/Software/target-df/release/rust_arrow_playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError(ArrowError("underlying Arrow error: External error: Arrow error: External error: Resources exhausted: Failed to allocate additional 1419488 bytes for RepartitionExec[14] with 2837440 bytes already allocated - maximum available is 0"))', src/main.rs:26:6
stack backtrace:
```

## DataFusion 16.0.0 and GreedyMemoryPool

```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError(ArrowError("underlying Arrow error: External error: Arrow error: External error: Resources exhausted: Failed to allocate additional 1419168 bytes for RepartitionExec[4] with 0 bytes already allocated - maximum available is 552160"))', src/main.rs:26:6
```

## DataFusion `17.0.0` and `FairSpillPool`

I got:

The program completed successfully 🎉

## DataFusion `17.0.0` and GreedyMemoryPool

I got:

```
warning: `rust_arrow_playground` (bin "rust_arrow_playground") generated 1 warning
    Finished release [optimized] target(s) in 3m 35s
     Running `/Users/alamb/Software/target-df/release/rust_arrow_playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError(ArrowError("underlying Arrow error: External error: Arrow error: External error: Resources exhausted: Failed to allocate additional 1419168 bytes for RepartitionExec[4] with 0 bytes already allocated - maximum available is 552160"))', src/main.rs:26:6
stack backtrace:
```
