gruuya opened a new issue, #7149: URL: https://github.com/apache/arrow-datafusion/issues/7149
### Describe the bug

It seems like the Top-K query optimization is somehow conditional on the usage of a custom allocator (`mimalloc`/`snmalloc`), while in principle that shouldn't be the case?

### To Reproduce

1. Grab and build bytehound: https://github.com/koute/bytehound
2. Prepare some large-ish Parquet file, e.g. `https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet`:
   ```
   $ du -h ~/supply-chains.parquet
   146M    /home/ubuntu/supply-chains.parquet
   ```
3. Remove the custom allocator and build
   ```diff
   diff --git a/datafusion-cli/src/main.rs b/datafusion-cli/src/main.rs
   index aea499d60..a92957730 100644
   --- a/datafusion-cli/src/main.rs
   +++ b/datafusion-cli/src/main.rs
   @@ -24,13 +24,13 @@ use datafusion_cli::catalog::DynamicFileCatalog;
    use datafusion_cli::{
        exec, print_format::PrintFormat, print_options::PrintOptions, DATAFUSION_CLI_VERSION,
    };
   -use mimalloc::MiMalloc;
   +// use mimalloc::MiMalloc;
    use std::env;
    use std::path::Path;
    use std::sync::Arc;
    
   -#[global_allocator]
   -static GLOBAL: MiMalloc = MiMalloc;
   +// #[global_allocator]
   +// static GLOBAL: MiMalloc = MiMalloc;
    
    #[derive(Debug, Parser, PartialEq)]
    #[clap(author, version, about, long_about= None)]
   ```
4. Profile a Top-K query
   ```sql
   $ LD_PRELOAD=~/bytehound/target/release/libbytehound.so ./target/debug/datafusion-cli
   DataFusion CLI v28.0.0
   ❯ CREATE EXTERNAL TABLE supply_chains STORED AS PARQUET LOCATION '/home/ubuntu/supply-chains.parquet';
   0 rows in set. Query took 0.445 seconds.
   ❯ SELECT * FROM supply_chains ORDER BY flow_id DESC LIMIT 1;
   ...
   ```

The profile I get is

[image]

### Expected behavior

With the custom allocator present the memory profile I see is like this

[image]

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
