gruuya opened a new issue, #7149:
URL: https://github.com/apache/arrow-datafusion/issues/7149

   ### Describe the bug
   
   The Top-K query optimization appears to take effect only when a custom allocator (`mimalloc`/`snmalloc`) is in use, which in principle shouldn't be the case.
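
For context, here is a minimal sketch of what a Top-K strategy buys in memory terms: it keeps only the best `k` values seen so far instead of materializing and sorting the whole input. This is an illustration over plain `i64` values using the standard library's `BinaryHeap`, not DataFusion's implementation; the function name `top_k_desc` is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Returns the `k` largest values in descending order while holding
/// at most `k` values in memory at any time.
/// Hypothetical helper for illustration only; not a DataFusion API.
fn top_k_desc<I>(values: I, k: usize) -> Vec<i64>
where
    I: IntoIterator<Item = i64>,
{
    if k == 0 {
        return Vec::new();
    }
    // `Reverse` turns the max-heap into a min-heap, so the smallest
    // retained value is always at the top and cheap to evict.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    for v in values {
        if heap.len() < k {
            heap.push(Reverse(v));
        } else if let Some(mut smallest) = heap.peek_mut() {
            // Replace the smallest retained value only if `v` beats it.
            if v > smallest.0 {
                *smallest = Reverse(v);
            }
        }
    }
    // Ascending order of `Reverse<i64>` is descending order of the values.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(v)| v)
        .collect()
}
```

With a strategy like this, a query of the form `ORDER BY ... DESC LIMIT 1` should need memory proportional to `k` rows rather than to the whole table, whichever allocator is linked in.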
   
   ### To Reproduce
   
   1. Grab and build bytehound: https://github.com/koute/bytehound
   2. Prepare some large-ish Parquet file, e.g. `https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet`:
    ```
    $ du -h ~/supply-chains.parquet 
    146M /home/ubuntu/supply-chains.parquet
    ```
   3. Remove the custom allocator from `datafusion-cli` and rebuild:
   ```diff
   diff --git a/datafusion-cli/src/main.rs b/datafusion-cli/src/main.rs
   index aea499d60..a92957730 100644
   --- a/datafusion-cli/src/main.rs
   +++ b/datafusion-cli/src/main.rs
   @@ -24,13 +24,13 @@ use datafusion_cli::catalog::DynamicFileCatalog;
    use datafusion_cli::{
        exec, print_format::PrintFormat, print_options::PrintOptions, DATAFUSION_CLI_VERSION,
    };
   -use mimalloc::MiMalloc;
   +// use mimalloc::MiMalloc;
    use std::env;
    use std::path::Path;
    use std::sync::Arc;
   
   -#[global_allocator]
   -static GLOBAL: MiMalloc = MiMalloc;
   +// #[global_allocator]
   +// static GLOBAL: MiMalloc = MiMalloc;
   
    #[derive(Debug, Parser, PartialEq)]
    #[clap(author, version, about, long_about= None)]
   ```
   4. Profile a Top-K query
   ```sql
   $ LD_PRELOAD=~/bytehound/target/release/libbytehound.so ./target/debug/datafusion-cli
   DataFusion CLI v28.0.0
   ❯ CREATE EXTERNAL TABLE supply_chains STORED AS PARQUET LOCATION '/home/ubuntu/supply-chains.parquet';
   0 rows in set. Query took 0.445 seconds.
   ❯ SELECT * FROM supply_chains ORDER BY flow_id DESC LIMIT 1;
   ...
   ```
   
   The profile I get is:
   
![Memory profile without the custom allocator](https://github.com/apache/arrow-datafusion/assets/45558892/98e33f30-a2a1-4100-bbb1-caf56a1ba6b5)
   
   
   ### Expected behavior
   
   With the custom allocator present, the memory profile I see looks like this:
   
![Memory profile with the custom allocator](https://github.com/apache/arrow-datafusion/assets/45558892/92aa81d7-5784-4d13-a3f0-818d9a3d2176)
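
As a possible cross-check that does not depend on bytehound, peak heap usage can also be recorded from inside the process by wrapping the system allocator. The sketch below is not part of the original report or of DataFusion; the names `PeakTracking` and `peak_bytes` are hypothetical. It uses only `std::alloc`, and it occupies the same `#[global_allocator]` slot that `MiMalloc` occupies in `datafusion-cli`, so it measures the system-allocator case.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Bytes currently allocated and the highest value observed so far.
static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper that forwards to the system allocator and
/// records the peak number of live bytes.
struct PeakTracking;

unsafe impl GlobalAlloc for PeakTracking {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakTracking = PeakTracking;

/// Peak number of live heap bytes since process start.
fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

Printing `peak_bytes()` after the query finishes gives a single number to compare across runs; it counts requested bytes only, so it will not match bytehound's figures exactly.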
   
   
   ### Additional context
   
   _No response_

