Re: [I] External sorting not working for (maybe only for string columns??) [datafusion]

via GitHub Fri, 07 Feb 2025 23:05:36 -0800


zhuqi-lucas commented on issue #12136:
URL: https://github.com/apache/datafusion/issues/12136#issuecomment-2644553511


   > > [@2010YOUY01](https://github.com/2010YOUY01) Your solution does not work for me. I did play around with `sort_spill_reservation_bytes` when I originally encountered this but couldn't find a way to get it to work. [Here](https://github.com/apache/datafusion/commit/6acc39e6e6edae0c33f242d80754ee3b5690614c) is a reproducer that uses a 100MB memory budget and fails for all values of `sort_spill_reservation_bytes` between 1 and 100 MB. It might be that I am configuring things or using something improperly however.
   >
   > This reproducer can also run successfully by disabling string view via `session_config.options_mut().execution.parquet.schema_force_view_types = false;` and set `sort_spill_reservation_bytes` to 16MB.
   
   
   
   It still fails for me with those settings. Here is the code to reproduce it:
   
   ```rust
   use arrow::array::{RecordBatch, StringBuilder};
   use arrow_schema::{DataType, Field, Schema};
   use datafusion::execution::disk_manager::DiskManagerConfig;
   use datafusion::execution::memory_pool::FairSpillPool;
   use datafusion::execution::runtime_env::RuntimeEnvBuilder;
   use datafusion::parquet::arrow::ArrowWriter;
   use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};
   use futures::TryStreamExt;
   use std::sync::Arc;
   
   #[tokio::main(flavor = "multi_thread", worker_threads = 1)]
   pub async fn main() {
       build_parquet();
   
       let env = RuntimeEnvBuilder::new()
           .with_disk_manager(DiskManagerConfig::default())
           .with_memory_pool(Arc::new(FairSpillPool::new(100 * 1024 * 1024)))
           .build_arc()
           .unwrap();
   
       let mut config = SessionConfig::new().with_sort_spill_reservation_bytes(16 * 1024 * 1024);
       config.options_mut().execution.parquet.schema_force_view_types = false;
   
       let ctx = SessionContext::new_with_config_rt(config, env);
   
       ctx.register_parquet(
           "big_strings",
           "/tmp/big_strings.parquet",
           ParquetReadOptions::default(),
       )
       .await
       .unwrap();
   
       let sql = "SELECT * FROM big_strings ORDER BY strings";
       println!("Sorting strings");
       ctx.sql(sql)
           .await
           .unwrap()
           .execute_stream()
           .await
           .unwrap()
           .try_for_each(|_| std::future::ready(Ok(())))
           .await
           .unwrap();
   }
   
   fn build_parquet() {
       if std::fs::File::open("/tmp/big_strings.parquet").is_ok() {
           println!("Using existing file at /tmp/big_strings.parquet");
           return;
       }
       println!("Generating test file at /tmp/big_strings.parquet");
       let file = std::fs::File::create("/tmp/big_strings.parquet").unwrap();
       let schema = Arc::new(Schema::new(vec![Field::new(
           "strings",
           DataType::Utf8,
           false,
       )]));
       let mut writer = ArrowWriter::try_new(file, schema.clone(), None).unwrap();
   
       for batch_idx in 0..100 {
           println!("Generating batch {} of 100", batch_idx);
           let mut string_array_builder =
               StringBuilder::with_capacity(1024 * 1024, 1024 * 1024 * 3 * 14);
           for i in 0..(1024 * 1024) {
               string_array_builder
                   .append_value(format!("string-{}string-{}string-{}", i, i, i));
           }
           let array = Arc::new(string_array_builder.finish());
           let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
           writer.write(&batch).unwrap();
       }
       writer.close().unwrap();
   }
   ```
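
For a sense of scale, the string payload this reproducer writes can be estimated from the `format!` pattern alone. The sketch below is a back-of-the-envelope calculation, not part of the reproducer. It counts only the raw UTF-8 bytes of the values and ignores Arrow offset buffers and Parquet encoding or compression. The helper names `value_len` and `batch_payload_bytes` are made up for this illustration.

```rust
/// Byte length of one value produced by
/// `format!("string-{}string-{}string-{}", i, i, i)`.
fn value_len(i: u64) -> u64 {
    let digits = i.to_string().len() as u64;
    3 * ("string-".len() as u64 + digits)
}

/// Raw string bytes in one batch holding the values for `0..rows`.
/// Offset buffers and Parquet overhead are deliberately not counted.
fn batch_payload_bytes(rows: u64) -> u64 {
    (0..rows).map(value_len).sum()
}

fn main() {
    let rows = 1024 * 1024; // rows per batch, as in the reproducer
    let batches = 100; // number of batches, as in the reproducer
    let pool = 100 * 1024 * 1024u64; // FairSpillPool size from the reproducer

    let per_batch = batch_payload_bytes(rows);
    let total = per_batch * batches;

    println!("string bytes per batch: {per_batch}");
    println!("string bytes in total:  {total}");
    println!("total / memory pool:    {:.1}x", total as f64 / pool as f64);
}
```

Under these assumptions each batch carries 40,706,862 bytes of string data, so the whole file holds about 4.07 GB of values. That is far more than the 100 MiB pool, so the sort cannot complete without spilling.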
   
   
   Also, with StringView disabled, I noticed something new: the sorted data is smaller than the original unsorted data:
   
   ```text
   Before sorting we have 6233344 bytes of unsorted data
   After sorting we now have 4932480 bytes of sorted data
   Spilling
   Before sorting we have 6233344 bytes of unsorted data
   After sorting we now have 4932480 bytes of sorted data
   Spilling
   Before sorting we have 6233344 bytes of unsorted data
   After sorting we now have 4932480 bytes of sorted data
   Spilling
   Before sorting we have 6233344 bytes of unsorted data
   After sorting we now have 4932480 bytes of sorted data
   Spilling
   Before sorting we have 6233344 bytes of unsorted data
   After sorting we now have 4932480 bytes of sorted data
   
   ```
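
To put a number on the shrinkage in the log above, here is a small sketch that computes the difference between the two reported sizes. The function name `shrinkage` is hypothetical, and the snippet only does arithmetic on the logged figures. It does not explain why the sorted batches are smaller.

```rust
/// Returns (bytes saved, percentage saved) for a before/after size pair.
fn shrinkage(before: u64, after: u64) -> (u64, f64) {
    let saved = before - after;
    (saved, saved as f64 * 100.0 / before as f64)
}

fn main() {
    // Sizes copied from the log output above.
    let (saved, pct) = shrinkage(6_233_344, 4_932_480);
    println!("sorted data is {saved} bytes smaller ({pct:.1}% of the unsorted size)");
}
```

Each sorted chunk is 1,300,864 bytes smaller than its unsorted input, which is a reduction of roughly 21%.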


