JasonLi-cn commented on issue #2199:
URL:
https://github.com/apache/arrow-datafusion/issues/2199#issuecomment-1263228061
1. binary code
```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::print_batches;
use datafusion::error::Result;
use datafusion::prelude::*;
use datafusion::scheduler::Scheduler;
use futures::{StreamExt, TryStreamExt};
use std::env;

#[tokio::main]
async fn main() -> Result<()> {
    let name = "test_table";
    let mut args = env::args();
    args.next();
    let table_path = args.next().expect("parquet file");
    let sql = &args.next().expect("sql");
    let using_scheduler = args.next().is_some();

    // create local session context
    let config = SessionConfig::new()
        .with_information_schema(true)
        .with_target_partitions(4);
    let context = SessionContext::with_config(config);

    // register parquet file with the execution context
    context
        .register_parquet(name, &table_path, ParquetReadOptions::default())
        .await?;

    let task = context.task_ctx();
    let query = context.sql(sql).await.unwrap();
    let plan = query.create_physical_plan().await.unwrap();

    println!("Start query, using scheduler {}", using_scheduler);
    let now = std::time::Instant::now();
    let results = if using_scheduler {
        let scheduler = Scheduler::new(4);
        let stream = scheduler.schedule(plan, task).unwrap().stream();
        let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
        results
    } else {
        context.sql(sql).await?.collect().await?
    };
    let elapsed = now.elapsed().as_millis();
    println!("End query, elapsed {} ms", elapsed);

    print_batches(&results)?;
    Ok(())
}

/// Execute sql
async fn plan_and_collect(
    context: &SessionContext,
    sql: &str,
) -> Result<Vec<RecordBatch>> {
    context.sql(sql).await?.collect().await
}
```
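A note on the measurement in the code above: the timed region is not the same in the two branches. The scheduler branch executes the `plan` built before the timer starts, while the other branch calls `context.sql(sql)` again inside the timed region and so also re-plans the query. Each configuration is also timed only once per process. A minimal sketch of a timing helper that takes several samples and reports the median is below. It uses only the standard library, and `time_runs` and `median` are hypothetical names for illustration, not part of DataFusion. It is synchronous, so for the async code above the same pattern would have to be applied around the `.await` calls.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the wall-clock duration of each run.
/// Hypothetical helper for illustration; not a DataFusion API.
fn time_runs<T>(runs: usize, mut work: impl FnMut() -> T) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        let out = work();
        samples.push(start.elapsed());
        // Drop the result after the timer has been read, so that
        // deallocation cost is not counted in the sample.
        drop(out);
    }
    samples
}

/// Median of the samples (the upper median when the count is even).
/// It is less sensitive to a slow first, cold-cache run than the mean.
fn median(mut samples: Vec<Duration>) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    samples[samples.len() / 2]
}
```

Comparing the medians from the same pre-built physical plan in both modes would keep planning time out of the comparison.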
2. test data
- format: parquet
- number of files: 4
- rows: 16405852 * 4 = 65623408
- number of columns: 6
- schema: uint32, uint32, uint32, uint32, string, uint32
3. test result
SQLs:
```sql
select count(distinct column0) from test_table;
select * from test_table order by column5 limit 10;
```
Performance is about the same with and without the Scheduler. Is there a
problem with how I am using it?
@tustvold