luckylsk34 opened a new issue, #714:
URL: https://github.com/apache/arrow-ballista/issues/714

   **Describe the bug**
   In Ballista, when I submit the same DataFrame for execution repeatedly, each 
computation takes about 0.5 seconds, so 10 iterations take about 5 seconds. 
This also happens in cluster mode with a single executor. The file 
random.parquet contains a single column of 1 million random integers. When I 
run the same query with DataFusion alone, it is very fast. Spark does not show 
this slowdown either. Am I doing something wrong?
   
   **To Reproduce**
   Run the code below:
   ```Rust
       // For doing the same with DataFusion
       // let ctx = SessionContext::new();
   
       // Ballista
       let config = BallistaConfig::builder().build().expect("");
   
       // connect to Ballista scheduler
       // let ctx = BallistaContext::remote("localhost", 50050, &config).await.expect("");
       let ctx = BallistaContext::standalone(&config, 4).await.expect("");
   
       let df = ctx
           .read_parquet("./testdata/random.parquet", ParquetReadOptions::default())
           .await?;
       let args: Vec<String> = env::args().collect();
       let n = args[1].parse::<i32>().expect("");
       let start = SystemTime::now();
       for _ in 0..n {
           df.clone()
               .aggregate(vec![], vec![sum(col("random_integers"))])?
               .collect()
               .await?
               .get(0);
       }
       let end = SystemTime::now();
       println!("{:?}", end.duration_since(start).expect("").as_millis());
   
       // println!("{}", df.count().await?);
       // df.show_limit(10).await?;
       Ok(())
   ```
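
To tell a fixed per-call overhead apart from a one-time warm-up cost, each iteration can be timed on its own instead of timing the whole loop. The following is a minimal sketch using only the standard library. `time_iterations` and `summarize` are hypothetical helper names, not Ballista or DataFusion APIs, and the closure is synchronous, which is an assumption made to keep the sketch self-contained.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `n` times and returns the wall-clock duration of each run.
/// Hypothetical helper, not part of Ballista or DataFusion.
fn time_iterations<F: FnMut()>(n: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        // Instant is monotonic, unlike SystemTime, so elapsed() cannot fail.
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}

/// Returns (total, mean) of the samples; both are zero for an empty slice.
fn summarize(samples: &[Duration]) -> (Duration, Duration) {
    let total: Duration = samples.iter().sum();
    let mean = if samples.is_empty() {
        Duration::ZERO
    } else {
        total / samples.len() as u32
    };
    (total, mean)
}
```

In the async loop above, the same `Instant::now()` / `elapsed()` pair can be placed directly around the `.collect().await?` call. If every sample is close to 0.5 seconds, the cost is paid on each submission rather than only on the first one.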
   
   ```Scala
   val df = spark.read.parquet("./testdata/random.parquet")
   def run_loop(n: Int) = {
       val t0 = System.nanoTime()
       for (a <- 1 to n) {
           df.agg(sum("random_integers")).first
       }
       val t1 = System.nanoTime()
       println("Elapsed time: " + (t1 - t0) / 1000000 + "ms")
   }
   ```
   
   **Expected behavior**
   Repeated executions of the same DataFrame should not incur a noticeable 
delay between consecutive calls.
   
   **Additional context**
   

