HeartSaVioR commented on pull request #35362:
URL: https://github.com/apache/spark/pull/35362#issuecomment-1029585818


   cc. @brkyvz since he's the author of the code, although it was 
committed 5+ years ago.
   
   @nyingping 
   If you don't mind, could you please run a micro-benchmark against this 
change?
   
   Benchmarks for SQL live under `sql/core/src/test/scala`, in the package 
`org.apache.spark.sql.execution.benchmark`. Since this change is really about 
creating the time window, you don't need to deal with a streaming query or 
aggregation. You can start with a batch query (say, build your Dataset via 
`spark.range(10000000)`), convert the values to timestamps, call `window` in a 
select, write to the "noop" sink format, and you're done.
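   Putting those steps together, a minimal sketch might look like the following 
(assuming Spark 3.1+, which provides `timestamp_seconds` and the "noop" data 
source; the session setup and app name are just placeholders):

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   // Sketch of the suggested batch benchmark query, not a full harness.
   val spark = SparkSession.builder()
     .master("local[*]")
     .appName("time-window-bench")
     .getOrCreate()

   spark.range(10000000L)
     .select(timestamp_seconds(col("id")) as "time")  // long ids -> timestamps
     .select(window(col("time"), "10 seconds"))       // tumbling 10-second windows
     .write.format("noop").mode("overwrite").save()   // runs the plan, discards rows
   ```

   Writing to the noop sink forces the whole plan to execute without paying 
for an actual output format, so the timing reflects the window creation itself.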
   
   Below is the simple benchmark code from #18364. It doesn't leverage the 
benchmark framework, but it gives a sense of how benchmark code looks. With 
the benchmark framework, you would remove `spark.time` and use the framework's 
functionality instead.
   
   ```scala
   import org.apache.spark.sql.functions._
   import spark.implicits._  // needed for the 'id / 'time symbol-to-Column conversions

   val numRecords = 10000000L  // adjust as needed

   spark.time {
     spark.range(numRecords)
       .select(from_unixtime((current_timestamp().cast("long") * 1000 + 'id /
   1000) / 1000) as 'time)
       .select(window('time, "10 seconds"))
       .count()
   }
   ```
   
   If learning the benchmark framework feels like too much bootstrapping, 
please start with the code above (covering both tumbling and sliding windows); 
if the code can show the difference, that would be sufficient.
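   For reference, the window assignment being benchmarked boils down to 
floor-to-boundary arithmetic. A minimal sketch in plain Scala (illustrative 
only, not Spark's exact `TimeWindow` implementation; the function name is made 
up):

   ```scala
   // Illustrative sketch: compute the start of the tumbling window containing
   // a timestamp. The double-mod keeps the result correct for timestamps
   // before the epoch (negative values).
   def tumblingWindowStart(ts: Long, windowSize: Long): Long =
     ts - (((ts % windowSize) + windowSize) % windowSize)
   ```

   A sliding window generalizes this by producing one such start per slide 
step; when the slide equals the window size it degenerates to the tumbling case.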


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


