HeartSaVioR commented on pull request #35362:
URL: https://github.com/apache/spark/pull/35362#issuecomment-1029585818
cc. @brkyvz since he's the author of this code, although it was
committed 5+ years ago.
@nyingping
If you don't mind, could you please try running a micro-benchmark against
the change?
Benchmarks in SQL live under `sql/core/src/test/scala`, in the package
`org.apache.spark.sql.execution.benchmark`. Since this is really about
creating the time window, you don't need to deal with a streaming query or
aggregation. You can start with a batch query (say, building your Dataset via
`spark.range(10000000)`), convert these values to timestamps, call `window`
in a select, write to the "noop" sink format, and you're done.
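A minimal sketch of that batch pipeline might look like the following. The record count and window duration are illustrative, and `timestamp_seconds` assumes Spark 3.1+; adjust to whatever conversion fits your setup:

```scala
import org.apache.spark.sql.functions._

// Illustrative size; pick whatever makes the difference measurable.
val numRecords = 10000000L

spark.range(numRecords)
  .select(timestamp_seconds($"id") as "time")  // long -> timestamp
  .select(window($"time", "10 seconds"))       // tumbling time window
  .write
  .format("noop")                              // discard output; measure execution only
  .mode("overwrite")
  .save()
```

Writing to the "noop" format forces the whole plan to execute without paying any sink cost, which keeps the measurement focused on the window expression itself.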
Below is a simple benchmark snippet from #18364. It doesn't leverage the
benchmark framework, but it gives a sense of how to write benchmark code.
With the benchmark framework you'd remove `spark.time` and use the
framework's own measurement facilities instead.
```scala
import org.apache.spark.sql.functions._

// numRecords: record count chosen by the caller
spark.time {
  spark.range(numRecords)
    .select(from_unixtime((current_timestamp().cast("long") * 1000 + 'id / 1000) / 1000) as 'time)
    .select(window('time, "10 seconds"))
    .count()
}
```
If learning the benchmark framework feels like too much bootstrapping, please
start with the code above (covering both tumbling and sliding windows); if it
can show the difference, that would be sufficient.
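If you do go the framework route, a hedged sketch of how the cases might be wired up is below. The class name `org.apache.spark.benchmark.Benchmark` and its `addCase`/`run` methods follow recent Spark versions but may differ in your checkout, and the sizes and window durations are illustrative:

```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.functions._

val numRecords = 10000000L
val benchmark = new Benchmark("time window", numRecords)

// Tumbling window: one 10-second bucket per row.
benchmark.addCase("tumbling window") { _ =>
  spark.range(numRecords)
    .select(timestamp_seconds($"id") as "time")
    .select(window($"time", "10 seconds"))
    .write.format("noop").mode("overwrite").save()
}

// Sliding window: overlapping 10-second buckets every 3 seconds.
benchmark.addCase("sliding window") { _ =>
  spark.range(numRecords)
    .select(timestamp_seconds($"id") as "time")
    .select(window($"time", "10 seconds", "3 seconds"))
    .write.format("noop").mode("overwrite").save()
}

benchmark.run()
```

The framework handles warmup and repeated iterations for you, which is why `spark.time` is no longer needed.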
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]