[
https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jungtaek Lim updated SPARK-37062:
---------------------------------
Labels: releasenotes (was: )
> Introduce a new data source for providing consistent set of rows per
> microbatch
> -------------------------------------------------------------------------------
>
> Key: SPARK-37062
> URL: https://issues.apache.org/jira/browse/SPARK-37062
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Labels: releasenotes
> Fix For: 3.3.0
>
>
> The "rate" data source is commonly used to benchmark streaming queries.
> While this helps push a query to its limit (how many rows the query can
> process per second), the rate data source doesn't provide a consistent set of
> rows per batch, which makes two environments hard to compare.
> For example, in many cases you may want to compare per-batch metrics
> between test environments (such as running the same streaming query with
> different options). These metrics are strongly affected if the distribution
> of input rows across batches changes, especially when a micro-batch lags
> (for any reason) and the rate data source produces more input rows for the
> next batch.
> Also, when you test streaming aggregation, you may want the data
> source to produce the same set of input rows per batch (deterministic), so
> that you can plan how these input rows will be aggregated and how state rows
> will be evicted, and craft the test query based on that plan.
> The requirements for the new data source are as follows:
> * it should produce the specific number of input rows requested
> * it should also include a timestamp (event time) in each row
> ** to make the input rows fully deterministic, the timestamp should be
> configurable as well (e.g. start timestamp and amount of advance per batch)
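
The requirements above can be illustrated with a minimal, Spark-free sketch. This is not the implementation proposed in the issue: the function name `rows_for_batch`, the `Row` type, and the parameter names are hypothetical, and the choice to give every row in a batch the same timestamp is an assumption made for simplicity. The point is only that a batch's contents depend on nothing but the batch id and the configuration, so a lagging batch cannot change what the next batch contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    """One generated input row (hypothetical shape, for illustration only)."""
    timestamp_ms: int  # event time, in milliseconds
    value: int         # monotonically increasing row counter


def rows_for_batch(
    batch_id: int,
    rows_per_batch: int,
    start_timestamp_ms: int = 0,
    advance_ms_per_batch: int = 1000,
) -> List[Row]:
    """Return the rows of one micro-batch as a pure function of its inputs.

    Assumption: all rows in a batch share one event-time timestamp, which
    advances by a fixed amount per batch. Wall-clock time is never consulted.
    """
    if batch_id < 0:
        raise ValueError("batch_id must be non-negative")
    if rows_per_batch <= 0:
        raise ValueError("rows_per_batch must be positive")

    # Event time is derived from the batch id, not from when the batch runs.
    timestamp_ms = start_timestamp_ms + batch_id * advance_ms_per_batch
    # Row values continue where the previous batch left off.
    first_value = batch_id * rows_per_batch
    return [Row(timestamp_ms, first_value + i) for i in range(rows_per_batch)]
```

Calling `rows_for_batch(2, 2, start_timestamp_ms=1000, advance_ms_per_batch=500)` twice yields the same two rows both times, which is the property that lets a test plan in advance how rows will be aggregated and when state rows will be evicted.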