[jira] [Updated] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

Jungtaek Lim (Jira) Wed, 16 Mar 2022 00:29:05 -0700


     [ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jungtaek Lim updated SPARK-37062:
---------------------------------
    Labels: releasenotes  (was: )

> Introduce a new data source for providing consistent set of rows per 
> microbatch
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-37062
>                 URL: https://issues.apache.org/jira/browse/SPARK-37062
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 3.3.0
>
>
> The "rate" data source is commonly used to benchmark streaming queries.
> While it helps push a query to its limit (how many rows the query can 
> process per second), it does not provide a consistent set of rows per 
> batch, which makes two environments hard to compare.
> For example, you often want to compare per-batch metrics between test 
> environments (such as running the same streaming query with different 
> options). These metrics are strongly affected when the distribution of 
> input rows across batches changes, especially when a micro-batch lags (for 
> any reason) and the rate data source produces more input rows for the next 
> batch.
> Also, when you test streaming aggregation, you may want the data source to 
> produce the same set of input rows per batch (deterministic), so that you 
> can plan how these input rows will be aggregated and how state rows will be 
> evicted, and craft the test query based on that plan.
> The requirements for the new data source are as follows:
> * it should produce a specific number of input rows per batch, as requested
> * it should also include a timestamp (event time) in each row
> ** to make the input rows fully deterministic, the timestamp should be 
> configurable as well (e.g. start timestamp and amount of advance per batch)
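
The requirements above can be illustrated with a small, Spark-free sketch. It
is only a model of the idea, not the implementation attached to this issue:
the names MicroBatchSourceConfig, rows_for_batch, rows_per_batch,
start_timestamp_ms and advance_ms_per_batch are hypothetical, and giving every
row in a batch the same event time is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MicroBatchSourceConfig:
    """Hypothetical settings mirroring the requirements in the description."""
    rows_per_batch: int        # fixed number of input rows per micro-batch
    start_timestamp_ms: int    # event time assigned to batch 0
    advance_ms_per_batch: int  # event-time advance between consecutive batches


def rows_for_batch(cfg: MicroBatchSourceConfig, batch_id: int) -> List[Tuple[int, int]]:
    """Return (event_time_ms, value) rows for the given batch.

    The result depends only on the config and the batch id, never on the
    wall clock, so a lagging batch cannot change what the next batch holds.
    """
    # Assumption: all rows in one batch share the same event time.
    event_time_ms = cfg.start_timestamp_ms + batch_id * cfg.advance_ms_per_batch
    first_value = batch_id * cfg.rows_per_batch
    return [(event_time_ms, first_value + i) for i in range(cfg.rows_per_batch)]


cfg = MicroBatchSourceConfig(
    rows_per_batch=3, start_timestamp_ms=0, advance_ms_per_batch=1000
)
batch0 = rows_for_batch(cfg, 0)
batch1 = rows_for_batch(cfg, 1)
```

Because each batch is a pure function of its id, two test environments that
run the same number of batches see identical input, which is what makes
per-batch metrics and state eviction comparable.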



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

