Zifei Feng created SPARK-53972:
----------------------------------

             Summary: Fix recentProgress performance regression
                 Key: SPARK-53972
                 URL: https://issues.apache.org/jira/browse/SPARK-53972
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.1, 4.0.0
            Reporter: Zifei Feng


We have identified a significant performance regression in Apache Spark's 
streaming recentProgress method in python notebook starting from version 4.0.0. 
The time required to fetch recentProgress increases substantially as the number 
of progress records grows, creating a linear or worse scaling issue.

With the following code, it output charts for time it takes to get 
recentProgress before and after changes in [this 
commit|https://github.com/apache/spark/commit/22eb6c4b0a82b9fcf84fc9952b1f6c41dde9bd8d#diff-4d4ed29d139877b160de444add7ee63cfa7a7577d849ab2686f1aa2d5b4aae64]

```
%python
from datetime import datetime
import time

df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
q = df.writeStream.format("noop").start()
print("begin waiting for progress")

progress_list = []
time_diff_list = []

numProgress = len(q.recentProgress)
while numProgress < 70 and q.exception() is None:
time.sleep(1)
beforeTime = datetime.now()
print(beforeTime.strftime("%Y-%m-%d %H:%M:%S") +": before we got those 
progress: "+str(numProgress))
rep = q.recentProgress
numProgress = len(rep)
afterTime = datetime.now()
print(afterTime.strftime("%Y-%m-%d %H:%M:%S") +": after we got those progress: 
"+str(numProgress))
time_diff = (afterTime - beforeTime).total_seconds()
print("Total Time: "+str(time_diff) +" seconds")
progress_list.append(numProgress)
time_diff_list.append(time_diff)

q.stop()
q.awaitTermination()
assert(q.exception() is None)

import pandas as pd

plot_df = pd.DataFrame(\{'numProgress': progress_list, 'time_diff': 
time_diff_list})
display(spark.createDataFrame(plot_df).orderBy("numProgress").toPandas().plot.line(x="numProgress",
 y="time_diff"))
```
See environment for screenshot
!image-2025-10-21-09-27-01-081.png!
 
!image-2025-10-21-09-27-33-612.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to