[ 
https://issues.apache.org/jira/browse/SPARK-50520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50520:
-----------------------------------
    Labels: pull-request-available  (was: )

> df.rdd.countApprox() : PySpark API does not respect the timeout.
> ----------------------------------------------------------------
>
>                 Key: SPARK-50520
>                 URL: https://issues.apache.org/jira/browse/SPARK-50520
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.0
>            Reporter: Nandini
>            Priority: Major
>              Labels: pull-request-available
>
> The PySpark call below ignores the timeout passed to countApprox:
> {code:python}
> # Create a large table with dummy data
> df = spark.range(100000000)
> df.write.format("delta").saveAsTable("large_table")
> import time
> df = spark.read.format("delta").table("large_table")
> start_time = time.time()
> df.cache()
> print(f"count = {df.rdd.countApprox(timeout=1)}")
> print(f"Time taken: {time.time() - start_time} seconds")
> {code}
> ===> Runs until the full count is fetched (about 2 min); the timeout is ignored.
> {code:scala}
> %scala
>     // Read the Delta table into a DataFrame
>     val df = spark.read.format("delta").table("large_table")
>     val startTime = System.nanoTime()
>     df.cache()
>     val approxCount = df.rdd.countApprox(timeout = 10)
>     val endTime = System.nanoTime()
>     val duration = (endTime - startTime) / 1e9 // Convert nanoseconds to seconds
>     println(s"Time taken: $duration seconds")
> {code}
> ===> Times out after 10 ms, as expected.
> The same behavior is noted in a code comment in the MLflow source linked below, 
> but I could not find a corresponding Spark issue for it.
> https://mlflow.org/docs/latest/_modules/mlflow/data/spark_dataset.html
>             # Use Spark RDD countApprox to get approximate count since count() may be expensive.
>             # Note that we call the Scala RDD API because the PySpark API does not respect the
>             # specified timeout. Reference code:
>             # https://spark.apache.org/docs/3.4.0/api/python/_modules/pyspark/rdd.html
>             # #RDD.countApprox. This is confirmed to work in all Spark 3.x versions
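Until the PySpark API honors the timeout, the workaround described in the MLflow comment (calling the Scala RDD API from Python) can be sketched roughly as follows. This is a minimal sketch, not the MLflow or Spark implementation. The function name count_approx_bounded and its parameters are hypothetical. It assumes a classic (non-Connect) Spark 3.x session and relies on the private PySpark attributes _to_java_object_rdd and ctx._jvm, which may change between releases. The key assumption is that PartialResult.initialValue() returns whatever estimate is available at the timeout, whereas getFinalValue() waits for the whole job, which appears to be what the referenced rdd.py source calls.

```python
def count_approx_bounded(rdd, timeout_ms=1000, confidence=0.95):
    """Approximate element count that returns once timeout_ms has elapsed.

    `rdd` is a PySpark RDD, e.g. `df.rdd`. Hypothetical helper for illustration.
    """
    # One float per partition holding that partition's row count.
    counts = rdd.mapPartitions(lambda rows: [float(sum(1 for _ in rows))])

    # Private PySpark API (assumption: unchanged across Spark 3.x):
    # convert the pickled Python floats into JVM objects.
    java_rdd = counts._to_java_object_rdd()
    jvm = counts.ctx._jvm
    double_rdd = jvm.org.apache.spark.api.java.JavaDoubleRDD.fromRDD(java_rdd.rdd())

    # sumApprox returns a PartialResult. initialValue() yields the estimate
    # available when the timeout expires; getFinalValue() is deliberately
    # avoided here because it blocks until the job finishes.
    estimate = double_rdd.sumApprox(timeout_ms, confidence).initialValue()
    return int(estimate.mean())
```

With the table from the report, this would be called as count_approx_bounded(df.rdd, timeout_ms=1). If the timeout expires before any partition completes, the returned estimate may be very rough, so the value is only useful as a lower-cost hint, not as an exact count.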



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
