[
https://issues.apache.org/jira/browse/SPARK-50520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-50520:
-----------------------------------
Labels: pull-request-available (was: )
> df.rdd.countApprox() : PySpark API does not respect the timeout.
> ----------------------------------------------------------------
>
> Key: SPARK-50520
> URL: https://issues.apache.org/jira/browse/SPARK-50520
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Nandini
> Priority: Major
> Labels: pull-request-available
>
> The PySpark call below does not respect the timeout passed to countApprox():
> {code:python}
> # Create a large table with dummy data
> df = spark.range(100000000)
> df.write.format("delta").saveAsTable("large_table")
> import time
> df = spark.read.format("delta").table("large_table")
> start_time = time.time()
> df.cache()
> print(f"count = {df.rdd.countApprox(timeout=1)}")
> print(f"Time taken: {time.time() - start_time} seconds")
> {code}
> ===> runs until the full count is fetched (about 2 min) instead of returning after the 1 ms timeout
> {code:scala}
> %scala
> // Read the Delta table into a DataFrame
> val df = spark.read.format("delta").table("large_table")
> val startTime = System.nanoTime()
> df.cache()
> val approxCount = df.rdd.countApprox(timeout = 10)
> val endTime = System.nanoTime()
> val duration = (endTime - startTime) / 1e9 // Convert nanoseconds to seconds
> println(s"Time taken: $duration seconds")
> {code}
> ===> times out in 10 ms
> The same behavior is noted in a code comment in the MLflow source linked below,
> but I could not find a corresponding bug report for it.
> https://mlflow.org/docs/latest/_modules/mlflow/data/spark_dataset.html
> # Use Spark RDD countApprox to get approximate count since count() may be expensive.
> # Note that we call the Scala RDD API because the PySpark API does not respect the
> # specified timeout. Reference code:
> # https://spark.apache.org/docs/3.4.0/api/python/_modules/pyspark/rdd.html
> # #RDD.countApprox. This is confirmed to work in all Spark 3.x versions
--
This message was sent by Atlassian Jira
(v8.20.10#820010)