[PR] [SPARK-53690][SS] Fix exponential formatting of avgOffsetsBehindLatest and estimatedTotalBytesBehindLatest in Kafka sources object in progress json [spark]

via GitHub Wed, 24 Sep 2025 08:39:21 -0700


jayantdb opened a new pull request, #52439:
URL: https://github.com/apache/spark/pull/52439


   ### What changes were proposed in this pull request?
   This PR fixes an issue where `avgOffsetsBehindLatest` and 
`estimatedTotalBytesBehindLatest` in the streaming progress metrics JSON were 
displayed in scientific notation (e.g., `2.8366294E8`). The fix uses safe Decimal 
casting so that the values are displayed in plain decimal notation, which is 
easier to read.
   
   Before change:
   ```
   {
     "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",
     "runId" : "5023fd98-6e3d-44b1-ba52-8499c24ab8a0",
     "name" : "KafkaMetricsTest",
     "timestamp" : "2025-09-23T06:00:00.000Z",
     "batchId" : 1,
     "batchDuration" : 100,
     "numInputRows" : 800000,
     "inputRowsPerSecond" : 78886.1,
     "processedRowsPerSecond" : 41622.0,
     "durationMs" : {
       "total" : 100
     },
     "stateOperators" : [ ],
     "sources" : [ {
       "description" : "kafkaSource",
       "startOffset" : 100,
       "endOffset" : 200,
       "latestOffset" : 300,
       "numInputRows" : 800000,
       "inputRowsPerSecond" : 78886.1,
       "processedRowsPerSecond" : 41622.0,
       "metrics" : {
         "avgOffsetsBehindLatest" : "2.8366294E8",
         "estimatedTotalBytesBehindLatest" : "7.187828359657416E11",
         "maxOffsetsBehindLatest" : "283662940",
         "minOffsetsBehindLatest" : "283662940"
       }
     } ],
     "sink" : {
       "description" : "sink",
       "numOutputRows" : -1
     }
   }
   ```
   After change:
   ```
   {
     "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",
     "runId" : "5023fd98-6e3d-44b1-ba52-8499c24ab8a0",
     "name" : "KafkaMetricsTest",
     "timestamp" : "2025-09-23T06:00:00.000Z",
     "batchId" : 1,
     "batchDuration" : 100,
     "numInputRows" : 800000,
     "inputRowsPerSecond" : 78886.1,
     "processedRowsPerSecond" : 41622.0,
     "durationMs" : {
       "total" : 100
     },
     "stateOperators" : [ ],
     "sources" : [ {
       "description" : "kafkaSource",
       "startOffset" : 100,
       "endOffset" : 200,
       "latestOffset" : 300,
       "numInputRows" : 800000,
       "inputRowsPerSecond" : 78886.1,
       "processedRowsPerSecond" : 41622.0,
       "metrics" : {
         "avgOffsetsBehindLatest" : "283662940.0",
         "estimatedTotalBytesBehindLatest" : "718782835965.7",
         "maxOffsetsBehindLatest" : "283662940",
         "minOffsetsBehindLatest" : "283662940"
       }
     } ],
     "sink" : {
       "description" : "sink",
       "numOutputRows" : -1
     }
   }
   ```
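
The difference between the two outputs above can be reproduced with a minimal JVM sketch. This is not the code added by this PR: the class name `MetricFormatSketch`, the method `toPlain`, the one-digit scale, and the fallback for non-finite values are all assumptions made for illustration. It only shows why a `Double` renders as `2.8366294E8` by default and how going through `BigDecimal` yields plain notation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper for illustration only; not the code added by this PR.
final class MetricFormatSketch {

    // Render a double metric in plain decimal notation with one fractional digit.
    static String toPlain(double value) {
        // NaN and Infinity have no BigDecimal form, so keep their default rendering
        // (assumed fallback; the PR's actual handling may differ).
        if (!Double.isFinite(value)) {
            return Double.toString(value);
        }
        return BigDecimal.valueOf(value)
                .setScale(1, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        double avgOffsets = 2.8366294E8;
        double totalBytes = 7.187828359657416E11;
        System.out.println(Double.toString(avgOffsets)); // 2.8366294E8
        System.out.println(toPlain(avgOffsets));         // 283662940.0
        System.out.println(toPlain(totalBytes));         // 718782835965.7
    }
}
```

`Double.toString` switches to scientific notation for magnitudes of 10^7 and above, which is why only the large averaged and estimated metrics were affected, while the integer-valued `maxOffsetsBehindLatest` and `minOffsetsBehindLatest` were not.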
   
   Note: Currently, `estimatedTotalBytesBehindLatest` is available in Spark in 
Databricks. Hence, this fix can safely be auto-applied in Spark in Databricks.
   
   ### Why are the changes needed?
   The current formatting is not user-friendly. A user can easily misread 
`2.8366294E8` as `2.8` instead of `283,662,940`, because the `E` exponent is 
easy to overlook. This fix improves the readability of Spark Structured 
Streaming progress metrics JSON.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Run this Maven test:
   ```
   ./build/mvn -pl sql/core,sql/api \
   -am test \
   -DwildcardSuites=org.apache.spark.sql.streaming.StreamingQueryStatusAndProgressSuite \
   -DwildcardTestName="SPARK-53690"
   ```
   Results:
   ```
   Run completed in 10 seconds, 29 milliseconds.
   Total number of tests run: 13
   Suites: completed 2, aborted 0
   Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary for Spark Project Parent POM 4.1.0-SNAPSHOT:
   [INFO] 
   [INFO] Spark Project Parent POM ........................... SUCCESS [  1.083 s]
   [INFO] Spark Project Tags ................................. SUCCESS [  1.474 s]
   [INFO] Spark Project Sketch ............................... SUCCESS [  1.401 s]
   [INFO] Spark Project Common Java Utils .................... SUCCESS [  1.794 s]
   [INFO] Spark Project Common Utils ......................... SUCCESS [  1.666 s]
   [INFO] Spark Project Local DB ............................. SUCCESS [  4.317 s]
   [INFO] Spark Project Networking ........................... SUCCESS [ 54.404 s]
   [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  5.816 s]
   [INFO] Spark Project Variant .............................. SUCCESS [  0.797 s]
   [INFO] Spark Project Unsafe ............................... SUCCESS [  2.816 s]
   [INFO] Spark Project Connect Shims ........................ SUCCESS [  0.718 s]
   [INFO] Spark Project Launcher ............................. SUCCESS [  3.100 s]
   [INFO] Spark Project Core ................................. SUCCESS [ 28.484 s]
   [INFO] Spark Project SQL API .............................. SUCCESS [ 12.863 s]
   [INFO] Spark Project Catalyst ............................. SUCCESS [  6.799 s]
   [INFO] Spark Project SQL .................................. SUCCESS [01:07 min]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  03:15 min
   [INFO] Finished at: 2025-09-24T20:39:57+05:30
   [INFO] ------------------------------------------------------------------------
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

