[ 
https://issues.apache.org/jira/browse/SPARK-53690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayant Sharma updated SPARK-53690:
----------------------------------
    Description: 
In Structured Streaming, the Apache Kafka sources object has metrics 
*inputRowsPerSecond* and *processedRowsPerSecond* which ** are currently stored 
as *Double* and passed as Double in the JSON representation. For large values, 
they can be reported as exponential/scientific notation (e.g., 
{*}6.923076923076923E7{*}) or double values with very large scale (e.g., 
{*}638297.8723404256{*})

This formatting is not user-friendly. A user can easily interpret 
*6.923076923076923E7* as *6.92* instead of {*}69,230,769.23{*}, as *E* can be 
missed to be spotted.

Example:
{code:java}
INFO ProgressReporter: [queryId = d21e9] [batchId = 1248] Streaming query made 
progress: {
  "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",
  "runId" : "5023fd98-6e3d-44b1-ba52-8499c24ab8a0",
  "name" : null,
  "timestamp" : "2025-09-17T05:59:43.218Z",
  "batchId" : 1248,
  "batchDuration" : 192206,
  "numInputRows" : 8000000,
  "inputRowsPerSecond" : 78886.12787441329,
  "processedRowsPerSecond" : 41622.009718739275,
  "durationMs" : {
    "addBatch" : 191458,
    "commitBatch" : 411,
    "commitOffsets" : 82,
    "getBatch" : 0,
    "latestOffset" : 3,
    "queryPlanning" : 169,
    "triggerExecution" : 192206,
    "walCommit" : 80
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[bigdata_omsTrade_cdc]]",
    "startOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18408809459
      }
    },
    "endOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18416809459
      }
    },
    "latestOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18700472399
      }
    },
    "numInputRows" : 8000000,
    "inputRowsPerSecond" : 78886.12787441329,
    "processedRowsPerSecond" : 41622.009718739275,
    "metrics" : {
      "avgOffsetsBehindLatest" : "2.8366294E8",
      "estimatedTotalBytesBehindLatest" : "7.187828359657416E11",
      "maxOffsetsBehindLatest" : "283662940",
      "minOffsetsBehindLatest" : "283662940"
    }
  } ],
  "sink" : {
    "description" : 
"DeltaSink[s3://<dummy-storage>/__unitystorage/schemas/75bc9f38-9af3-4af9-852b-7077489f93d3/tables/c963dc05-f685-4fe8-a744-500cb40ce28a]",
    "numOutputRows" : -1
  }
}{code}

  was:
In Structured Streaming, the Apache Kafka sources object has metrics 
*inputRowsPerSecond* and *processedRowsPerSecond* which ** are currently stored 
as *Double* and passed as Double in the JSON representation. For large values, 
they can be reported as exponential/scientific notation (e.g., 
{*}6.923076923076923E7{*}) or double values with very large scale (e.g., 
{*}638297.8723404256{*})

This formatting is not user-friendly. A user can easily interpret 
*6.923076923076923E7* as *6.92* instead of {*}69,230,769.23{*}, as *E* can be 
missed to be spotted.

Example:
{code:java}
INFO ProgressReporter: [queryId = d21e9] [batchId = 1248] Streaming query made 
progress: {  "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",  "runId" : 
"5023fd98-6e3d-44b1-ba52-8499c24ab8a0",  "name" : null,  "timestamp" : 
"2025-09-17T05:59:43.218Z",  "batchId" : 1248,  "batchDuration" : 192206,  
"numInputRows" : 8000000,  "inputRowsPerSecond" : 78886.12787441329,  
"processedRowsPerSecond" : 41622.009718739275,  "durationMs" : {    "addBatch" 
: 191458,    "commitBatch" : 411,    "commitOffsets" : 82,    "getBatch" : 0,   
 "latestOffset" : 3,    "queryPlanning" : 169,    "triggerExecution" : 192206,  
  "walCommit" : 80  },  "stateOperators" : [ ],  "sources" : [ {    
"description" : "KafkaV2[Subscribe[bigdata_omsTrade_cdc]]",    "startOffset" : 
{      "bigdata_omsTrade_cdc" : {        "0" : 18408809459      }    },    
"endOffset" : {      "bigdata_omsTrade_cdc" : {        "0" : 18416809459      } 
   },    "latestOffset" : {      "bigdata_omsTrade_cdc" : {        "0" : 
18700472399      }    },    "numInputRows" : 8000000,    "inputRowsPerSecond" : 
78886.12787441329,    "processedRowsPerSecond" : 41622.009718739275,    
"metrics" : {      "avgOffsetsBehindLatest" : "2.8366294E8",      
"estimatedTotalBytesBehindLatest" : "7.187828359657416E11",      
"maxOffsetsBehindLatest" : "283662940",      "minOffsetsBehindLatest" : 
"283662940"    }  } ],  "sink" : {    "description" : 
"DeltaSink[s3://<dummy-storage>/__unitystorage/schemas/75bc9f38-9af3-4af9-852b-7077489f93d3/tables/c963dc05-f685-4fe8-a744-500cb40ce28a]",
    "numOutputRows" : -1  }} {code}


> SS | `avgOffsetsBehindLatest` and `estimatedTotalBytesBehindLatest` in 
> progress report show values in exponential notation for large numbers
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-53690
>                 URL: https://issues.apache.org/jira/browse/SPARK-53690
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.5.0, 3.5.2, 4.0.0
>            Reporter: Jayant Sharma
>            Priority: Major
>             Fix For: 3.5.2, 4.1.0, 4.0.0
>
>
> In Structured Streaming, the Apache Kafka sources object has metrics 
> *inputRowsPerSecond* and *processedRowsPerSecond* which ** are currently 
> stored as *Double* and passed as Double in the JSON representation. For large 
> values, they can be reported as exponential/scientific notation (e.g., 
> {*}6.923076923076923E7{*}) or double values with very large scale (e.g., 
> {*}638297.8723404256{*})
> This formatting is not user-friendly. A user can easily interpret 
> *6.923076923076923E7* as *6.92* instead of {*}69,230,769.23{*}, as *E* can be 
> missed to be spotted.
> Example:
> {code:java}
> INFO ProgressReporter: [queryId = d21e9] [batchId = 1248] Streaming query 
> made progress: {
>   "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",
>   "runId" : "5023fd98-6e3d-44b1-ba52-8499c24ab8a0",
>   "name" : null,
>   "timestamp" : "2025-09-17T05:59:43.218Z",
>   "batchId" : 1248,
>   "batchDuration" : 192206,
>   "numInputRows" : 8000000,
>   "inputRowsPerSecond" : 78886.12787441329,
>   "processedRowsPerSecond" : 41622.009718739275,
>   "durationMs" : {
>     "addBatch" : 191458,
>     "commitBatch" : 411,
>     "commitOffsets" : 82,
>     "getBatch" : 0,
>     "latestOffset" : 3,
>     "queryPlanning" : 169,
>     "triggerExecution" : 192206,
>     "walCommit" : 80
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
>     "description" : "KafkaV2[Subscribe[bigdata_omsTrade_cdc]]",
>     "startOffset" : {
>       "bigdata_omsTrade_cdc" : {
>         "0" : 18408809459
>       }
>     },
>     "endOffset" : {
>       "bigdata_omsTrade_cdc" : {
>         "0" : 18416809459
>       }
>     },
>     "latestOffset" : {
>       "bigdata_omsTrade_cdc" : {
>         "0" : 18700472399
>       }
>     },
>     "numInputRows" : 8000000,
>     "inputRowsPerSecond" : 78886.12787441329,
>     "processedRowsPerSecond" : 41622.009718739275,
>     "metrics" : {
>       "avgOffsetsBehindLatest" : "2.8366294E8",
>       "estimatedTotalBytesBehindLatest" : "7.187828359657416E11",
>       "maxOffsetsBehindLatest" : "283662940",
>       "minOffsetsBehindLatest" : "283662940"
>     }
>   } ],
>   "sink" : {
>     "description" : 
> "DeltaSink[s3://<dummy-storage>/__unitystorage/schemas/75bc9f38-9af3-4af9-852b-7077489f93d3/tables/c963dc05-f685-4fe8-a744-500cb40ce28a]",
>     "numOutputRows" : -1
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to