[
https://issues.apache.org/jira/browse/SPARK-53690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jayant Sharma updated SPARK-53690:
----------------------------------
Description:
In Structured Streaming, the Apache Kafka sources object has the metrics
*inputRowsPerSecond* and *processedRowsPerSecond*, which are currently stored
as *Double* and passed as Double in the JSON representation. For large values,
they can be reported in exponential/scientific notation (e.g.,
{*}6.923076923076923E7{*}) or as double values with a very large scale (e.g.,
{*}638297.8723404256{*}).
This formatting is not user-friendly. A user can easily misread
*6.923076923076923E7* as *6.92* instead of {*}69,230,769.23{*}, because the *E*
is easy to overlook.
Example:
{code:java}
INFO ProgressReporter: [queryId = d21e9] [batchId = 1248] Streaming query made progress: {
  "id" : "d21e9dc9-95be-4548-8b1c-d5a576691abf",
  "runId" : "5023fd98-6e3d-44b1-ba52-8499c24ab8a0",
  "name" : null,
  "timestamp" : "2025-09-17T05:59:43.218Z",
  "batchId" : 1248,
  "batchDuration" : 192206,
  "numInputRows" : 8000000,
  "inputRowsPerSecond" : 78886.12787441329,
  "processedRowsPerSecond" : 41622.009718739275,
  "durationMs" : {
    "addBatch" : 191458,
    "commitBatch" : 411,
    "commitOffsets" : 82,
    "getBatch" : 0,
    "latestOffset" : 3,
    "queryPlanning" : 169,
    "triggerExecution" : 192206,
    "walCommit" : 80
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[bigdata_omsTrade_cdc]]",
    "startOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18408809459
      }
    },
    "endOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18416809459
      }
    },
    "latestOffset" : {
      "bigdata_omsTrade_cdc" : {
        "0" : 18700472399
      }
    },
    "numInputRows" : 8000000,
    "inputRowsPerSecond" : 78886.12787441329,
    "processedRowsPerSecond" : 41622.009718739275,
    "metrics" : {
      "avgOffsetsBehindLatest" : "2.8366294E8",
      "estimatedTotalBytesBehindLatest" : "7.187828359657416E11",
      "maxOffsetsBehindLatest" : "283662940",
      "minOffsetsBehindLatest" : "283662940"
    }
  } ],
  "sink" : {
    "description" : "DeltaSink[s3://<dummy-storage>/__unitystorage/schemas/75bc9f38-9af3-4af9-852b-7077489f93d3/tables/c963dc05-f685-4fe8-a744-500cb40ce28a]",
    "numOutputRows" : -1
  }
}
{code}
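For illustration only, here is a minimal sketch of how a Double can be rendered without an exponent on the JVM. It is not the fix applied in Spark; the class name PlainMetricFormat and the method toPlain are hypothetical. It relies on the standard behaviour that Double.toString switches to scientific notation for magnitudes of 10^7 and above, and that BigDecimal.valueOf(double).toPlainString() prints the same digits in plain form.

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration; not part of Spark.
final class PlainMetricFormat {

    // Render a double without an exponent. BigDecimal.valueOf(double) is
    // built from Double.toString, so the digits stay the same and only the
    // notation changes.
    static String toPlain(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            // BigDecimal cannot represent NaN or infinity; keep the default text.
            return Double.toString(value);
        }
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        // Values taken from the description and the progress report above.
        double[] samples = {6.923076923076923E7, 2.8366294E8, 7.187828359657416E11};
        for (double v : samples) {
            System.out.println(Double.toString(v) + " -> " + toPlain(v));
        }
    }
}
```

With this approach the avgOffsetsBehindLatest value shown as "2.8366294E8" would print as 283662940, which matches the maxOffsetsBehindLatest and minOffsetsBehindLatest values in the same report.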
> SS | `avgOffsetsBehindLatest` and `estimatedTotalBytesBehindLatest` in
> progress report show values in exponential notation for large numbers
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-53690
> URL: https://issues.apache.org/jira/browse/SPARK-53690
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.5.0, 3.5.2, 4.0.0
> Reporter: Jayant Sharma
> Priority: Major
> Fix For: 3.5.2, 4.1.0, 4.0.0
>