[GitHub] [spark] yijiacui-db commented on pull request #31944: [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay.

GitBox Wed, 24 Mar 2021 09:21:49 -0700


yijiacui-db commented on pull request #31944:
URL: https://github.com/apache/spark/pull/31944#issuecomment-805966497



   > Yeah I guess it's just a matter of parsing and calculation on user end.
   > 
   > * Do you have specific use case leveraging this information?
   > * Are you planning to integrate this information to Spark UI or somewhere?
   > * Could you please try out recent version of Spark and check the available
   >   information on SourceProgress, and see whether it could solve the same use case
   >   despite of some more calculation?
   
   @viirya @HeartSaVioR I don't think this duplicates information already in the 
source progress. What the source progress records today is the latest offset 
consumed by the stream, not the latest offset available in the source. Take 
Kafka as an example: a read limit can apply while consuming offsets, so the 
stream may consume only a limited number of offsets even though more data is 
available in Kafka. The same applies to the other streaming sources too. Some 
users want to know whether they are falling behind so that they can adjust the 
cluster size accordingly.
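
   To make the distinction concrete, here is a minimal sketch of the lag 
calculation, not the code in this PR. It assumes the source can supply both the 
consumed offset and the latest available offset per topic-partition; the 
function name `offsets_behind_latest`, the dictionary layout, and the choice to 
treat a partition with no consumed offset as starting from 0 are all 
assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical key type for illustration: (topic, partition).
TopicPartition = Tuple[str, int]


def offsets_behind_latest(
    consumed: Dict[TopicPartition, int],
    latest: Dict[TopicPartition, int],
) -> Dict[str, float]:
    """Summarize how far the stream is behind the source, per partition.

    consumed: latest offset consumed by the stream for each partition.
    latest:   latest offset available in the source for each partition.
    """
    # Assumption: a partition with no consumed offset yet is counted from 0.
    # Negative differences are clamped to 0 so a stale "latest" snapshot
    # cannot produce a negative lag.
    lags = [max(latest[tp] - consumed.get(tp, 0), 0) for tp in latest]
    if not lags:
        return {"min": 0.0, "max": 0.0, "avg": 0.0}
    return {
        "min": float(min(lags)),
        "max": float(max(lags)),
        "avg": sum(lags) / len(lags),
    }


if __name__ == "__main__":
    # With a read limit, partition 0 was only consumed up to offset 100
    # although offset 180 is already available in the source.
    consumed = {("events", 0): 100, ("events", 1): 250}
    latest = {("events", 0): 180, ("events", 1): 250}
    print(offsets_behind_latest(consumed, latest))
```

   The `latest` input is what the source progress does not carry today, which 
is why this cannot be computed on the user side from the progress report alone.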
   
   I don't think the information I mentioned above can be calculated from the 
current Spark progress report, because the latest available offset is internal 
to the source; there's no way to know it from the current source progress.
   
   I don't currently have a plan for integrating this information with the 
Spark UI. That's something I can work on after @viirya's PR is merged. I can 
refactor and adjust accordingly to see whether this metrics information can be 
exposed through the Spark UI too.


