yijiacui-db edited a comment on pull request #31944: URL: https://github.com/apache/spark/pull/31944#issuecomment-805966497
> Yeah I guess it's just a matter of parsing and calculation on user end.
>
> * Do you have specific use case leveraging this information?
> * Are you planning to integrate this information to Spark UI or somewhere?
> * Could you please try out recent version of Spark and check the available information on SourceProgress, and see whether it could solve the same use case despite some more calculation?

@viirya @HeartSaVioR
I don't think this duplicates information already in the source progress. The source progress currently records the latest offset consumed by the stream, not the latest offset available in the source. Take Kafka as an example: a read limit can be set when consuming offsets, so a batch consumes only a limited number of offsets even when more data is available in Kafka. The same applies to all other streaming sources. Some users want to know through the listener whether they are falling behind, so that they can adjust the cluster size accordingly. I don't think that can be calculated from the current source progress, because the latest available offset is internal to the source and the current source progress does not expose it.

I don't have a plan to integrate this information with the Spark UI. That's something I can work on after @viirya's PR is merged. I can refactor and adjust accordingly to see whether this metrics information can be exposed through the Spark UI too.
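To make the distinction between consumed and available offsets concrete, here is a minimal sketch of the calculation a listener could do if both values were reported. It is plain Python, not Spark code. The function names `partition_lag` and `total_lag`, the topic name `events`, and the offset values are all hypothetical, and the JSON shape (topic, then partition, then offset) is an assumption made for illustration.

```python
import json


def partition_lag(consumed_json, latest_json):
    """Return {(topic, partition): lag}, where lag is the latest available
    offset minus the latest consumed offset, floored at zero."""
    consumed = json.loads(consumed_json)
    latest = json.loads(latest_json)
    lag = {}
    for topic, partitions in latest.items():
        for partition, latest_offset in partitions.items():
            # Assumption: a partition with no consumed offset yet counts from 0.
            consumed_offset = consumed.get(topic, {}).get(partition, 0)
            lag[(topic, partition)] = max(latest_offset - consumed_offset, 0)
    return lag


def total_lag(consumed_json, latest_json):
    """Sum the per-partition lag into one number a user could alert on."""
    return sum(partition_lag(consumed_json, latest_json).values())


# Hypothetical offsets: a read limit stopped the batch at offset 100 on
# partition 0 although the source already holds data up to offset 180.
consumed = '{"events": {"0": 100, "1": 250}}'
latest = '{"events": {"0": 180, "1": 250}}'

print(partition_lag(consumed, latest))
print(total_lag(consumed, latest))
```

With only the consumed offsets, the second argument is missing and the lag cannot be derived, which is the gap described above.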
