yijiacui-db commented on pull request #31944:
URL: https://github.com/apache/spark/pull/31944#issuecomment-829531520


   > Yeah I agree about the rationalization and benefits of "adding public API 
on custom source metrics", though it'd be even better if we could talk with 
real case which is not covered by #30988.
   > 
   > I feel that the reason the review gets dragging is due to Kafka use-case. 
Your explanation may make sense on "other" data source (hypothetically, as you 
haven't provided actual one), but for Kafka case it's possible for specific 
process to calculate lag with the change of #30988. I agree it's bad for human 
being to calculate the lag per topic partition and summarize by him/herself, 
but it's still not that hard for specific process to do that.
   
   @viirya @HeartSaVioR
   
   A good example is FileStreamSource, which doesn't implement 
reportLatestOffset, because the latest data available in the source doesn't 
map onto the "Offset" representation used in Spark streaming. 
   
   In FileStreamSource, fetchMaxOffset returns the maximum offset that can be 
retrieved from the source, which can be rate limited. Only the file source 
itself knows internally how many files are left to be processed for the 
batch. Possible metrics to expose to users here are the number of files and 
the number of bytes remaining to be processed in the batch, which indicate how 
far the application is falling behind the stream.
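   
   To make the idea concrete, here is a minimal, self-contained sketch of a 
rate-limited file source that tracks its own backlog. It is only an 
illustration, not Spark's FileStreamSource: the class FileBacklogTracker, its 
methods, and the metric names filesRemaining and bytesRemaining are all 
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileBacklogTracker:
    """Hypothetical model of a rate-limited file source (not a Spark API)."""

    # Assumed rate limit: maximum number of files admitted into one batch.
    max_files_per_batch: int
    # Files discovered but not yet handed to a batch: (path, size_in_bytes).
    unread: List[Tuple[str, int]] = field(default_factory=list)

    def discover(self, files: List[Tuple[str, int]]) -> None:
        """Record newly listed files as pending."""
        self.unread.extend(files)

    def next_batch(self) -> List[Tuple[str, int]]:
        """Admit at most max_files_per_batch files; the rest stay pending."""
        batch = self.unread[: self.max_files_per_batch]
        self.unread = self.unread[self.max_files_per_batch:]
        return batch

    def metrics(self) -> Dict[str, int]:
        """Backlog known only to the source: pending file count and bytes."""
        return {
            "filesRemaining": len(self.unread),
            "bytesRemaining": sum(size for _, size in self.unread),
        }


tracker = FileBacklogTracker(max_files_per_batch=2)
tracker.discover([("a.json", 100), ("b.json", 250), ("c.json", 400)])
tracker.next_batch()
print(tracker.metrics())
```

   The point of the sketch is that the backlog is derived from state the 
source keeps internally, so it cannot be reconstructed from an offset alone.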

