HeartSaVioR commented on issue #27557: [SPARK-30804][SS] Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog
URL: https://github.com/apache/spark/pull/27557#issuecomment-589333368

> > For streaming workloads, latency is the first class consideration.
>
> When the query is not running properly.

OK, I admit my experience has mostly been with "low-latency" workloads, but even when Spark runs in micro-batch mode, that doesn't mean latency is unimportant. In a streaming workload, latency is the thing that "defines" whether the query is running properly or not. Spark itself had to claim that a micro-batch could run in sub-second time, because latency has been one of the major downsides of Spark Streaming, and continuous processing had to be introduced to address it.

Higher latency doesn't only mean the output will be late. When you turn on "latestFirst" (together with maxFilesPerTrigger, since in this case we assume we can't process all the inputs) to start reading from the latest files, the latency of a batch defines the boundary of the inputs. It's a critical metric that operators should always observe via their monitoring approaches (alerts, a time-series DB and dashboard, etc.) so they can find out what is happening when the latency fluctuates a lot.

> I think it's debug information which helps developers to find out what's the issue and not users (INFO is more like to users in my understanding).

I'm not sure who you mean by "users". AFAIK, in many cases (not all cases, for sure), users = developers = operators.
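For context, the kind of change the PR title describes, measuring the elapsed time of a compact operation and logging it, can be sketched with a small timing helper. This is a minimal illustrative sketch, not Spark's actual implementation; the helper name `time_taken_ms` and the `compact` callable are hypothetical:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("CompactibleFileStreamLog")

def time_taken_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def compact(batch_id, entries):
    # Hypothetical stand-in for the real compaction work: merge all
    # entries of previous batches into a single compacted list.
    return list(entries)

result, elapsed_ms = time_taken_ms(compact, 42, ["f1", "f2", "f3"])
# Logging at INFO (rather than DEBUG) makes the figure visible to
# operators monitoring batch latency, which is the point being argued.
logger.info("Compacting took %.1f ms for compact batch 42", elapsed_ms)
```

Whether this belongs at INFO or DEBUG is exactly the disagreement in the thread; the sketch only shows where such a measurement would sit.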
