HeartSaVioR edited a comment on issue #27557: [SPARK-30804][SS] Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog
URL: https://github.com/apache/spark/pull/27557#issuecomment-588592849

> I think the information which prints out is not necessary for the users

I'm not sure I can agree with that. The information is much like what InMemoryFileIndex logs when listing leaf files, and that message is logged at INFO level if I remember correctly.

For streaming workloads, latency is a first-class consideration. End users would have no idea why the overall latency suddenly increases every N batches unless they know the details of the metadata handling in FileStreamSource / FileStreamSink. This is a completely different experience from what they get with the Kafka streaming source and sink, and they may struggle to find the root cause in other places, such as their query.

But I'd agree that the information may not be necessary for users if the latency added here is not considerable. We could set a threshold (like 1s or 2s?) and only print the message when the latency exceeds it (still printing it at DEBUG level even if it doesn't reach the threshold), but then the message would deserve a higher severity, WARN.

What do you think?
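To make the threshold idea concrete, here is a minimal sketch of how the level selection could look. It is not the code in the PR: `CompactionTiming`, `timed`, `levelFor`, and the 2000 ms default are hypothetical names and values chosen for illustration, and `println` stands in for the real logger.

```scala
import java.util.concurrent.TimeUnit

object CompactionTiming {
  // Hypothetical default; the comment only floats "1s or 2s" as candidates.
  val WarnThresholdMs: Long = 2000L

  /** Runs `body` and returns its result with the elapsed wall-clock time in ms. */
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start))
  }

  /** WARN once the elapsed time exceeds the threshold, DEBUG otherwise. */
  def levelFor(elapsedMs: Long, thresholdMs: Long = WarnThresholdMs): String =
    if (elapsedMs > thresholdMs) "WARN" else "DEBUG"

  def main(args: Array[String]): Unit = {
    // Thread.sleep is a stand-in for the actual compact operation.
    val (_, elapsedMs) = timed { Thread.sleep(10) }
    println(s"[${levelFor(elapsedMs)}] Compacting took $elapsedMs ms")
  }
}
```

Keeping the level decision in a pure function separate from the timing makes the threshold easy to unit test and to expose as a configuration value later, if that turns out to be desirable.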
