[
https://issues.apache.org/jira/browse/BEAM-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía updated BEAM-7812:
-------------------------------
Status: Open (was: Triage Needed)
> Work around Stackdriver error reporting double counting worker errors
> ---------------------------------------------------------------------
>
> Key: BEAM-7812
> URL: https://issues.apache.org/jira/browse/BEAM-7812
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Reporter: Ning Kang
> Priority: Minor
>
> h1. *Objective*
> Work around Stackdriver Error Reporting so that worker errors are counted only
> once, even though they are logged twice.
> {color:#d04437}*Only applicable to Dataflow runner workers in the SDK*{color}.
> h1. *Background*
> Stackdriver Error Reporting double counts worker errors logged to
> Stackdriver, because:
> # workers log errors to Stackdriver;
> # workers report the same errors to dfe, and dfe logs them again to
> Stackdriver.
> The double counting is blocking us from sending job message logs from dfe to
> Stackdriver, because we don't want to change the behavior of any existing log
> or feature.
> There happens to be an inconsistency between how the Java batch worker
> ([DataflowWorkerLoggingHandler|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java#L82])
> and the Java streaming worker
> ([StreamingDataflowWorker|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L1747])
> report errors to dfe. As a result, errors reported by the streaming Java
> worker are eventually ignored by Stackdriver Error Reporting.
> h1. *Details*
> Inspired by this inconsistency, we decided to apply the streaming Java worker's
> error reporting logic to batch as well. This both fixes the inconsistency and
> works around the double counting issue in Stackdriver Error Reporting.
> The change applies when workers report errors to dfe:
> * For Java, construct the stack trace from the StackTrace object instead of
> using printStackTrace;
> * For Python, report the complete error message details, exactly as they are
> sent to worker logging, instead of reporting only the traceback obtained
> through the traceback module.
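For illustration only, the difference described in the Python bullet can be sketched with the standard `traceback` module. The helper names `traceback_only` and `full_error_details` and the sample `failing_step` function are hypothetical, and this is not the actual Beam worker code; it only shows a report that carries just the frames versus one that also carries the exception type and message.

```python
import traceback


def traceback_only(exc):
    """Return only the stack frames, without the exception type or message."""
    return "".join(traceback.format_tb(exc.__traceback__))


def full_error_details(exc):
    """Return the frames followed by the final 'ExceptionType: message' line."""
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__))


def failing_step():
    # Hypothetical stand-in for user code that fails on a worker.
    raise ValueError("bad element")


try:
    failing_step()
except ValueError as err:
    short_report = traceback_only(err)
    full_report = full_error_details(err)
```

Here `full_report` ends with the `ValueError: bad element` line, while `short_report` stops at the last frame. Whether worker logging produces its text in exactly this way is an assumption of the sketch.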
> Users will not see any change, since job message logging to Stackdriver
> hasn’t been launched yet.
> h1. *Test Plan*
> We'll add unit tests for the public methods changed in the process.
> Google has internal integration tests where we can push worker harness images
> and set the worker harness container image to test in a sandbox.
> When releasing, we also have integration tests at different release stages.
> The workaround needs to be released completely before we can enable job
> message logging.
> We can verify the format of stack traces in the sandbox and release stages by
> executing example pipelines in our projects and directly browsing the prod
> Stackdriver Logging and Error Reporting consoles. This should be done before
> and after enabling job message logging.
> Run any other existing and required tests before sending the PR.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)