[ 
https://issues.apache.org/jira/browse/BEAM-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Kang resolved BEAM-7812.
-----------------------------
       Resolution: Fixed
    Fix Version/s: Not applicable

The change was made in the Dataflow service, internally at Google, rather than 
in the Dataflow runner, because we want it to be compatible with older SDK 
versions.

> Work around Stackdriver error reporting double counting worker errors
> ---------------------------------------------------------------------
>
>                 Key: BEAM-7812
>                 URL: https://issues.apache.org/jira/browse/BEAM-7812
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>            Reporter: Ning Kang
>            Assignee: Ning Kang
>            Priority: Minor
>             Fix For: Not applicable
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h1. *Objective*
> Work around Stackdriver Error Reporting so that worker errors are counted 
> only once even though they are logged twice.
> {color:#d04437}*Only applicable to Dataflow runner workers in the SDK*{color}.
> h1. *Background*
> Stackdriver Error Reporting double counts worker errors logged to 
> Stackdriver, because:
>  # workers log errors to Stackdriver;
>  # workers report the same errors to dfe, and dfe logs them again to 
> Stackdriver.
> The double counting is blocking us from sending job message logs from dfe to 
> Stackdriver, because we don't want to change the behavior of any existing log 
> or feature.
> There happens to be an inconsistency in Java batch 
> [DataflowWorkerLoggingHandler|[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java#L82]]
>  and streaming 
> ([StreamingDataflowWorker|[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L1747]])
>  error reporting to dfe. As a result, errors reported by the streaming Java 
> worker are eventually ignored by Stackdriver Error Reporting.
> h1. *Details*
> Inspired by this inconsistency, we decided to apply the streaming Java worker 
> error reporting logic to batch as well, both to fix the inconsistency and to 
> work around the double counting issue in Stackdriver Error Reporting.
> The change applies when workers report errors to dfe:
>  * For Java, construct the stack trace from the StackTrace object instead of 
> using printStackTrace;
>  * For Python, report the complete error message details, exactly as they are 
> sent to worker logging, instead of reporting only the traceback obtained 
> through the traceback module.
> Users will not see any change, since job message logging to Stackdriver 
> hasn’t been launched yet.
> h1. *Test Plan*
> We'll add unit tests for the public methods changed in the process.
> Google has internal integration tests where we can push worker harness images 
> and set the worker harness container image to test in a sandbox.
> When releasing, we also have integration tests at the different release 
> stages.
> The workaround needs to be fully released before we can enable job 
> message logging.
> We can verify the format of stack traces in the sandbox and release stages by 
> executing example pipelines in our projects and directly browsing the prod 
> Stackdriver logging and error reporting consoles. This should be done before 
> and after enabling job message logging.
> Run any other existing and required tests before sending PR.
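
For illustration, the Java item under Details (building the report from the 
stack trace instead of calling printStackTrace) might look roughly like the 
sketch below. This is not the actual Beam or Dataflow code: the class name, 
method names, indentation width, and depth limit are all hypothetical, and the 
sketch assumes that "StackTrace object" refers to the StackTraceElement array 
returned by Throwable.getStackTrace().

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for illustration only; not Beam's implementation.
final class StackTraceReportSketch {
  // Assumed cap on the cause chain; also guards against cyclic causes.
  private static final int MAX_CAUSE_DEPTH = 10;

  private StackTraceReportSketch() {}

  /** Renders the throwable with printStackTrace (the approach being replaced). */
  static String viaPrintStackTrace(Throwable t) {
    StringWriter buffer = new StringWriter();
    PrintWriter writer = new PrintWriter(buffer);
    t.printStackTrace(writer);
    writer.flush();
    return buffer.toString();
  }

  /** Builds the report by hand from each throwable's StackTraceElement array. */
  static String viaStackTraceElements(Throwable t) {
    StringBuilder report = new StringBuilder();
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH; cur = cur.getCause()) {
      if (depth > 0) {
        report.append("Caused by: ");
      }
      // Throwable.toString() yields "<class name>: <message>".
      report.append(cur).append('\n');
      for (StackTraceElement frame : cur.getStackTrace()) {
        // Frames are indented with spaces and carry no "at " prefix, so the
        // text differs from what printStackTrace would have produced.
        report.append("        ").append(frame).append('\n');
      }
      depth++;
    }
    return report.toString();
  }
}
```

Calling both methods on the same exception gives two different strings for the 
same error. Whether that difference is what keeps Stackdriver Error Reporting 
from counting the dfe copy is an inference from the Background section, not 
something this sketch demonstrates.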


