[ 
https://issues.apache.org/jira/browse/BEAM-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Burke updated BEAM-4224:
-------------------------------
    Description: 
Jira tracking work on CPU profiling in the Go SDK.

Prior to this, a hook that enables the Go CPU and trace profiling libraries was 
added in the following commit:
https://github.com/apache/beam/commit/adb78f6c3055693a053a89bdbaa46ca86685a290

At present, the hook is broken on distributed runners:
https://github.com/apache/beam/blob/410ad7699621e28433d81809f6b9c42fe7bd6a60/sdks/go/pkg/beam/x/hooks/perf/perf.go#L50
See also: 
https://stackoverflow.com/questions/67076744/cpu-profiling-not-covering-all-the-vcpu-time-of-apache-beam-pipeline-on-dataflow/67082075?noredirect=1#comment118629835_67082075

The original intent was to have each bundle profiled individually, but this is 
at odds with how CPU profiling works in Go, which measures the whole process.
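
To make the constraint concrete, here is a minimal standalone sketch (plain Go, 
not Beam code): the runtime rejects a second concurrent CPU profile, so two 
"bundles" trying to profile independently step on each other.

{code:go}
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// "Bundle 1" starts the single process-wide CPU profile.
	f1, err := os.Create("bundle1.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f1.Close()
	if err := pprof.StartCPUProfile(f1); err != nil {
		fmt.Println("bundle 1:", err)
	}
	defer pprof.StopCPUProfile()

	// "Bundle 2" tries to profile concurrently. Go supports only one
	// CPU profile per process, so this fails with
	// "cpu profiling already in use". Worse, if bundle 2 called
	// StopCPUProfile instead, it would silently end bundle 1's profile,
	// which is exactly the undercounting seen on distributed runners.
	f2, err := os.Create("bundle2.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f2.Close()
	if err := pprof.StartCPUProfile(f2); err != nil {
		fmt.Println("bundle 2:", err)
	}
}
{code}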

As things stand, different bundles start and stop each other's profiling, 
leading to severe undercounting. A better approach would be to start profiling 
on Init and sample periodically, writing ~30-second chunks to a new file each 
time, per worker. This at least avoids losing most of the profiling 
information when a worker dies. (Profiles can be merged after the fact, so if 
profiling is stopped and immediately restarted, little is lost.)
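
A hedged sketch of that approach (file names, chunk length, and the loop 
structure are illustrative, not the existing perf hook API): one goroutine 
owns the process-wide profile from Init onward and rotates the output file 
every ~30 seconds, so a dying worker loses at most the current chunk.

{code:go}
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// profileLoop writes ~30-second CPU profile chunks to numbered files
// until done is closed. Each chunk is a valid profile on its own, and
// pprof can merge multiple chunks after the fact.
func profileLoop(done <-chan struct{}) {
	for i := 0; ; i++ {
		f, err := os.Create(fmt.Sprintf("cpu_%03d.prof", i))
		if err != nil {
			log.Printf("chunk %d: %v", i, err)
			return
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Printf("chunk %d: %v", i, err)
			f.Close()
			return
		}
		select {
		case <-time.After(30 * time.Second):
			// Flush this chunk and immediately start the next one;
			// only the instant between chunks goes unprofiled.
			pprof.StopCPUProfile()
			f.Close()
		case <-done:
			pprof.StopCPUProfile()
			f.Close()
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	go profileLoop(done)
	time.Sleep(95 * time.Second) // stand-in for the worker's lifetime
	close(done)
}
{code}

The chunks can be combined afterwards, e.g. {{go tool pprof cpu_000.prof 
cpu_001.prof}}, which merges the samples from all of its inputs.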

Optionally, we should add a Teardown trigger to the hooks so the in-flight 
profile can be flushed on a clean exit, but that's not a hard requirement for 
a first pass.
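
For illustration only, since the hooks package has no such trigger today, the 
body of that hook could be as small as:

{code:go}
import "runtime/pprof"

// Hypothetical: the hooks package does not currently expose a Teardown
// trigger; this is what such a hook would need to do to flush the
// in-flight profile chunk on a clean exit.
func onTeardown() {
	pprof.StopCPUProfile()
}
{code}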

Optionally, figure out a clean way to get a job working with Google Cloud 
Profiler, likely as a separate hook:
https://cloud.google.com/profiler/docs/profiling-go
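
For reference, per the linked docs the agent is a one-call setup; the service 
name below is illustrative, and on a Beam worker this would belong in worker 
initialization rather than a user main:

{code:go}
package main

import (
	"log"

	"cloud.google.com/go/profiler"
)

func main() {
	// Service name is illustrative. Cloud Profiler manages its own
	// sampling schedule, so it sidesteps the start/stop contention
	// that breaks the pprof-based hook.
	if err := profiler.Start(profiler.Config{Service: "beam-go-worker"}); err != nil {
		log.Printf("Cloud Profiler failed to start: %v", err)
	}
	// ... run the worker as usual ...
}
{code}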


> Go SDK CPU Profiling
> --------------------
>
>                 Key: BEAM-4224
>                 URL: https://issues.apache.org/jira/browse/BEAM-4224
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-go
>            Reporter: Robert Burke
>            Priority: P3
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>


