aIbrahiim commented on issue #30644:
URL: https://github.com/apache/beam/issues/30644#issuecomment-4338576893

> Ok, that is a different job (https://console.cloud.google.com/dataflow/jobs/us-central1/2026-04-27_23_23_55-17842483795456124420)
>
> I think this problem is different from the GPU not being available. If I look in the logs, I see that drivers are getting installed.
>
> https://console.cloud.google.com/logs/query;query=resource.type%3D%22dataflow_step%22%0Aresource.labels.job_id%3D%222026-04-27_23_23_55-17842483795456124420%22%0ASEARCH%2528%22Driver%20Version%22%2529;cursorTimestamp=2026-04-28T06:53:19.858309Z;startTime=2026-04-28T06:23:56.662Z;endTime=2026-04-28T06:58:22.685Z?project=apache-beam-testing&e=13802955&mods=dm_deploy_from_gcs
>
> This probably means that pytorch is not able to find the driver for some reason. I'd recommend setting up a GPU with the same driver version, loading our container, and then running `torch._C._cuda_init()` to reproduce and validate this.
>
> Taking a closer look, though, I see we're not using a custom container; that is probably the issue. We should be following https://docs.cloud.google.com/dataflow/docs/gpu/use-gpus#custom-container and I'm not sure how this ever worked without that.

So I guess we can:

1. Build a custom Dataflow SDK container for the PyTorch GPU benchmark (a CUDA userspace + torch stack compatible with the Beam Python SDK image).
2. Push that image to the registry and pin it by tag.
3. Update the GPU benchmark path to pass `--sdk_container_image=<gpu-image>`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
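The custom-container step described in the comment above could be sketched roughly as follows. This is a hypothetical example, not the benchmark's actual configuration: the base image tag, the CUDA wheel index, and the registry path are all placeholders, and the CUDA userspace version must match the driver version installed on the Dataflow workers.

```dockerfile
# Sketch: extend the Beam Python SDK image with a CUDA-enabled torch stack,
# following the Dataflow custom-container guidance linked above.
# NOTE: the Beam SDK tag and the cu121 wheel index are placeholder choices.
FROM apache/beam_python3.10_sdk:2.55.0

# Install a CUDA-enabled PyTorch wheel; these wheels bundle the CUDA
# userspace libraries, so no separate CUDA toolkit install is needed,
# but the bundled CUDA version must be compatible with the worker driver.
RUN pip install --no-cache-dir torch \
    --index-url https://download.pytorch.org/whl/cu121
```

The image would then be built and pushed with something like `docker build -t <registry>/<repo>/beam-torch-gpu:<tag> .` followed by `docker push <registry>/<repo>/beam-torch-gpu:<tag>`, and the pinned tag passed to the benchmark via `--sdk_container_image=<registry>/<repo>/beam-torch-gpu:<tag>`.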
