fabito opened a new issue, #22402:
URL: https://github.com/apache/beam/issues/22402

   ### What would you like to happen?
   
   I am running a pipeline to extract image embeddings using `open-clip-torch` 
and the RunInference API in Dataflow.
   Sometimes, specially when the `DataflowRunner` triggers a scale up, we get 
unhealthy workers due to corrupted model files.
   Whenever that happens the whole job fails.
   Would it be possible to detect corrupted model files and reload them ?
   For more details see the log below:
   
   ```
   Traceback (most recent call last):
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", 
line 284, in _execute
       response = task()
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", 
line 357, in <lambda>
       lambda: self.create_worker().do_instruction(request), request)
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", 
line 597, in do_instruction
       return getattr(self, request_type)(
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", 
line 628, in process_bundle
       bundle_processor = self.bundle_processor_cache.get(
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", 
line 458, in get
       processor = bundle_processor.BundleProcessor(
     File 
"/venv/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py",
 line 873, in __init__
       op.setup()
     File "apache_beam/runners/worker/operations.py", line 833, in 
apache_beam.runners.worker.operations.DoOperation.setup
     File "apache_beam/runners/worker/operations.py", line 882, in 
apache_beam.runners.worker.operations.DoOperation.setup
     File "apache_beam/runners/common.py", line 1471, in 
apache_beam.runners.common.DoFnRunner.setup
     File "apache_beam/runners/common.py", line 1467, in 
apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
     File "apache_beam/runners/common.py", line 1507, in 
apache_beam.runners.common.DoFnRunner._reraise_augmented
     File "apache_beam/runners/common.py", line 1465, in 
apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
     File "apache_beam/runners/common.py", line 551, in 
apache_beam.runners.common.DoFnInvoker.invoke_setup
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", 
line 374, in setup
       self._model = self._load_model()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", 
line 369, in _load_model
       return self._shared_model_handle.acquire(load)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 
305, in acquire
       return _shared_map.acquire(self._key, constructor_fn, tag)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 
246, in acquire
       result = control_block.acquire(constructor_fn, tag)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 
139, in acquire
       result = constructor_fn()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", 
line 358, in load
       model = self._model_handler.load_model()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", 
line 146, in load_model
       return self._unkeyed.load_model()
     File 
"/venv/lib/python3.8/site-packages/conjurer/feature_extractor/embedder/clip.py",
 line 51, in load_model
       model = open_clip.create_model(self.model_name, 
pretrained=self.pretrained, device=self._device)
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, 
in create_model
       model.load_state_dict(load_state_dict(checkpoint_path))
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in 
load_state_dict
       checkpoint = torch.load(checkpoint_path, map_location=map_location)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, 
in load
       return _legacy_load(opened_file, map_location, pickle_module, 
**pickle_load_args)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, 
in _legacy_load
       typed_storage._storage._set_from_file(
   RuntimeError: unexpected EOF, expected 1443121 more bytes. The file might be 
corrupted. [while running 
'OpenClipEmbedder(ViT-B-32-quickgelu)/PyTorchRunInference/ParDo(_RunInferenceDoFn)-ptransform-122']
   ```
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: sdk-py-core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to