fabito opened a new issue, #22402:
URL: https://github.com/apache/beam/issues/22402
### What would you like to happen?
I am running a pipeline that extracts image embeddings using `open-clip-torch`
and the RunInference API on Dataflow.
Sometimes, especially when the `DataflowRunner` triggers a scale-up, workers
become unhealthy because of corrupted model files.
Whenever that happens, the whole job fails.
Would it be possible to detect corrupted model files and reload them?
For more details, see the log below:
```
Traceback (most recent call last):
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute
    response = task()
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 597, in do_instruction
    return getattr(self, request_type)(
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 628, in process_bundle
    bundle_processor = self.bundle_processor_cache.get(
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 458, in get
    processor = bundle_processor.BundleProcessor(
  File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 873, in __init__
    op.setup()
  File "apache_beam/runners/worker/operations.py", line 833, in apache_beam.runners.worker.operations.DoOperation.setup
  File "apache_beam/runners/worker/operations.py", line 882, in apache_beam.runners.worker.operations.DoOperation.setup
  File "apache_beam/runners/common.py", line 1471, in apache_beam.runners.common.DoFnRunner.setup
  File "apache_beam/runners/common.py", line 1467, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
  File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1465, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
  File "apache_beam/runners/common.py", line 551, in apache_beam.runners.common.DoFnInvoker.invoke_setup
  File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 374, in setup
    self._model = self._load_model()
  File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 369, in _load_model
    return self._shared_model_handle.acquire(load)
  File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 305, in acquire
    return _shared_map.acquire(self._key, constructor_fn, tag)
  File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 246, in acquire
    result = control_block.acquire(constructor_fn, tag)
  File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 139, in acquire
    result = constructor_fn()
  File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 358, in load
    model = self._model_handler.load_model()
  File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 146, in load_model
    return self._unkeyed.load_model()
  File "/venv/lib/python3.8/site-packages/conjurer/feature_extractor/embedder/clip.py", line 51, in load_model
    model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
  File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, in create_model
    model.load_state_dict(load_state_dict(checkpoint_path))
  File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in load_state_dict
    checkpoint = torch.load(checkpoint_path, map_location=map_location)
  File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, in _legacy_load
    typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1443121 more bytes. The file might be corrupted. [while running 'OpenClipEmbedder(ViT-B-32-quickgelu)/PyTorchRunInference/ParDo(_RunInferenceDoFn)-ptransform-122']
```
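The traceback shows the failure surfacing as a `RuntimeError` from `torch.load` inside the custom handler's `load_model`. As a possible stopgap on the user side (not an existing Beam feature), the load could be wrapped in a retry that deletes the suspect checkpoint before trying again. The sketch below is illustrative only: `load_with_retry`, `checkpoint_path`, `max_attempts`, and `backoff_seconds` are hypothetical names, and it assumes that the wrapped loader fetches the checkpoint again when the cached file is missing, which would need to be verified for `open_clip`.

```python
import os
import time


def load_with_retry(load_fn, checkpoint_path, max_attempts=3,
                    backoff_seconds=1.0, sleep=time.sleep):
    """Call load_fn(); on RuntimeError, drop the cached file and retry.

    Hypothetical helper, not part of Beam, PyTorch, or open_clip.
    Assumes load_fn re-creates checkpoint_path when it is missing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn()
        except RuntimeError:
            # Give up after the last attempt and surface the original error.
            if attempt == max_attempts:
                raise
            # Remove the possibly truncated file so it is not reused.
            if os.path.exists(checkpoint_path):
                os.remove(checkpoint_path)
            # Linear backoff between attempts.
            sleep(backoff_seconds * attempt)
```

Inside the handler's `load_model`, `load_fn` could be a lambda around the `open_clip.create_model(...)` call shown in the traceback, with `checkpoint_path` pointing at wherever the pretrained weights are cached on the worker. Catching `RuntimeError` is deliberately broad here; a stricter version might inspect the message for "unexpected EOF" before deleting anything.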
### Issue Priority
Priority: 3
### Issue Component
Component: sdk-py-core