rszper commented on code in PR #28243:
URL: https://github.com/apache/beam/pull/28243#discussion_r1322098589


##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -482,6 +482,12 @@ def __init__(
     from the cohort. When model updates occur, the metrics will be reported in
     the form `<cohort_key>-<model id>-<metric_name>`.
 
+    Loading multiple models at once can introduce greater risks of an Out of

Review Comment:
   ```suggestion
    Loading multiple models at the same time can increase the risk of an out of
   ```



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -482,6 +482,12 @@ def __init__(
     from the cohort. When model updates occur, the metrics will be reported in
     the form `<cohort_key>-<model id>-<metric_name>`.
 
+    Loading multiple models at once can introduce greater risks of an Out of
+    Memory (OOM) exception. To avoid this, you can use the parameter

Review Comment:
   ```suggestion
       memory (OOM) exception. To avoid this issue, use the parameter
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated

Review Comment:
   ```suggestion
   The previous example loads a model by using `config1`. That model is then used for inference for all examples associated
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.

Review Comment:
   ```suggestion
   with `key1`. It then loads a model by using `config2`. That model is used for all examples associated with `key2` and `key3`.
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,

Review Comment:
   ```suggestion
   limit the number of models loaded into memory at the same time. If the models don't all fit into memory,
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the
+memory being released, so it is recommended that you leave some additional 
buffer.
+
+**Note**: If you have many models but a small `max_models_per_worker_hint`, 
that can lead to _memory thrashing_ where
+a large amount of execution time is wasted swapping models in and out of 
memory. To reduce the likelihood and impact
+of memory thrashing, consider inserting a
+[GroupByKey](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey/)
 transform before your
+inference step if you are using a distributed runner. This will ensure that 
elements with the same key/model are

Review Comment:
   ```suggestion
   inference step. This step reduces thrashing by ensuring that elements with the same key and model are
   ```
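   To make the colocation idea concrete, here's a minimal sketch (reusing the `data` and `keyed_model_handler` names from the example above; the transform labels are illustrative): group by key and then flatten back to `(key, example)` pairs before `RunInference`, so that elements sharing a key, and therefore a model, land on the same worker.

   ```
   import apache_beam as beam

   # Group elements by key so that elements sharing a key (and model) are
   # colocated on the same worker, then flatten back to (key, example) pairs.
   predictions = (
       data
       | "ColocateByKey" >> beam.GroupByKey()
       | "Ungroup" >> beam.FlatMap(
           lambda kv: [(kv[0], example) for example in kv[1]])
       | RunInference(keyed_model_handler))
   ```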



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the
+memory being released, so it is recommended that you leave some additional 
buffer.
+
+**Note**: If you have many models but a small `max_models_per_worker_hint`, 
that can lead to _memory thrashing_ where
+a large amount of execution time is wasted swapping models in and out of 
memory. To reduce the likelihood and impact
+of memory thrashing, consider inserting a

Review Comment:
   ```suggestion
   of memory thrashing, if you're using a distributed runner, insert a
   ```



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -482,6 +482,12 @@ def __init__(
     from the cohort. When model updates occur, the metrics will be reported in
     the form `<cohort_key>-<model id>-<metric_name>`.
 
+    Loading multiple models at once can introduce greater risks of an Out of
+    Memory (OOM) exception. To avoid this, you can use the parameter
+    `max_models_per_worker_hint` to limit the number of models loaded at once.
+    For more information on memory management, see
+    
https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler

Review Comment:
   ```suggestion
    [Use a keyed `ModelHandler`](https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler).
   ```



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -491,7 +497,9 @@ def __init__(
         models can be held in memory at one time per worker process. For
         example, if your worker has 8 GB of memory provisioned and your workers
         take up 1 GB each, you should set this to 7 to allow all models to sit
-        in memory with some buffer.
+        in memory with some buffer. For more information on memory management,
+        see
+        
https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler

Review Comment:
   ```suggestion
        [Use a keyed `ModelHandler`](https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler).
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not

Review Comment:
   ```suggestion
   Loading multiple models at the same time increases the risk of out of memory (OOM) errors. By default, `KeyedModelHandler` doesn't
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.

Review Comment:
   ```suggestion
   maximum number of models that can be loaded at the same time.
   ```



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -482,6 +482,12 @@ def __init__(
     from the cohort. When model updates occur, the metrics will be reported in
     the form `<cohort_key>-<model id>-<metric_name>`.
 
+    Loading multiple models at once can introduce greater risks of an Out of
+    Memory (OOM) exception. To avoid this, you can use the parameter
+    `max_models_per_worker_hint` to limit the number of models loaded at once.

Review Comment:
   ```suggestion
    `max_models_per_worker_hint` to limit the number of models that are loaded at the same time.
   ```
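   If it helps readers of the docstring, a short illustrative snippet (mirroring the website example in this PR; `mhs` is the list of `KeyModelMapping` entries) shows the parameter in use:

   ```
   # Hold at most roughly two models in memory per SDK worker process.
   keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
   ```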



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -482,6 +482,12 @@ def __init__(
     from the cohort. When model updates occur, the metrics will be reported in
     the form `<cohort_key>-<model id>-<metric_name>`.
 
+    Loading multiple models at once can introduce greater risks of an Out of
+    Memory (OOM) exception. To avoid this, you can use the parameter
+    `max_models_per_worker_hint` to limit the number of models loaded at once.
+    For more information on memory management, see

Review Comment:
   ```suggestion
       For more information about memory management, see
   ```



##########
sdks/python/apache_beam/ml/inference/base.py:
##########
@@ -491,7 +497,9 @@ def __init__(
         models can be held in memory at one time per worker process. For
         example, if your worker has 8 GB of memory provisioned and your workers
         take up 1 GB each, you should set this to 7 to allow all models to sit
-        in memory with some buffer.
+        in memory with some buffer. For more information on memory management,

Review Comment:
   ```suggestion
        in memory with some buffer. For more information about memory management,
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the

Review Comment:
   ```suggestion
   your pipeline will likely fail with an out of memory error. To avoid this issue, provide a hint about the
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the
+memory being released, so it is recommended that you leave some additional 
buffer.

Review Comment:
   ```suggestion
   memory is released, it is recommended that you leave additional buffer.
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't

Review Comment:
   ```suggestion
   The previous example loads at most two models per SDK worker process at any given time. It unloads models that aren't
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models

Review Comment:
   ```suggestion
   `max_models_per_worker_hint*<num worker processes>` models onto the machine. Leave enough space for the models
   ```
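   As a rough sizing check (the worker-process count and per-model size below are assumed values, not from this PR):

   ```
   # Estimate peak model memory on one machine.
   max_models_per_worker_hint = 2
   num_worker_processes = 4   # assumed: depends on the runner and machine
   model_size_gb = 1.5        # assumed: per-model memory footprint
   peak_gb = max_models_per_worker_hint * num_worker_processes * model_size_gb
   print(peak_gb)  # 12.0 GB; leave headroom for other transforms and delayed unloads
   ```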



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most

Review Comment:
   ```suggestion
   currently being used. Runners that have multiple SDK worker processes on a given machine load at most
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the

Review Comment:
   ```suggestion
   and any additional memory needs from other transforms. Because there might be a delay between when a model is offloaded and when the
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the
+memory being released, so it is recommended that you leave some additional 
buffer.
+
+**Note**: If you have many models but a small `max_models_per_worker_hint`, 
that can lead to _memory thrashing_ where
+a large amount of execution time is wasted swapping models in and out of 
memory. To reduce the likelihood and impact
+of memory thrashing, consider inserting a
+[GroupByKey](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey/)
 transform before your
+inference step if you are using a distributed runner. This will ensure that 
elements with the same key/model are
+colocated on the same worker, reducing thrashing.

Review Comment:
   ```suggestion
   collocated on the same worker.
   ```



##########
website/www/site/content/en/documentation/sdks/python-machine-learning.md:
##########
@@ -215,6 +215,54 @@ with pipeline as p:
 
 If you are unsure if your data is keyed, you can also use 
`MaybeKeyedModelHandler`.
 
+You can also use a `KeyedModelHandler` to load several different models based 
on their associated key:
+
+```
+from apache_beam.ml.inference.base import KeyedModelHandler
+keyed_model_handler = KeyedModelHandler([
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>))
+])
+with pipeline as p:
+   data = p | beam.Create([
+      ('key1', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key2', torch.tensor([[1,2,3],[4,5,6],...])),
+      ('key3', torch.tensor([[1,2,3],[4,5,6],...])),
+   ])
+   predictions = data | RunInference(keyed_model_handler)
+```
+
+The previous example will load a model using `config1` and use that for 
inference for all examples associated
+with `key1`, and will load a model using `config2` and use that for all 
examples associated with `key2` and `key3`.
+
+There are memory risks associated with loading multiple models at once. By 
default, `KeyedModelHandler` will not
+limit the number of models loaded into memory at once. This means that if not 
all models fit into memory at once,
+your pipeline will likely fail with an Out of Memory exception. To avoid this, 
you can provide a hint about the
+maximum number of models loaded at once.
+
+```
+mhs = [
+  KeyModelMapping(['key1'], PytorchModelHandlerTensor(<config1>)),
+  KeyModelMapping(['key2', 'key3'], PytorchModelHandlerTensor(<config2>)),
+  KeyModelMapping(['key4'], PytorchModelHandlerTensor(<config3>)),
+  KeyModelMapping(['key5', 'key6', 'key7'], 
PytorchModelHandlerTensor(<config4>)),
+]
+keyed_model_handler = KeyedModelHandler(mhs, max_models_per_worker_hint=2)
+```
+
+The previous example will load at most 2 models per SDK worker process at any 
given time, and will unload models that aren't
+currently being used as needed. Runners that have multiple SDK worker 
processes on a given machine will load at most
+`max_models_per_worker_hint*<num worker processes>` models onto the machine. 
Make sure you leave enough space for the models
+and any additional memory needs from other transforms. There may also be some 
delay between offloading a model and the
+memory being released, so it is recommended that you leave some additional 
buffer.
+
+**Note**: If you have many models but a small `max_models_per_worker_hint`, 
that can lead to _memory thrashing_ where

Review Comment:
   ```suggestion
   **Note**: Having many models but a small `max_models_per_worker_hint` can lead to _memory thrashing_, where
   ```


