damccorm commented on code in PR #28781:
URL: https://github.com/apache/beam/pull/28781#discussion_r1377691624


##########
sdks/python/apache_beam/runners/worker/sdk_worker_main.py:
##########
@@ -239,24 +241,24 @@ def _parse_pipeline_options(options_json):
   return PipelineOptions.from_dictionary(_load_pipeline_options(options_json))
 
 
-def _get_state_cache_size(experiments):
-  """Defines the upper number of state items to cache.
-
-  Note: state_cache_size is an experimental flag and might not be available in
-  future releases.
+def _get_state_cache_size_bytes(options):
+  """Return the maximum size of state cache in bytes.
 
   Returns:
-    an int indicating the maximum number of megabytes to cache.
+    an int indicating the maximum number of bytes to cache.
       Default is 100 MB
   """
-
+  max_cache_memory_usage_mb = options.view_as(
+      WorkerOptions).max_cache_memory_usage_mb
+  # to maintain backward compatibility
+  experiments = options.view_as(DebugOptions).experiments or []
   for experiment in experiments:
     # There should only be 1 match so returning from the loop
     if re.match(r'state_cache_size=', experiment):

Review Comment:
   Should we warn if users are still using the experiment? There is the 
possibility of them using both the experiment and the flag, which would lead 
to undefined behavior.
   
   We will eventually want to rip out the experiment.
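
   A minimal sketch of what such a warning could look like (the helper name, wording, and placement are assumptions for illustration; in the PR the check would live inside `_get_state_cache_size_bytes` where the experiment is matched):

```python
import logging
import re

_LOGGER = logging.getLogger(__name__)


def _warn_if_state_cache_experiment(experiments):
  """Warn when the deprecated state_cache_size experiment is still set.

  Hypothetical helper, not part of the PR as written.
  """
  for experiment in experiments:
    # There should only be one match, mirroring the loop in the PR.
    if re.match(r'state_cache_size=', experiment):
      _LOGGER.warning(
          'The state_cache_size experiment is deprecated; use the '
          '--max_cache_memory_usage_mb pipeline option instead. Setting '
          'both may lead to undefined behavior.')
      return True
  return False


if __name__ == '__main__':
  logging.basicConfig(level=logging.WARNING)
  # Example: the experiment is still present, so a warning is logged.
  _warn_if_state_cache_experiment(['state_cache_size=64'])
```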



##########
CHANGES.md:
##########
@@ -74,6 +74,7 @@ should handle this. 
([#25252](https://github.com/apache/beam/issues/25252)).
   jobs using the DataStream API. By default the option is set to false, so the 
batch jobs are still executed
   using the DataSet API.
 * `upload_graph` as one of the Experiments options for DataflowRunner is no 
longer required when the graph is larger than 10MB for Java SDK 
([PR#28621](https://github.com/apache/beam/pull/28621).
+* state cache has been enabled to a default of 100 MB. Use 
`--max_cache_memory_usage_mb=X` to provide cache size. (Python) 
([#28770](https://github.com/apache/beam/issues/28770)).

Review Comment:
   Could you add a short description of what this cache is and link to 
https://beam.apache.org/releases/pydoc/2.50.0/apache_beam.options.pipeline_options.html#module-apache_beam.options.pipeline_options?



##########
CHANGES.md:
##########
@@ -74,6 +74,7 @@ should handle this. 
([#25252](https://github.com/apache/beam/issues/25252)).
   jobs using the DataStream API. By default the option is set to false, so the 
batch jobs are still executed
   using the DataSet API.
 * `upload_graph` as one of the Experiments options for DataflowRunner is no 
longer required when the graph is larger than 10MB for Java SDK 
([PR#28621](https://github.com/apache/beam/pull/28621).
+* state cache has been enabled to a default of 100 MB. Use 
`--max_cache_memory_usage_mb=X` to provide cache size. (Python) 
([#28770](https://github.com/apache/beam/issues/28770)).

Review Comment:
   In particular, we should make it clear here that this impacts both user 
state and side inputs.
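
   One possible rewording that would address both comments above (only a sketch; the final phrasing and link text are up to the author):

```md
* The Python SDK worker's state cache, which stores user state and side inputs,
  is now enabled with a default size of 100 MB. Use `--max_cache_memory_usage_mb=X`
  to change the cache size; see
  [pipeline_options](https://beam.apache.org/releases/pydoc/2.50.0/apache_beam.options.pipeline_options.html#module-apache_beam.options.pipeline_options)
  for details. (Python) ([#28770](https://github.com/apache/beam/issues/28770)).
```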



##########
sdks/python/apache_beam/options/pipeline_options.py:
##########
@@ -1135,6 +1135,20 @@ def _add_argparse_args(cls, parser):
         dest='min_cpu_platform',
         type=str,
         help='GCE minimum CPU platform. Default is determined by GCP.')
+    parser.add_argument(
+        '--max_cache_memory_usage_mb',
+        dest='max_cache_memory_usage_mb',
+        type=int,
+        default=100,
+        help=(
+            'Size of the SdkHarness cache to store user state and side inputs '
+            'in MB. Default is 100MB. If the cache is full, least recently '
+            'used elements will be evicted. This cache will be per '
+            'SdkHarness/Sdk Process. SDKHarness is a python process,'

Review Comment:
   ```suggestion
               'SdkHarness. SDKHarness is a python process,'
   ```
   
   Optional, but I think this is clearer since we then explain what an SDK 
harness is right after this.
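
   For reference, a short usage sketch of how the new option would be consumed once this PR lands (the 256 MB value is just an illustrative override of the 100 MB default):

```python
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

# Override the default cache size via the new flag and read it back.
options = PipelineOptions(['--max_cache_memory_usage_mb=256'])
print(options.view_as(WorkerOptions).max_cache_memory_usage_mb)  # -> 256
```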


