Strikerrx01 commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979634717


##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -351,6 +352,9 @@ class BigQueryWrapper(object):
   TEMP_DATASET = 'beam_temp_dataset_'
   HISTOGRAM_METRIC_LOGGER = MetricLogger()
+
+  # Default TTL for cached table definitions in seconds
+  DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour


Review Comment:
R: @sjvanrossum You're right about not mixing different caching implementations. After evaluating the options:

1. Our requirements:
   - High concurrency (hundreds of threads)
   - Short TTL (1 second, to match the Java SDK)
   - A size limit to prevent memory issues
   - Thread safety without lock contention

2. I'm leaning towards `cachetools.TTLCache` because it offers:
   - Built-in TTL support
   - LRU eviction
   - Thread safety when access is guarded by a lock (cachetools caches need external synchronization)
   - Good performance under high concurrency
   - Simple integration

3. For the cache scope issue:
   - Move from an instance variable to a module-level cache
   - Use a singleton pattern to ensure a single cache instance
   - Handle cache invalidation correctly across threads

Would this approach work?

```python
from cachetools import TTLCache
from functools import partial

# Module-level cache with a 1 second TTL to match the Java SDK.
TABLE_CACHE = TTLCache(
    maxsize=DEFAULT_CACHE_MAX_SIZE,
    ttl=1,  # 1 second TTL
    getsizeof=None)  # Use simple item counting
```

Let me know if you'd prefer a different approach or have concerns about using cachetools for this use case.
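
To make the thread-safety caveat above concrete, here's a minimal sketch (not code from this PR; `DEFAULT_CACHE_MAX_SIZE`, `_TABLE_CACHE_LOCK`, `_get_cached_table`, and `fetch_fn` are illustrative names) of how lookups against the module-level cache could be synchronized:

```python
import threading

from cachetools import TTLCache

DEFAULT_CACHE_MAX_SIZE = 1000  # Illustrative value, not from the PR.

# Module-level cache with a 1 second TTL to match the Java SDK.
TABLE_CACHE = TTLCache(maxsize=DEFAULT_CACHE_MAX_SIZE, ttl=1)
_TABLE_CACHE_LOCK = threading.Lock()


def _get_cached_table(table_ref, fetch_fn):
  """Returns a cached table definition, fetching it on a cache miss.

  table_ref is assumed to be hashable (e.g. a 'project.dataset.table'
  string) and fetch_fn is a zero-argument callable that performs the
  actual tables.get API call.
  """
  with _TABLE_CACHE_LOCK:
    cached = TABLE_CACHE.get(table_ref)
  if cached is not None:
    return cached

  # Fetch outside the lock so a slow API call doesn't serialize all threads;
  # concurrent misses may each issue a tables.get, which is acceptable here.
  table = fetch_fn()

  with _TABLE_CACHE_LOCK:
    TABLE_CACHE[table_ref] = table
  return table
```

Fetching outside the lock trades a possible duplicate `tables.get` call on a concurrent miss for lower lock contention, which seems consistent with the "thread safety without lock contention" requirement above.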
########## sdks/python/apache_beam/io/gcp/bigquery_tools.py: ########## @@ -351,6 +352,9 @@ class BigQueryWrapper(object): TEMP_DATASET = 'beam_temp_dataset_' HISTOGRAM_METRIC_LOGGER = MetricLogger() + + # Default TTL for cached table definitions in seconds + DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour Review Comment: R: @sjvanrossum You're right about not mixing different caching implementations. After evaluating the options: 1. For our requirements: - High concurrency (hundreds of threads) - Short TTL (1s to match Java SDK) - Size limits to prevent memory issues - Thread safety without lock contention 2. I'm leaning towards `cachetools.TTLCache` because: - Built-in TTL support - LRU eviction strategy - Thread-safe implementation - Good performance under high concurrency - Simple integration 3. For the cache scope issue: - Move from instance variable to module-level cache - Use singleton pattern to ensure single cache instance - Properly handle cache invalidation across threads Would this approach work: from cachetools import TTLCache from functools import partial #Module level cache with 1s TTL to match Java SDK TABLE_CACHE = TTLCache( maxsize=DEFAULT_CACHE_MAX_SIZE, ttl=1, # 1 second TTL getsizeof=None)# Use simple item counting Let me know if you'd prefer a different approach or have concerns about using cachetools for this use case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org