Strikerrx01 commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979634717


##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -351,6 +352,9 @@ class BigQueryWrapper(object):
   TEMP_DATASET = 'beam_temp_dataset_'
   HISTOGRAM_METRIC_LOGGER = MetricLogger()
+
+  # Default TTL for cached table definitions in seconds
+  DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour


Review Comment:
R: @sjvanrossum You're right about not mixing different caching implementations. After evaluating the options:

1. Our requirements:
   - High concurrency (hundreds of threads)
   - Short TTL (1 second, to match the Java SDK)
   - A size limit to prevent memory issues
   - Thread safety without lock contention

2. I'm leaning towards `cachetools.TTLCache` because it offers:
   - Built-in TTL support
   - LRU eviction
   - Thread safety when access is guarded by a lock (cachetools caches need external synchronization)
   - Good performance under high concurrency
   - Simple integration

3. For the cache scope issue:
   - Move from an instance variable to a module-level cache
   - Use a singleton pattern to ensure a single cache instance
   - Handle cache invalidation correctly across threads

Would this approach work?

```python
from cachetools import TTLCache
from functools import partial

# Module-level cache with a 1 second TTL to match the Java SDK.
TABLE_CACHE = TTLCache(
    maxsize=DEFAULT_CACHE_MAX_SIZE,
    ttl=1,  # 1 second TTL
    getsizeof=None)  # Use simple item counting
```

Let me know if you'd prefer a different approach or have concerns about using cachetools for this use case.
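
To make the thread-safety caveat above concrete, here's a minimal sketch (not code from this PR; `DEFAULT_CACHE_MAX_SIZE`, `_TABLE_CACHE_LOCK`, `_get_cached_table`, and `fetch_fn` are illustrative names) of how lookups against the module-level cache could be synchronized:

```python
import threading

from cachetools import TTLCache

DEFAULT_CACHE_MAX_SIZE = 1000  # Illustrative value, not from the PR.

# Module-level cache with a 1 second TTL to match the Java SDK.
TABLE_CACHE = TTLCache(maxsize=DEFAULT_CACHE_MAX_SIZE, ttl=1)
_TABLE_CACHE_LOCK = threading.Lock()


def _get_cached_table(table_ref, fetch_fn):
  """Returns a cached table definition, fetching it on a cache miss.

  table_ref is assumed to be hashable (e.g. a 'project.dataset.table'
  string) and fetch_fn is a zero-argument callable that performs the
  actual tables.get API call.
  """
  with _TABLE_CACHE_LOCK:
    cached = TABLE_CACHE.get(table_ref)
  if cached is not None:
    return cached

  # Fetch outside the lock so a slow API call doesn't serialize all threads;
  # concurrent misses may each issue a tables.get, which is acceptable here.
  table = fetch_fn()

  with _TABLE_CACHE_LOCK:
    TABLE_CACHE[table_ref] = table
  return table
```

Fetching outside the lock trades a possible duplicate `tables.get` call on a concurrent miss for lower lock contention, which seems consistent with the "thread safety without lock contention" requirement above.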
########## sdks/python/apache_beam/io/gcp/bigquery_tools.py: ########## @@ -351,6 +352,9 @@ class BigQueryWrapper(object): TEMP_DATASET = 'beam_temp_dataset_' HISTOGRAM_METRIC_LOGGER = MetricLogger() + + # Default TTL for cached table definitions in seconds + DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour Review Comment: R: @sjvanrossum You're right about not mixing different caching implementations. After evaluating the options: 1. For our requirements: - High concurrency (hundreds of threads) - Short TTL (1s to match Java SDK) - Size limits to prevent memory issues - Thread safety without lock contention 2. I'm leaning towards `cachetools.TTLCache` because: - Built-in TTL support - LRU eviction strategy - Thread-safe implementation - Good performance under high concurrency - Simple integration 3. For the cache scope issue: - Move from instance variable to module-level cache - Use singleton pattern to ensure single cache instance - Properly handle cache invalidation across threads Would this approach work: from cachetools import TTLCache from functools import partial #Module level cache with 1s TTL to match Java SDK TABLE_CACHE = TTLCache( maxsize=DEFAULT_CACHE_MAX_SIZE, ttl=1, # 1 second TTL getsizeof=None)# Use simple item counting Let me know if you'd prefer a different approach or have concerns about using cachetools for this use case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org