Strikerrx01 commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979589935
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -386,8 +390,28 @@ def __init__(self, client=None, temp_dataset_id=None,
temp_table_ref=None):
self._temporary_table_suffix = uuid.uuid4().hex
self.temp_dataset_id = temp_dataset_id or self._get_temp_dataset()
+ # Initialize table definition cache with default TTL of 1 hour
+ # Cache entries are invalidated after TTL expires to ensure fresh metadata
+ self._table_cache = {}
Review Comment:
@sjvanrossum Thanks for the guidance. Here's my evaluation of the available
caching packages:
1. From `functools`:
   - `@functools.lru_cache` - Thread-safe, but no TTL support
   - `@functools.cache` - Simple, but it's just `lru_cache(maxsize=None)`, so no size limit or TTL
2. From `cachetools`:
   - `TTLCache` - Has TTL and a size limit, but isn't thread-safe on its own; callers must wrap access in a lock
   - `LRUCache` - Good size management, but no TTL
   - `cached` decorator - Combines a cache with a lock, but every access goes through that single lock (rough usage sketched after this list)
3. Other options:
   - `fastcache` - C implementation, very fast, but LRU-only with no TTL
   - `pylru` - Pure Python, good for LRU, but no TTL
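For concreteness, here's roughly what option 2 would look like wired up. This
is only a sketch: `get_table_definition` is a stand-in name, the body fakes the
service call, and the `maxsize`/`ttl` values are placeholders rather than
anything decided in this PR.

```python
import threading
import time

from cachetools import TTLCache, cached

# Placeholder sizing: at most 1024 cached table definitions, expiring after
# 1 second so schema changes propagate quickly.
_table_cache = TTLCache(maxsize=1024, ttl=1)
_table_cache_lock = threading.Lock()


@cached(cache=_table_cache, lock=_table_cache_lock)
def get_table_definition(project_id, dataset_id, table_id):
  # Stand-in for the real BigQuery tables.get call.
  time.sleep(0.1)  # simulate service latency
  return {'tableReference': (project_id, dataset_id, table_id)}
```

One caveat: `cached` releases the lock while the wrapped function runs, so
concurrent misses on the same key can still issue duplicate fetches, and every
hit contends on the single lock, which is exactly the scaling concern below.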
Since we need:
- TTL for quick schema propagation (~1s)
- Thread safety across hundreds of concurrent threads
- A size limit to bound memory usage
- Lock scaling under high concurrency (one sharding idea sketched below)
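To make the lock-scaling question concrete, one approach I'd want to benchmark
is sharding the cache so each shard has its own lock, so threads touching
different tables rarely contend. `ShardedTTLCache` and `get_or_load` are names
I made up for this sketch; nothing like this exists in the PR yet.

```python
import threading

from cachetools import TTLCache


class ShardedTTLCache(object):
  """Hypothetical sketch: N independent TTLCache shards, each guarded by
  its own lock, to spread contention across shards."""
  def __init__(self, num_shards=16, maxsize_per_shard=64, ttl=1):
    self._shards = [
        TTLCache(maxsize=maxsize_per_shard, ttl=ttl)
        for _ in range(num_shards)
    ]
    self._locks = [threading.Lock() for _ in range(num_shards)]

  def get_or_load(self, key, loader):
    # Pick a shard by hashing the key; only that shard's lock is taken.
    index = hash(key) % len(self._shards)
    with self._locks[index]:
      try:
        return self._shards[index][key]
      except KeyError:
        value = loader()  # held under the shard lock to avoid duplicate loads
        self._shards[index][key] = value
        return value
```

Holding the shard lock across `loader()` prevents duplicate fetches at the
cost of serializing loads within a shard; whether that trade-off pays off
under hundreds of threads is what I'd measure.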
Do you have any recommendations on which package would be most suitable for
our use case? I'm particularly interested in your thoughts on lock scaling with
high thread counts, since you mentioned this cache could be accessed by
hundreds of threads.
I can also do some performance testing of the different options, focusing on
lock contention under high thread counts, if that would be helpful.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]