Strikerrx01 commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979589935
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -386,8 +390,28 @@ def __init__(self, client=None, temp_dataset_id=None,
temp_table_ref=None):
self._temporary_table_suffix = uuid.uuid4().hex
self.temp_dataset_id = temp_dataset_id or self._get_temp_dataset()
+ # Initialize table definition cache with default TTL of 1 hour
+ # Cache entries are invalidated after TTL expires to ensure fresh metadata
+ self._table_cache = {}
Review Comment:
@sjvanrossum Thanks for the guidance. Here's my evaluation of the available
caching packages:
1. From `functools`:
   - `@functools.lru_cache` - Thread-safe, but no TTL support
   - `@functools.cache` - Simple, but it's just `lru_cache(maxsize=None)`, so no size limit or TTL
2. From `cachetools`:
   - `TTLCache` - Has TTL and a size limit, but isn't thread-safe on its own; callers must wrap access in a lock
   - `LRUCache` - Good size management, but no TTL
   - `cached` decorator - Combines a cache with a lock, but every access goes through that single lock (rough usage sketched after this list)
3. Other options:
   - `fastcache` - C implementation, very fast, but LRU-only with no TTL
   - `pylru` - Pure Python, good for LRU, but no TTL
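For concreteness, here's roughly what option 2 would look like wired up. This
is only a sketch: `get_table_definition` is a stand-in name, the body fakes the
service call, and the `maxsize`/`ttl` values are placeholders rather than
anything decided in this PR.

```python
import threading
import time

from cachetools import TTLCache, cached

# Placeholder sizing: at most 1024 cached table definitions, expiring after
# 1 second so schema changes propagate quickly.
_table_cache = TTLCache(maxsize=1024, ttl=1)
_table_cache_lock = threading.Lock()


@cached(cache=_table_cache, lock=_table_cache_lock)
def get_table_definition(project_id, dataset_id, table_id):
  # Stand-in for the real BigQuery tables.get call.
  time.sleep(0.1)  # simulate service latency
  return {'tableReference': (project_id, dataset_id, table_id)}
```

One caveat: `cached` releases the lock while the wrapped function runs, so
concurrent misses on the same key can still issue duplicate fetches, and every
hit contends on the single lock, which is exactly the scaling concern below.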
Since we need:
- TTL for quick schema propagation (~1s)
- Thread safety across hundreds of concurrent threads
- A size limit to bound memory usage
- Lock scaling under high concurrency (one sharding idea sketched below)
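To make the lock-scaling question concrete, one approach I'd want to benchmark
is sharding the cache so each shard has its own lock, so threads touching
different tables rarely contend. `ShardedTTLCache` and `get_or_load` are names
I made up for this sketch; nothing like this exists in the PR yet.

```python
import threading

from cachetools import TTLCache


class ShardedTTLCache(object):
  """Hypothetical sketch: N independent TTLCache shards, each guarded by
  its own lock, to spread contention across shards."""
  def __init__(self, num_shards=16, maxsize_per_shard=64, ttl=1):
    self._shards = [
        TTLCache(maxsize=maxsize_per_shard, ttl=ttl)
        for _ in range(num_shards)
    ]
    self._locks = [threading.Lock() for _ in range(num_shards)]

  def get_or_load(self, key, loader):
    # Pick a shard by hashing the key; only that shard's lock is taken.
    index = hash(key) % len(self._shards)
    with self._locks[index]:
      try:
        return self._shards[index][key]
      except KeyError:
        value = loader()  # held under the shard lock to avoid duplicate loads
        self._shards[index][key] = value
        return value
```

Holding the shard lock across `loader()` prevents duplicate fetches at the
cost of serializing loads within a shard; whether that trade-off pays off
under hundreds of threads is what I'd measure.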
Do you have any recommendations on which package would be most suitable for
our use case? I'm particularly interested in your thoughts on lock scaling with
high thread counts, since you mentioned this cache could be accessed by
hundreds of threads.
I can also do some performance testing of the different options, focusing on
lock contention under high thread counts, if that would be helpful.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]