potiuk commented on PR #36147: URL: https://github.com/apache/airflow/pull/36147#issuecomment-1863726693
> Hey @potiuk! We have implemented these changes :). However, I'm wondering whether the `lru_cache`s on the `_get_patterns` functions may be unnecessary, since the `_match` functions that call them are already wrapped by the cache decorator. Do you think they are still a good precaution? As I don't have much experience with caching, maybe I'm missing something...

These two serve different purposes and will work differently (you can compare them to the L1/L2 caches in a processor):

Tier 1 cache: `_get_patterns` will compile all the regular expressions exactly once per interpreter run. Even when different classes are checked for serialization, it returns the same compiled list of patterns over and over, without the need to compile them again.

Tier 2 cache: the `_match` functions will store precisely one True/False boolean PER class. When a cached function takes a string parameter, the cache works like a dictionary: one value is computed and stored per combination of arguments (the arguments need to be hashable and positional, which they are). So if you run `_match` on (say) `my_package.my_module.MyClass`, the boolean return value is calculated by matching the regexps and stored in the dictionary, and every subsequent check of the same class is returned very quickly from that dictionary. If you then call it for `my_package.my_module.MyClass2`, MyClass2 will be checked against the same list of compiled patterns: the matching happens once per class, but the patterns are reused because they are cached at Tier 1.

One more comment: there is a slight worry for Tier 2 that a lot of classes will be checked for serializability and the cache will grow a lot, but I think this growth will be small. First of all, it is unlikely that a vast number of classes get serialized.
Secondly, those classes are already in memory (otherwise they would not be checked), so the cache entries should fit in memory anyway. Keeping a dictionary of booleans keyed by class name is a very small overhead: the cache dictionary reuses the True/False objects (True and False are singletons), and the only cost is the hash calculated from the class's module and name.
