potiuk commented on PR #36147: URL: https://github.com/apache/airflow/pull/36147#issuecomment-1863726693
> Hey @potiuk! We have implemented these changes :). However, I'm wondering whether the `lru_cache`s on the `_get_patterns` functions may be unnecessary, since the `_match` functions that call them are already wrapped by the cache decorator. Do you think they are still a good precaution? As I don't have much experience with caching, maybe I'm missing something...

These two serve different purposes and will work differently (you can compare them to the L1/L2 caches in a processor):

Tier 1 cache: `_get_patterns` will compile all the regular expressions exactly once per interpreter run. Even when different classes are checked for serialization, it returns the same compiled list of patterns over and over, without the need to compile them again.

Tier 2 cache: the `_match` functions will store precisely one True/False boolean PER class. When a cached function takes a string parameter, the cache works like a dictionary: one value is computed and stored per combination of arguments (the arguments need to be hashable and positional, which they are). So if you run `_match` on (say) `my_package.my_module.MyClass`, the boolean return value is calculated by matching the regexps and stored in the dictionary, and every subsequent check of the same class is returned very quickly from that dictionary. If you then call it for `my_package.my_module.MyClass2`, MyClass2 will be checked against the same list of compiled patterns: the matching happens once per class, but the patterns are reused because they are cached at Tier 1.

One more comment: there is a slight worry for Tier 2 that a lot of classes will be checked for serializability and the cache will grow a lot, but I think this growth will be small. First of all, it is unlikely that a vast number of classes get serialized.
Secondly, those classes are already in memory (otherwise they would not be checked), so the cache entries should fit in memory anyway. Keeping a dictionary of booleans keyed by class name is a very small overhead: the cache dictionary reuses the True/False objects (True and False are singletons), and the only cost is the hash calculated from the class's module and name.
