rawwar opened a new pull request, #57238:
URL: https://github.com/apache/airflow/pull/57238

   Doing an `in` membership check against a list is not optimal when there is a large number of files. So we build a set from `self._file_queue` once and use that for the lookups instead.
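   
   As a rough sketch of what the change amounts to (the helper and variable names below are illustrative, not the actual Airflow code):
   
   ```
   from collections import deque
   
   def filter_new_files(candidate_files: list, file_queue: deque) -> list:
       # Before this change: each `f not in file_queue` scans the deque, so
       # filtering n candidates against a queue of m entries is O(n * m).
       # After: build the set once (O(m)); each lookup is then O(1) on average,
       # making the whole filter roughly O(n + m).
       queued = set(file_queue)
       return [f for f in candidate_files if f not in queued]
   ```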
   
   I did a test (just wanted to give it a try to see the gain):
   
   ```
   import timeit
   from collections import deque
   from pathlib import Path
   from airflow.dag_processing.manager import DagFileInfo
   
   queue = deque(DagFileInfo(Path(f"dag_{i}.py"), f"bundle_{i%10}", f"{i%5}.0") for i in range(400))
   files = [DagFileInfo(Path(f"dag_{i}.py"), f"bundle_{i%10}", f"{i%5}.0") for i in range(2000)]
   
   def test_current():
       return [f for f in files if f not in queue]
   
   def test_optimized():
       queue_set = set(queue)
       return [f for f in files if f not in queue_set]
   
   iterations = 1000
   current = timeit.timeit(test_current, number=iterations)
   optimized = timeit.timeit(test_optimized, number=iterations)
   
   print(f"Current:   {current:.3f}s ({current/iterations*1000:.2f}ms/call)")
   print(f"Optimized: {optimized:.3f}s ({optimized/iterations*1000:.2f}ms/call)")
   print(f"Speedup:   {current/optimized:.1f}x faster")
   ```
   
   ```
   Current:   97.487s (97.49ms/call)
   Optimized: 0.281s (0.28ms/call)
   Speedup:   346.6x faster
   ```
   
   I assumed there can be about 400 files in the queue while parsing about 2k files in total across dag bundles. I intentionally chose a high number of DAGs to make the performance gain visible.
   
   I'm keeping this in draft as I'm still checking how large the queue can actually get when parsing 2k files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
