jedcunningham commented on code in PR #61489:
URL: https://github.com/apache/airflow/pull/61489#discussion_r2770630367


##########
providers/git/src/airflow/providers/git/bundles/git.py:
##########
@@ -169,18 +169,76 @@ def initialize(self) -> None:
         reraise=True,
     )
     def _clone_repo_if_required(self) -> None:
+        """
+        Clone repository if required, with atomic directory creation to 
prevent races.
+
+        This method ensures that:
+        1. Only one process clones a specific version at a time (version-level 
locking)

Review Comment:
   We already have this: 
https://github.com/apache/airflow/blob/7ab6e9b3009a7a7de9c46b03d8f02b16de3faa1e/providers/git/src/airflow/providers/git/bundles/git.py#L119
   
   



##########
providers/git/src/airflow/providers/git/bundles/git.py:
##########
@@ -169,18 +169,76 @@ def initialize(self) -> None:
         reraise=True,
     )
     def _clone_repo_if_required(self) -> None:
+        """
+        Clone repository if required, with atomic directory creation to 
prevent races.
+
+        This method ensures that:
+        1. Only one process clones a specific version at a time (version-level 
locking)
+        2. Clones are atomic - no partial clones visible to other processes

Review Comment:
   And that lock should prevent others from seeing partials too.



##########
providers/git/src/airflow/providers/git/bundles/git.py:
##########
@@ -169,18 +169,76 @@ def initialize(self) -> None:
         reraise=True,
     )
     def _clone_repo_if_required(self) -> None:
+        """
+        Clone repository if required, with atomic directory creation to 
prevent races.
+
+        This method ensures that:
+        1. Only one process clones a specific version at a time (version-level 
locking)
+        2. Clones are atomic - no partial clones visible to other processes
+        3. Existing valid clones are reused safely
+        4. Race conditions during concurrent task execution are handled
+        """
         try:
-            if not os.path.exists(self.repo_path):
+            # Fast path: if repo exists and is valid, reuse it
+            if os.path.exists(self.repo_path):
+                try:
+                    self.repo = Repo(self.repo_path)
+                    # Validate it's a real repo by accessing a property
+                    _ = self.repo.head
+                    self._log.debug("reusing existing valid repo", 
repo_path=self.repo_path)
+                    return
+                except (InvalidGitRepositoryError, GitCommandError, 
ValueError) as e:
+                    # Repo exists but is invalid/corrupted - clean it up
+                    self._log.warning(
+                        "Existing repo is invalid, will re-clone",

Review Comment:
   And, afaik this is the root cause of the issue you picked up - with 
`prune_dotgit_folder` true you always end up with an "invalid" repo, leading to 
churn. That's the thing we need to ultimately fix - `prune_dotgit_folder` 
should not result in needing to reclone to reuse the bundle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to