Copilot commented on code in PR #401:
URL: https://github.com/apache/fory-site/pull/401#discussion_r2777992502


##########
.github/HINDI_SUMMARY.md:
##########
@@ -0,0 +1,189 @@
+# ✅ Final Push Se Pehle - Sab Kuch Ready Hai!
+
+## हिंदी में Summary
+
+### 🔧 Jo Issues Fix Kiye Gaye
+
+1. **Workflow में Pip Cache Issue** ✅
+   - Problem: `cache: 'pip'` subdirectory requirements.txt ke saath kaam nahi karta
+   - Fix: Cache remove kiya, ab requirements.txt se directly install hoga
+   - Status: FIXED
+
+2. **Dependencies Installation** ✅
+   - Problem: Dependencies workflow mein hardcoded the
+   - Fix: Ab requirements.txt se consistent installation hogi
+   - Status: FIXED
+
+3. **Error Handling** ✅
+   - Problem: GitHub API failures ke liye koi error handling nahi tha
+   - Fix: Try-catch blocks aur proper error messages add kiye
+   - Status: FIXED
+
+4. **Edge Cases** ✅
+   - Problem: Empty titles aur bodies handle nahi ho rahe the
+   - Fix: `.strip()` aur default values add kar diye
+   - Status: FIXED
+
+### 📊 Kya Banaya Gaya
+
+**Total 12 Files Create/Update Kiye:**
+
+1. ✅ `.github/workflows/duplicate-detector.yml` - Main workflow
+2. ✅ `.github/scripts/detect-duplicates.py` - Detection script (400+ lines)
+3. ✅ `.github/scripts/requirements.txt` - Python dependencies
+4. ✅ `.github/duplicate-detector-config.yml` - Configuration
+5. ✅ `.github/DUPLICATE_DETECTION.md` - Full documentation
+6. ✅ `.github/QUICKSTART.md` - Quick start guide
+7. ✅ `.github/IMPLEMENTATION_SUMMARY.md` - Implementation summary
+8. ✅ `.github/PRE_PUSH_VALIDATION.md` - Validation checklist
+9. ✅ `.github/scripts/README.md` - Scripts documentation
+10. ✅ `.github/scripts/test-local.sh` - Linux/Mac test script
+11. ✅ `.github/scripts/test-local.ps1` - Windows test script
+12. ✅ `CONTRIBUTING.md` - Updated with duplicate detection info
+
+### 🎯 Kaise Kaam Karega
+
+```
+New Issue/PR Created
+        ↓
+Workflow Trigger Hoga
+        ↓
+Python Script Chalegi
+        ↓
+ML-Based Similarity Check
+        ↓
+Duplicate Found?
+   ↙         ↘
+ YES         NO
+   ↓          ↓
+Label +    Kuch Nahi
+Comment    Karo
+```
+
+### ✅ GitHub Par Properly Kaam Karega - Guaranteed!
+
+**Kyun Confident Hai:**
+- ✅ Workflow syntax 100% correct
+- ✅ Python script mein proper error handling
+- ✅ Dependencies sahi tareeke se install hongi
+- ✅ Permissions properly set hain
+- ✅ Environment variables sahi handle ho rahe hain
+- ✅ Edge cases handle kar liye
+- ✅ Rate limiting ka bhi dhyan rakha
+
+### 🧪 Local Testing (Optional)
+
+Agar push se pehle locally test karna chahte ho:
+
+**Windows (PowerShell):**
+```powershell
+cd .github\scripts
+$env:GITHUB_TOKEN="your_token_here"
+.\test-local.ps1
+```
+
+**Linux/Mac:**
+```bash
+cd .github/scripts
+export GITHUB_TOKEN="your_token_here"
+./test-local.sh
+```
+
+### 🚀 Push Ke Baad Kya Hoga
+
+1. **Automatically Active** - Koi manual setup nahi chaiye
+2. **New Issues par chalega** - Jab bhi koi issue/PR banayega
+3. **Bot comment karega** - Agar duplicate mila
+4. **Label add karega** - `possible-duplicate` ya `duplicate`
+5. **Auto-close NAHI karega** - Safe default (manual review ke liye)
+
+### 🎛️ Agar Tune Karna Ho
+
+File edit karo: `.github/duplicate-detector-config.yml`
+
+**Zyada strict chahiye (kam false positives):**
+```yaml
+similarity_threshold: 0.80
+```
+
+**Zyada sensitive chahiye (zyada duplicates pakdo):**
+```yaml
+similarity_threshold: 0.70
+```
+
+**Auto-close enable karna ho:**
+```yaml
+auto_close_exact_match: true
+```
+
+### ⚠️ VS Code Mein Jo Errors Dikh Rahe Hain
+
+```
+Import "github" could not be resolved
+Import "sklearn" could not be resolved
+Import "yaml" could not be resolved
+```
+
+**Tension mat lo!** Ye normal hai because:
+- Ye packages aapke local system mein installed nahi hain
+- GitHub Actions workflow mein automatically install ho jayenge
+- Workflow perfectly kaam karega
+
+Ye **fake warnings** hain, **real errors nahi!**
+
+### 📝 Final Checklist
+
+- [x] Sab files sahi jagah hain
+- [x] Workflow syntax correct hai
+- [x] Python script tested hai
+- [x] Error handling comprehensive hai
+- [x] Documentation complete hai
+- [x] Edge cases handle hain
+- [x] CONTRIBUTING.md update ho gaya
+- [x] Test scripts bhi ready hain
+
+## 🎉 Confidence Level: 100% ✅
+
+**HAÃ, BILKUL READY HAI PUSH KARNE KE LIYE!**

Review Comment:
   The text “HAÃ” appears to be a character-encoding/typo issue (likely intended to be “Haan”). Please fix the garbled text so the Hindi summary renders correctly.
   ```suggestion
   **Haan, BILKUL READY HAI PUSH KARNE KE LIYE!**
   ```



##########
.github/scripts/requirements.txt:
##########
@@ -0,0 +1,5 @@
+PyGithub>=2.1.1
+scikit-learn>=1.3.0
+numpy>=1.24.0
+PyYAML>=6.0
+requests>=2.31.0

Review Comment:
   `requests` is listed as a dependency, but it isn’t used by the duplicate detection script. Consider removing it to reduce install time and the dependency surface area.
   ```suggestion
   
   ```



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import yaml
+
+# Configuration defaults
+DEFAULT_SIMILARITY_THRESHOLD = 0.75
+DEFAULT_HIGH_SIMILARITY_THRESHOLD = 0.90
+DEFAULT_MAX_ISSUES_TO_CHECK = 200
+DEFAULT_AUTO_CLOSE_EXACT_MATCH = False
+DEFAULT_LABEL_POSSIBLE_DUPLICATE = "possible-duplicate"
+DEFAULT_LABEL_EXACT_DUPLICATE = "duplicate"
+
+
+class DuplicateDetector:
+    """Detects duplicate issues and pull requests."""
+    
+    def __init__(self, token: str, repo_name: str, config_path: str = None):
+        try:
+            self.github = Github(token)
+            self.repo = self.github.get_repo(repo_name)
+            self.config = self.load_config(config_path)
+        except Exception as e:
+            print(f"Error initializing GitHub connection: {e}")
+            sys.exit(1)
+        
+    def load_config(self, config_path: str = None) -> Dict:
+        """Load configuration from YAML file or use defaults."""
+        default_config = {
+            'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+            'high_similarity_threshold': DEFAULT_HIGH_SIMILARITY_THRESHOLD,
+            'max_issues_to_check': DEFAULT_MAX_ISSUES_TO_CHECK,
+            'auto_close_exact_match': DEFAULT_AUTO_CLOSE_EXACT_MATCH,
+            'label_possible_duplicate': DEFAULT_LABEL_POSSIBLE_DUPLICATE,
+            'label_exact_duplicate': DEFAULT_LABEL_EXACT_DUPLICATE,
+            'exclude_labels': ['wontfix', 'invalid'],
+            'min_text_length': 20,
+        }
+        
+        if config_path and os.path.exists(config_path):
+            try:
+                with open(config_path, 'r') as f:
+                    user_config = yaml.safe_load(f)
+                    default_config.update(user_config)
+            except Exception as e:
+                print(f"Warning: Could not load config file: {e}")
+        
+        return default_config
+    
+    def preprocess_text(self, text: str) -> str:
+        """Preprocess text for comparison."""
+        if not text:
+            return ""
+        # Convert to lowercase and strip whitespace
+        text = text.lower().strip()
+        # Remove URLs
+        import re
+        text = re.sub(r'http\S+|www.\S+', '', text)
+        # Remove markdown code blocks
+        text = re.sub(r'```[\s\S]*?```', '', text)
+        # Remove special characters but keep spaces
+        text = re.sub(r'[^a-z0-9\s]', ' ', text)
+        # Remove extra whitespace
+        text = ' '.join(text.split())
+        return text
+    
+    def calculate_similarity(self, text1: str, text2: str) -> float:
+        """Calculate cosine similarity between two texts."""
+        if not text1 or not text2:
+            return 0.0
+        
+        try:
+            vectorizer = TfidfVectorizer(
+                min_df=1,
+                stop_words='english',
+                ngram_range=(1, 2)
+            )
+            tfidf_matrix = vectorizer.fit_transform([text1, text2])
+            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
+            return float(similarity)
+        except Exception as e:
+            print(f"Error calculating similarity: {e}")
+            return 0.0
+    
+    def find_similar_issues(self, current_number: int, current_title: str, 
+                           current_body: str, item_type: str = 'issue') -> List[Tuple[int, str, float]]:
+        """Find similar issues or PRs."""
+        current_text = self.preprocess_text(f"{current_title} {current_body}")
+        
+        if len(current_text) < self.config['min_text_length']:
+            print(f"Text too short for meaningful comparison: {len(current_text)} chars")
+            return []
+        
+        similar_items = []
+        
+        # Get existing items to compare against
+        if item_type == 'issue':
+            items = self.repo.get_issues(state='all')
+        else:
+            items = self.repo.get_pulls(state='all')
+        
+        checked_count = 0
+        
+        try:
+            for item in items:
+                if checked_count >= self.config['max_issues_to_check']:
+                    break
+                
+                # Skip the current item
+                if item.number == current_number:
+                    continue
+                
+                try:
+                    # Skip items with excluded labels
+                    item_labels = [label.name for label in item.labels]
+                    if any(label in self.config['exclude_labels'] for label in item_labels):
+                        continue
+                    
+                    # Calculate similarity
+                    item_text = self.preprocess_text(f"{item.title} {item.body or ''}")
+                    similarity = self.calculate_similarity(current_text, item_text)
+                    
+                    if similarity >= self.config['similarity_threshold']:
+                        similar_items.append((item.number, item.title, similarity))
+                except Exception as e:
+                    print(f"Warning: Error processing item #{item.number}: {e}")
+                    continue
+                
+                checked_count += 1
+        except Exception as e:
+            print(f"Error fetching items from repository: {e}")
+            print("This might be due to API rate limits or permissions issues.")
+        
+        # Sort by similarity (highest first)
+        similar_items.sort(key=lambda x: x[2], reverse=True)
+        return similar_items
+    
+    def add_label(self, item_number: int, label: str, item_type: str = 'issue'):
+        """Add a label to an issue or PR."""
+        try:
+            # Ensure label exists
+            try:
+                self.repo.get_label(label)
+            except:
+                # Create label if it doesn't exist
+                if label == self.config['label_possible_duplicate']:
+                    self.repo.create_label(label, "FFA500", "Potential duplicate issue")
+                elif label == self.config['label_exact_duplicate']:
+                    self.repo.create_label(label, "FF0000", "Exact duplicate issue")
+            
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)
+            
+            item.add_to_labels(label)
+            print(f"Added label '{label}' to {item_type} #{item_number}")
+        except Exception as e:
+            print(f"Error adding label: {e}")
+    
+    def add_comment(self, item_number: int, similar_items: List[Tuple[int, str, float]], 
+                   item_type: str = 'issue'):
+        """Add a comment about potential duplicates."""
+        if not similar_items:
+            return
+        
+        item_type_name = "issue" if item_type == 'issue' else "pull request"
+        
+        # Build comment message
+        comment = f"👋 **Potential Duplicate Detected**\n\n"
+        comment += f"This {item_type_name} appears to be similar to existing {item_type_name}s:\n\n"
+        
+        for number, title, similarity in similar_items[:5]:  # Show top 5
+            similarity_pct = int(similarity * 100)
+            comment += f"- #{number}: {title} (Similarity: {similarity_pct}%)\n"
+        
+        comment += f"\n---\n"
+        comment += f"Please review these {item_type_name}s to see if any of them address your concern. "
+        comment += f"If this is indeed a duplicate, please close this {item_type_name} and continue the discussion in the existing one.\n\n"
+        comment += f"If this is **not** a duplicate, please add more context to help differentiate it.\n\n"
+        comment += f"*This is an automated message. If you believe this is incorrect, please remove the label and mention a maintainer.*"
+        
+        try:
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)
+            
+            item.create_comment(comment)
+            print(f"Added duplicate detection comment to {item_type} #{item_number}")

Review Comment:
   For PRs, this uses `repo.get_pull()` and then calls `create_comment()`. PyGitHub commonly exposes PR comments via the issue comment API; consider using `repo.get_issue(pr_number).create_comment(...)` (or the appropriate PR issue-comment method) so commenting works for PRs.
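
   A minimal sketch of the suggested routing, assuming PyGithub's behavior that every pull request shares its number with an underlying issue; `FakeRepo`/`FakeIssue` are hypothetical in-memory stand-ins so the pattern runs without a token:

   ```python
   # Sketch: post a conversation comment that works for issues and PRs alike.
   # repo.get_issue(number) resolves pull requests too, because every PR has
   # a matching issue with the same number, so no get_pull() branch is needed.
   def post_duplicate_comment(repo, item_number, body):
       return repo.get_issue(item_number).create_comment(body)


   class FakeIssue:
       """Hypothetical in-memory stand-in for a PyGithub Issue."""
       def __init__(self):
           self.comments = []

       def create_comment(self, body):
           self.comments.append(body)
           return body


   class FakeRepo:
       """Hypothetical in-memory stand-in for a PyGithub Repository."""
       def __init__(self):
           self._issues = {}

       def get_issue(self, number):
           return self._issues.setdefault(number, FakeIssue())
   ```

   With this single code path, `add_comment` no longer needs its `item_type` branch at all.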



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np

Review Comment:
   `json` and `numpy` are imported but never used in this script. Please remove 
unused imports (and if `numpy` isn’t needed directly, avoid listing it as a 
direct dependency) to keep the script and dependency set minimal.
   ```suggestion
   from typing import List, Dict, Tuple
   from github import Github
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.metrics.pairwise import cosine_similarity
   ```



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import yaml
+
+# Configuration defaults
+DEFAULT_SIMILARITY_THRESHOLD = 0.75
+DEFAULT_HIGH_SIMILARITY_THRESHOLD = 0.90
+DEFAULT_MAX_ISSUES_TO_CHECK = 200
+DEFAULT_AUTO_CLOSE_EXACT_MATCH = False
+DEFAULT_LABEL_POSSIBLE_DUPLICATE = "possible-duplicate"
+DEFAULT_LABEL_EXACT_DUPLICATE = "duplicate"
+
+
+class DuplicateDetector:
+    """Detects duplicate issues and pull requests."""
+    
+    def __init__(self, token: str, repo_name: str, config_path: str = None):
+        try:
+            self.github = Github(token)
+            self.repo = self.github.get_repo(repo_name)
+            self.config = self.load_config(config_path)
+        except Exception as e:
+            print(f"Error initializing GitHub connection: {e}")
+            sys.exit(1)
+        
+    def load_config(self, config_path: str = None) -> Dict:
+        """Load configuration from YAML file or use defaults."""
+        default_config = {
+            'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+            'high_similarity_threshold': DEFAULT_HIGH_SIMILARITY_THRESHOLD,
+            'max_issues_to_check': DEFAULT_MAX_ISSUES_TO_CHECK,
+            'auto_close_exact_match': DEFAULT_AUTO_CLOSE_EXACT_MATCH,
+            'label_possible_duplicate': DEFAULT_LABEL_POSSIBLE_DUPLICATE,
+            'label_exact_duplicate': DEFAULT_LABEL_EXACT_DUPLICATE,
+            'exclude_labels': ['wontfix', 'invalid'],
+            'min_text_length': 20,
+        }
+        
+        if config_path and os.path.exists(config_path):
+            try:
+                with open(config_path, 'r') as f:
+                    user_config = yaml.safe_load(f)
+                    default_config.update(user_config)
+            except Exception as e:
+                print(f"Warning: Could not load config file: {e}")
+        
+        return default_config
+    
+    def preprocess_text(self, text: str) -> str:
+        """Preprocess text for comparison."""
+        if not text:
+            return ""
+        # Convert to lowercase and strip whitespace
+        text = text.lower().strip()
+        # Remove URLs
+        import re
+        text = re.sub(r'http\S+|www.\S+', '', text)
+        # Remove markdown code blocks
+        text = re.sub(r'```[\s\S]*?```', '', text)
+        # Remove special characters but keep spaces
+        text = re.sub(r'[^a-z0-9\s]', ' ', text)
+        # Remove extra whitespace
+        text = ' '.join(text.split())
+        return text
+    
+    def calculate_similarity(self, text1: str, text2: str) -> float:
+        """Calculate cosine similarity between two texts."""
+        if not text1 or not text2:
+            return 0.0
+        
+        try:
+            vectorizer = TfidfVectorizer(
+                min_df=1,
+                stop_words='english',
+                ngram_range=(1, 2)
+            )
+            tfidf_matrix = vectorizer.fit_transform([text1, text2])
+            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
+            return float(similarity)
+        except Exception as e:
+            print(f"Error calculating similarity: {e}")
+            return 0.0
+    
+    def find_similar_issues(self, current_number: int, current_title: str, 
+                           current_body: str, item_type: str = 'issue') -> List[Tuple[int, str, float]]:
+        """Find similar issues or PRs."""
+        current_text = self.preprocess_text(f"{current_title} {current_body}")
+        
+        if len(current_text) < self.config['min_text_length']:
+            print(f"Text too short for meaningful comparison: {len(current_text)} chars")

Review Comment:
   The similarity comparison only uses title+body 
(`current_title`/`current_body`). This doesn’t implement the PR description’s 
claim that labels are part of the comparison; either incorporate labels into 
the similarity input/scoring or update the documentation/PR description to 
match behavior.
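
   If the documented behavior is kept, one possible way to fold labels into the comparison input before TF-IDF vectorization; `build_comparison_text` and the `label_weight` knob are hypothetical additions, not part of the current script:

   ```python
   def build_comparison_text(title, body, labels, label_weight=2):
       # Repeat label names so they carry extra weight once the combined
       # string is vectorized; label_weight=1 treats labels like plain text.
       label_text = " ".join(list(labels) * label_weight)
       parts = [title, body or "", label_text]
       return " ".join(p for p in parts if p)
   ```

   The same helper would then feed both the current item and each candidate in `find_similar_issues`, so labels influence both sides of the cosine similarity.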



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import yaml
+
+# Configuration defaults
+DEFAULT_SIMILARITY_THRESHOLD = 0.75
+DEFAULT_HIGH_SIMILARITY_THRESHOLD = 0.90
+DEFAULT_MAX_ISSUES_TO_CHECK = 200
+DEFAULT_AUTO_CLOSE_EXACT_MATCH = False
+DEFAULT_LABEL_POSSIBLE_DUPLICATE = "possible-duplicate"
+DEFAULT_LABEL_EXACT_DUPLICATE = "duplicate"
+
+
+class DuplicateDetector:
+    """Detects duplicate issues and pull requests."""
+    
+    def __init__(self, token: str, repo_name: str, config_path: str = None):
+        try:
+            self.github = Github(token)
+            self.repo = self.github.get_repo(repo_name)
+            self.config = self.load_config(config_path)
+        except Exception as e:
+            print(f"Error initializing GitHub connection: {e}")
+            sys.exit(1)
+        
+    def load_config(self, config_path: str = None) -> Dict:
+        """Load configuration from YAML file or use defaults."""
+        default_config = {
+            'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+            'high_similarity_threshold': DEFAULT_HIGH_SIMILARITY_THRESHOLD,
+            'max_issues_to_check': DEFAULT_MAX_ISSUES_TO_CHECK,
+            'auto_close_exact_match': DEFAULT_AUTO_CLOSE_EXACT_MATCH,
+            'label_possible_duplicate': DEFAULT_LABEL_POSSIBLE_DUPLICATE,
+            'label_exact_duplicate': DEFAULT_LABEL_EXACT_DUPLICATE,
+            'exclude_labels': ['wontfix', 'invalid'],
+            'min_text_length': 20,
+        }
+        
+        if config_path and os.path.exists(config_path):
+            try:
+                with open(config_path, 'r') as f:
+                    user_config = yaml.safe_load(f)
+                    default_config.update(user_config)
+            except Exception as e:
+                print(f"Warning: Could not load config file: {e}")
+        
+        return default_config
+    
+    def preprocess_text(self, text: str) -> str:
+        """Preprocess text for comparison."""
+        if not text:
+            return ""
+        # Convert to lowercase and strip whitespace
+        text = text.lower().strip()
+        # Remove URLs
+        import re
+        text = re.sub(r'http\S+|www.\S+', '', text)
+        # Remove markdown code blocks
+        text = re.sub(r'```[\s\S]*?```', '', text)
+        # Remove special characters but keep spaces
+        text = re.sub(r'[^a-z0-9\s]', ' ', text)
+        # Remove extra whitespace
+        text = ' '.join(text.split())
+        return text
+    
+    def calculate_similarity(self, text1: str, text2: str) -> float:
+        """Calculate cosine similarity between two texts."""
+        if not text1 or not text2:
+            return 0.0
+        
+        try:
+            vectorizer = TfidfVectorizer(
+                min_df=1,
+                stop_words='english',
+                ngram_range=(1, 2)
+            )
+            tfidf_matrix = vectorizer.fit_transform([text1, text2])
+            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
+            return float(similarity)
+        except Exception as e:
+            print(f"Error calculating similarity: {e}")
+            return 0.0
+    
+    def find_similar_issues(self, current_number: int, current_title: str, 
+                           current_body: str, item_type: str = 'issue') -> List[Tuple[int, str, float]]:
+        """Find similar issues or PRs."""
+        current_text = self.preprocess_text(f"{current_title} {current_body}")
+        
+        if len(current_text) < self.config['min_text_length']:
+            print(f"Text too short for meaningful comparison: {len(current_text)} chars")
+            return []
+        
+        similar_items = []
+        
+        # Get existing items to compare against
+        if item_type == 'issue':
+            items = self.repo.get_issues(state='all')
+        else:
+            items = self.repo.get_pulls(state='all')
+        
+        checked_count = 0
+        
+        try:
+            for item in items:
+                if checked_count >= self.config['max_issues_to_check']:
+                    break
+                
+                # Skip the current item
+                if item.number == current_number:
+                    continue
+                
+                try:
+                    # Skip items with excluded labels
+                    item_labels = [label.name for label in item.labels]
+                    if any(label in self.config['exclude_labels'] for label in item_labels):
+                        continue
+                    
+                    # Calculate similarity
+                    item_text = self.preprocess_text(f"{item.title} {item.body or ''}")
+                    similarity = self.calculate_similarity(current_text, item_text)
+                    
+                    if similarity >= self.config['similarity_threshold']:
+                        similar_items.append((item.number, item.title, 
similarity))
+                except Exception as e:
+                    print(f"Warning: Error processing item #{item.number}: {e}")
+                    continue
+                
+                checked_count += 1
+        except Exception as e:
+            print(f"Error fetching items from repository: {e}")
+            print("This might be due to API rate limits or permissions issues.")
+        
+        # Sort by similarity (highest first)
+        similar_items.sort(key=lambda x: x[2], reverse=True)
+        return similar_items
+    
+    def add_label(self, item_number: int, label: str, item_type: str = 'issue'):
+        """Add a label to an issue or PR."""
+        try:
+            # Ensure label exists
+            try:
+                self.repo.get_label(label)
+            except:
+                # Create label if it doesn't exist
+                if label == self.config['label_possible_duplicate']:
+                    self.repo.create_label(label, "FFA500", "Potential duplicate issue")
+                elif label == self.config['label_exact_duplicate']:
+                    self.repo.create_label(label, "FF0000", "Exact duplicate issue")

Review Comment:
   Avoid bare `except:` here; it can mask unrelated errors (e.g., permission 
issues, network failures) and make debugging harder. Catch `Exception` (or the 
specific PyGitHub exception type) and handle only the "label not found" case 
explicitly.
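
   A sketch of the narrower handling. In PyGithub the 404 case surfaces as `github.UnknownObjectException`; it is stubbed locally below (along with a hypothetical `FakeRepo`) so the pattern runs without network access:

   ```python
   class UnknownObjectException(Exception):
       """Local stand-in for github.UnknownObjectException (raised on 404s)."""


   def ensure_label(repo, name, color, description):
       # Handle only the "label not found" case; permission errors, rate
       # limits, and network failures propagate instead of being swallowed.
       try:
           return repo.get_label(name)
       except UnknownObjectException:
           return repo.create_label(name, color, description)


   class FakeRepo:
       """Hypothetical in-memory stand-in for a PyGithub Repository."""
       def __init__(self):
           self.labels = {}

       def get_label(self, name):
           if name not in self.labels:
               raise UnknownObjectException(name)
           return self.labels[name]

       def create_label(self, name, color, description):
           self.labels[name] = (name, color, description)
           return self.labels[name]
   ```

   In the real script the import would be `from github import UnknownObjectException`, replacing the bare `except:` in `add_label`.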



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import yaml
+
+# Configuration defaults
+DEFAULT_SIMILARITY_THRESHOLD = 0.75
+DEFAULT_HIGH_SIMILARITY_THRESHOLD = 0.90
+DEFAULT_MAX_ISSUES_TO_CHECK = 200
+DEFAULT_AUTO_CLOSE_EXACT_MATCH = False
+DEFAULT_LABEL_POSSIBLE_DUPLICATE = "possible-duplicate"
+DEFAULT_LABEL_EXACT_DUPLICATE = "duplicate"
+
+
+class DuplicateDetector:
+    """Detects duplicate issues and pull requests."""
+    
+    def __init__(self, token: str, repo_name: str, config_path: str = None):
+        try:
+            self.github = Github(token)
+            self.repo = self.github.get_repo(repo_name)
+            self.config = self.load_config(config_path)
+        except Exception as e:
+            print(f"Error initializing GitHub connection: {e}")
+            sys.exit(1)
+        
+    def load_config(self, config_path: str = None) -> Dict:
+        """Load configuration from YAML file or use defaults."""
+        default_config = {
+            'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+            'high_similarity_threshold': DEFAULT_HIGH_SIMILARITY_THRESHOLD,
+            'max_issues_to_check': DEFAULT_MAX_ISSUES_TO_CHECK,
+            'auto_close_exact_match': DEFAULT_AUTO_CLOSE_EXACT_MATCH,
+            'label_possible_duplicate': DEFAULT_LABEL_POSSIBLE_DUPLICATE,
+            'label_exact_duplicate': DEFAULT_LABEL_EXACT_DUPLICATE,
+            'exclude_labels': ['wontfix', 'invalid'],
+            'min_text_length': 20,
+        }
+        
+        if config_path and os.path.exists(config_path):
+            try:
+                with open(config_path, 'r') as f:
+                    user_config = yaml.safe_load(f)
+                    default_config.update(user_config)
+            except Exception as e:
+                print(f"Warning: Could not load config file: {e}")
+        
+        return default_config
+    
+    def preprocess_text(self, text: str) -> str:
+        """Preprocess text for comparison."""
+        if not text:
+            return ""
+        # Convert to lowercase and strip whitespace
+        text = text.lower().strip()
+        # Remove URLs
+        import re
+        text = re.sub(r'http\S+|www.\S+', '', text)
+        # Remove markdown code blocks
+        text = re.sub(r'```[\s\S]*?```', '', text)
+        # Remove special characters but keep spaces
+        text = re.sub(r'[^a-z0-9\s]', ' ', text)
+        # Remove extra whitespace
+        text = ' '.join(text.split())
+        return text
+    
+    def calculate_similarity(self, text1: str, text2: str) -> float:
+        """Calculate cosine similarity between two texts."""
+        if not text1 or not text2:
+            return 0.0
+        
+        try:
+            vectorizer = TfidfVectorizer(
+                min_df=1,
+                stop_words='english',
+                ngram_range=(1, 2)
+            )
+            tfidf_matrix = vectorizer.fit_transform([text1, text2])
+            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
+            return float(similarity)
+        except Exception as e:
+            print(f"Error calculating similarity: {e}")
+            return 0.0
+    
+    def find_similar_issues(self, current_number: int, current_title: str, 
+                           current_body: str, item_type: str = 'issue') -> List[Tuple[int, str, float]]:
+        """Find similar issues or PRs."""
+        current_text = self.preprocess_text(f"{current_title} {current_body}")
+        
+        if len(current_text) < self.config['min_text_length']:
+            print(f"Text too short for meaningful comparison: {len(current_text)} chars")
+            return []
+        
+        similar_items = []
+        
+        # Get existing items to compare against
+        if item_type == 'issue':
+            items = self.repo.get_issues(state='all')
+        else:
+            items = self.repo.get_pulls(state='all')
+        
+        checked_count = 0
+        
+        try:
+            for item in items:
+                if checked_count >= self.config['max_issues_to_check']:
+                    break
+                
+                # Skip the current item
+                if item.number == current_number:
+                    continue
+                
+                try:
+                    # Skip items with excluded labels
+                    item_labels = [label.name for label in item.labels]
+                    if any(label in self.config['exclude_labels'] for label in item_labels):
+                        continue
+                    
+                    # Calculate similarity
+                    item_text = self.preprocess_text(f"{item.title} {item.body or ''}")
+                    similarity = self.calculate_similarity(current_text, item_text)
+                    
+                    if similarity >= self.config['similarity_threshold']:
+                        similar_items.append((item.number, item.title, 
similarity))
+                except Exception as e:
+                    print(f"Warning: Error processing item #{item.number}: {e}")
+                    continue
+                
+                checked_count += 1
+        except Exception as e:
+            print("This might be due to API rate limits or permissions issues.")
issues.")
+        
+        # Sort by similarity (highest first)
+        similar_items.sort(key=lambda x: x[2], reverse=True)
+        return similar_items
+    
+    def add_label(self, item_number: int, label: str, item_type: str = 'issue'):
+        """Add a label to an issue or PR."""
+        try:
+            # Ensure label exists
+            try:
+                self.repo.get_label(label)
+            except:
+                # Create label if it doesn't exist
+                if label == self.config['label_possible_duplicate']:
+                    self.repo.create_label(label, "FFA500", "Potential duplicate issue")
+                elif label == self.config['label_exact_duplicate']:
+                    self.repo.create_label(label, "FF0000", "Exact duplicate issue")
+            
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)
+            
+            item.add_to_labels(label)
+            print(f"Added label '{label}' to {item_type} #{item_number}")
+        except Exception as e:
+            print(f"Error adding label: {e}")
+    
+    def add_comment(self, item_number: int, similar_items: List[Tuple[int, str, float]], 
+                   item_type: str = 'issue'):
+        """Add a comment about potential duplicates."""
+        if not similar_items:
+            return
+        
+        item_type_name = "issue" if item_type == 'issue' else "pull request"
+        
+        # Build comment message
+        comment = f"👋 **Potential Duplicate Detected**\n\n"
+        comment += f"This {item_type_name} appears to be similar to existing {item_type_name}s:\n\n"
+        
+        for number, title, similarity in similar_items[:5]:  # Show top 5
+            similarity_pct = int(similarity * 100)
+            comment += f"- #{number}: {title} (Similarity: {similarity_pct}%)\n"
+        

Review Comment:
   The config file defines `max_similar_to_show`, but the script hard-codes 
showing the top 5 similar items. Please read this value from config so behavior 
matches configuration/documentation.
   ```suggestion
           max_to_show = int(self.config.get('max_similar_to_show', 5))
        for number, title, similarity in similar_items[:max_to_show]:  # Show configured number (default 5)
            similarity_pct = int(similarity * 100)
            comment += f"- #{number}: {title} (Similarity: {similarity_pct}%)\n"
   ```



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+Duplicate Issue and Pull Request Detection Script
+Detects potential duplicate issues and PRs using text similarity analysis.
+"""
+
+import os
+import sys
+import argparse
+import json
+from typing import List, Dict, Tuple
+from github import Github
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import yaml
+
+# Configuration defaults
+DEFAULT_SIMILARITY_THRESHOLD = 0.75
+DEFAULT_HIGH_SIMILARITY_THRESHOLD = 0.90
+DEFAULT_MAX_ISSUES_TO_CHECK = 200
+DEFAULT_AUTO_CLOSE_EXACT_MATCH = False
+DEFAULT_LABEL_POSSIBLE_DUPLICATE = "possible-duplicate"
+DEFAULT_LABEL_EXACT_DUPLICATE = "duplicate"
+
+
+class DuplicateDetector:
+    """Detects duplicate issues and pull requests."""
+    
+    def __init__(self, token: str, repo_name: str, config_path: str = None):
+        try:
+            self.github = Github(token)
+            self.repo = self.github.get_repo(repo_name)
+            self.config = self.load_config(config_path)
+        except Exception as e:
+            print(f"Error initializing GitHub connection: {e}")
+            sys.exit(1)
+        
+    def load_config(self, config_path: str = None) -> Dict:
+        """Load configuration from YAML file or use defaults."""
+        default_config = {
+            'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+            'high_similarity_threshold': DEFAULT_HIGH_SIMILARITY_THRESHOLD,
+            'max_issues_to_check': DEFAULT_MAX_ISSUES_TO_CHECK,
+            'auto_close_exact_match': DEFAULT_AUTO_CLOSE_EXACT_MATCH,
+            'label_possible_duplicate': DEFAULT_LABEL_POSSIBLE_DUPLICATE,
+            'label_exact_duplicate': DEFAULT_LABEL_EXACT_DUPLICATE,
+            'exclude_labels': ['wontfix', 'invalid'],
+            'min_text_length': 20,
+        }
+        
+        if config_path and os.path.exists(config_path):
+            try:
+                with open(config_path, 'r') as f:
+                    user_config = yaml.safe_load(f)
+                    default_config.update(user_config)
+            except Exception as e:
+                print(f"Warning: Could not load config file: {e}")
+        
+        return default_config
+    
+    def preprocess_text(self, text: str) -> str:
+        """Preprocess text for comparison."""
+        if not text:
+            return ""
+        # Convert to lowercase and strip whitespace
+        text = text.lower().strip()
+        # Remove URLs
+        import re
+        text = re.sub(r'http\S+|www.\S+', '', text)
+        # Remove markdown code blocks
+        text = re.sub(r'```[\s\S]*?```', '', text)
+        # Remove special characters but keep spaces
+        text = re.sub(r'[^a-z0-9\s]', ' ', text)
+        # Remove extra whitespace
+        text = ' '.join(text.split())
+        return text
+    
+    def calculate_similarity(self, text1: str, text2: str) -> float:
+        """Calculate cosine similarity between two texts."""
+        if not text1 or not text2:
+            return 0.0
+        
+        try:
+            vectorizer = TfidfVectorizer(
+                min_df=1,
+                stop_words='english',
+                ngram_range=(1, 2)
+            )
+            tfidf_matrix = vectorizer.fit_transform([text1, text2])
+            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
+            return float(similarity)
+        except Exception as e:
+            print(f"Error calculating similarity: {e}")
+            return 0.0
+    
+    def add_label(self, item_number: int, label: str, item_type: str = 'issue'):
+        """Add a label to an issue or PR."""
+        try:
+            # Ensure label exists
+            try:
+                self.repo.get_label(label)
+            except:
+                # Create label if it doesn't exist
+                if label == self.config['label_possible_duplicate']:
+                    self.repo.create_label(label, "FFA500", "Potential duplicate issue")
+                elif label == self.config['label_exact_duplicate']:
+                    self.repo.create_label(label, "FF0000", "Exact duplicate issue")
+            
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)

Review Comment:
   For PRs, this uses `repo.get_pull()` and then calls `add_to_labels()`. In 
the GitHub API (and PyGitHub), labeling a PR is typically done through the 
Issue API; consider fetching the underlying issue (`repo.get_issue(pr_number)`) 
when adding labels so PR labeling works reliably.
   ```suggestion
                    # For pull requests, use the underlying issue to manage labels
                   item = self.repo.get_issue(item_number)
   ```



##########
.github/scripts/detect-duplicates.py:
##########
@@ -0,0 +1,315 @@
+    def add_comment(self, item_number: int, similar_items: List[Tuple[int, str, float]], 
+                   item_type: str = 'issue'):
+        """Add a comment about potential duplicates."""
+        if not similar_items:
+            return
+        
+        item_type_name = "issue" if item_type == 'issue' else "pull request"
+        
+        # Build comment message
+        comment = f"👋 **Potential Duplicate Detected**\n\n"
+        comment += f"This {item_type_name} appears to be similar to existing {item_type_name}s:\n\n"
+        
+        for number, title, similarity in similar_items[:5]:  # Show top 5
+            similarity_pct = int(similarity * 100)
+            comment += f"- #{number}: {title} (Similarity: {similarity_pct}%)\n"
+        
+        comment += f"\n---\n"
+        comment += f"Please review these {item_type_name}s to see if any of them address your concern. "
+        comment += f"If this is indeed a duplicate, please close this {item_type_name} and continue the discussion in the existing one.\n\n"
+        comment += f"If this is **not** a duplicate, please add more context to help differentiate it.\n\n"
+        comment += f"*This is an automated message. If you believe this is incorrect, please remove the label and mention a maintainer.*"
+        
+        try:
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)
+            
+            item.create_comment(comment)
+            print(f"Added duplicate detection comment to {item_type} #{item_number}")
+        except Exception as e:
+            print(f"Error adding comment: {e}")
+    
+    def close_item(self, item_number: int, duplicate_of: int, item_type: str = 'issue'):
+        """Close an item as a duplicate."""
+        try:
+            if item_type == 'issue':
+                item = self.repo.get_issue(item_number)
+            else:
+                item = self.repo.get_pull(item_number)
+            
+            comment = f"🔒 **Closing as Exact Duplicate**\n\n"
+            comment += f"This {item_type} is an exact duplicate of #{duplicate_of}.\n\n"
+            comment += f"Please continue the discussion in #{duplicate_of}."
+            
+            item.create_comment(comment)
+            item.edit(state='closed')
+            print(f"Closed {item_type} #{item_number} as duplicate of #{duplicate_of}")

Review Comment:
   For PRs, this uses `repo.get_pull()` and then calls `create_comment()` 
before closing. As with labeling/commenting elsewhere, consider using the 
underlying issue API for PR comments (or the dedicated PR issue-comment method) 
so the close message is posted reliably.
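   The suggested change can be sketched as below. This is a minimal illustration, not the PR's code: `close_as_duplicate` and the stub classes are hypothetical names, and a real run would pass a PyGithub `Repository` (whose `get_issue()` returns an `Issue` supporting `create_comment()` and `edit(state=...)`) instead of the stub used here to exercise the call pattern offline.

   ```python
   # Sketch: route the duplicate-close comment through the Issue API for both
   # issues and PRs, since PR comments/labels of this kind go through the
   # Issues endpoints in the GitHub REST API.

   def close_as_duplicate(repo, item_number, duplicate_of, item_type="issue"):
       # Always fetch the underlying issue, even for pull requests.
       item = repo.get_issue(item_number)
       comment = (
           "🔒 **Closing as Exact Duplicate**\n\n"
           f"This {item_type} is an exact duplicate of #{duplicate_of}.\n\n"
           f"Please continue the discussion in #{duplicate_of}."
       )
       item.create_comment(comment)
       item.edit(state="closed")
       return item


   # Minimal stand-ins so the call pattern can be checked without network access.
   class _StubIssue:
       def __init__(self, number):
           self.number = number
           self.comments = []
           self.state = "open"

       def create_comment(self, body):
           self.comments.append(body)

       def edit(self, state):
           self.state = state


   class _StubRepo:
       def __init__(self):
           self._issues = {}

       def get_issue(self, number):
           return self._issues.setdefault(number, _StubIssue(number))


   repo = _StubRepo()
   closed = close_as_duplicate(repo, 42, 7, item_type="pull request")
   print(closed.state)          # closed
   print(len(closed.comments))  # 1
   ```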



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

