Re: [PR] feat(llm): (BREAKING CHANGE) update keyword extraction method [incubator-hugegraph-ai]

via GitHub Fri, 17 Oct 2025 21:28:32 -0700


imbajin commented on PR #282:
URL: 
https://github.com/apache/incubator-hugegraph-ai/pull/282#issuecomment-3346303714


   ## Detailed Code Review - Additional Comments
   
   ### Critical Issue Details
   
   #### 1. Division by Zero in TextRank PageRank Normalization
   In `textrank_word_extract.py`, line 129-131:
   ```python
   pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, 
weights='weight')
   if max(pagerank_scores) > 0:
       pagerank_scores = [scores/max(pagerank_scores) for scores in 
pagerank_scores]
   ```
   
   **Problem**: The code already checks for `max(pagerank_scores) > 0`, which 
is good, but doesn't handle the else case properly.
   
   **Suggested Fix**:
   ```python
   pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, 
weights='weight')
   if pagerank_scores and max(pagerank_scores) > 0:
       max_score = max(pagerank_scores)
       pagerank_scores = [score/max_score for score in pagerank_scores]
   else:
       pagerank_scores = [0.0] * len(pagerank_scores) if pagerank_scores else []
   ```
   
   #### 2. Breaking Changes in graph_rag_task.py
   
   The removal of `language` and `max_keywords` parameters needs careful 
handling:
   
   **Current problematic change** (lines 57-82):
   ```python
   # Before:
   def extract_keywords(self, text=None, max_keywords=5, language="english", 
extract_template=None):
   
   # After: 
   def extract_keywords(self, text=None, extract_template=None):
   ```
   
   **Suggested backward-compatible implementation**:
   ```python
   def extract_keywords(
       self,
       text: Optional[str] = None,
       extract_template: Optional[str] = None,
       max_keywords: Optional[int] = None,  # Deprecated
       language: Optional[str] = None,  # Deprecated
       **kwargs
   ):
       """
       Extract keywords from text.
       
       Args:
           text: Text to extract keywords from
           extract_template: Template for extraction
           max_keywords: [DEPRECATED] Use context['max_keywords'] instead
           language: [DEPRECATED] Use llm_settings.language instead
       """
       import warnings
       
       if max_keywords is not None:
           warnings.warn(
               "max_keywords parameter is deprecated and will be removed in 
v2.0. "
               "Please pass it via context dictionary instead.",
               DeprecationWarning,
               stacklevel=2
           )
           # Store in context for backward compatibility
           self._context['max_keywords'] = max_keywords
       
       if language is not None:
           warnings.warn(
               "language parameter is deprecated and will be removed in v2.0. "
               "Please configure it via llm_settings instead.",
               DeprecationWarning,
               stacklevel=2
           )
           # Store for backward compatibility
           self._context['language'] = language
           
       self._operators.append(
           KeywordExtract(
               text=text,
               extract_template=extract_template,
               **kwargs
           )
       )
       return self
   ```
   
   #### 3. Regex Injection Prevention
   
   In `textrank_word_extract.py`, the rules list should be immutable and 
validated:
   
   **Current vulnerable code**:
   ```python
   self.rules = [r"https?://\\S+|www\\.\\S+",
                 r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
                 r"\\b\\w+(?:[-'\\']\\w+)+\\b",
                 r"\\b\\d+[,.]\\d+\\b"]
   ```
   
   **Suggested secure implementation**:
   ```python
   import re
   
   class MultiLingualTextRank:
       # Make rules immutable
       _SAFE_RULES = tuple([
           r"https?://\\S+|www\\.\\S+",
           r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
           r"\\b\\w+(?:[-'\\']\\w+)+\\b",
           r"\\b\\d+[,.]\\d+\\b"
       ])
       
       def __init__(self, keyword_num: int = 5, window_size: int = 3):
           # Validate and compile patterns at initialization
           try:
               self._compiled_rules = re.compile('|'.join(self._SAFE_RULES), 
re.IGNORECASE)
           except re.error as e:
               raise ValueError(f"Invalid regex pattern: {e}")
   ```
   
   #### 4. NLTK Error Handling Enhancement
   
   In `nltk_helper.py`, add more comprehensive error handling:
   
   ```python
   def check_nltk_data(self):
       """Check and download required NLTK data packages."""
       import socket
       
       # Set timeout for network operations
       original_timeout = socket.getdefaulttimeout()
       socket.setdefaulttimeout(30)
       
       try:
           # ... existing code ...
           for package in required_packages:
               try:
                   # ... existing check ...
               except LookupError:
                   max_retries = 3
                   for attempt in range(max_retries):
                       try:
                           log.info("Downloading NLTK package %s (attempt 
%d/%d)", 
                                   package, attempt + 1, max_retries)
                           nltk.download(package, download_dir=nltk_data_dir, 
quiet=False)
                           break
                       except (URLError, HTTPError, PermissionError, 
socket.timeout) as e:
                           if attempt == max_retries - 1:
                               log.error("Failed to download %s after %d 
attempts: %s", 
                                       package, max_retries, e)
                           else:
                               time.sleep(2 ** attempt)  # Exponential backoff
       finally:
           socket.setdefaulttimeout(original_timeout)
   ```
   
   ### Testing Requirements
   
   The PR is missing test coverage for:
   
   1. **TextRank algorithm tests**:
      - Test with empty text
      - Test with single word
      - Test with non-ASCII characters
      - Test window size edge cases
      - Test PageRank score normalization
   
   2. **Hybrid extraction tests**:
      - Test weight balancing
      - Test fallback when one method fails
      - Test score combination logic
   
   3. **Migration tests**:
      - Test deprecated parameter warnings
      - Test backward compatibility
   
   Example test structure:
   ```python
   def test_textrank_empty_input():
       extractor = MultiLingualTextRank()
       result = extractor.extract_keywords("")
       assert result == {}
   
   def test_textrank_division_by_zero():
       extractor = MultiLingualTextRank()
       # Create scenario where all PageRank scores are 0
       result = extractor.extract_keywords("a a a")  
       assert all(score >= 0 for score in result.values())
   
   def test_deprecated_parameters():
       with warnings.catch_warnings(record=True) as w:
           pipeline = RAGPipeline()
           pipeline.extract_keywords("test", max_keywords=5)
           assert len(w) == 1
           assert "deprecated" in str(w[0].message).lower()
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(llm): (BREAKING CHANGE) update keyword extraction method [incubator-hugegraph-ai]

Reply via email to