imbajin commented on PR #282:
URL:
https://github.com/apache/incubator-hugegraph-ai/pull/282#issuecomment-3346303714
## Detailed Code Review - Additional Comments
### Critical Issue Details
#### 1. Division by Zero in TextRank PageRank Normalization
In `textrank_word_extract.py`, line 129-131:
```python
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85,
weights='weight')
if max(pagerank_scores) > 0:
pagerank_scores = [scores/max(pagerank_scores) for scores in
pagerank_scores]
```
**Problem**: The code already checks for `max(pagerank_scores) > 0`, which
is good, but doesn't handle the else case properly.
**Suggested Fix**:
```python
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85,
weights='weight')
if pagerank_scores and max(pagerank_scores) > 0:
max_score = max(pagerank_scores)
pagerank_scores = [score/max_score for score in pagerank_scores]
else:
pagerank_scores = [0.0] * len(pagerank_scores) if pagerank_scores else []
```
#### 2. Breaking Changes in graph_rag_task.py
The removal of `language` and `max_keywords` parameters needs careful
handling:
**Current problematic change** (lines 57-82):
```python
# Before:
def extract_keywords(self, text=None, max_keywords=5, language="english",
extract_template=None):
# After:
def extract_keywords(self, text=None, extract_template=None):
```
**Suggested backward-compatible implementation**:
```python
def extract_keywords(
self,
text: Optional[str] = None,
extract_template: Optional[str] = None,
max_keywords: Optional[int] = None, # Deprecated
language: Optional[str] = None, # Deprecated
**kwargs
):
"""
Extract keywords from text.
Args:
text: Text to extract keywords from
extract_template: Template for extraction
max_keywords: [DEPRECATED] Use context['max_keywords'] instead
language: [DEPRECATED] Use llm_settings.language instead
"""
import warnings
if max_keywords is not None:
warnings.warn(
"max_keywords parameter is deprecated and will be removed in
v2.0. "
"Please pass it via context dictionary instead.",
DeprecationWarning,
stacklevel=2
)
# Store in context for backward compatibility
self._context['max_keywords'] = max_keywords
if language is not None:
warnings.warn(
"language parameter is deprecated and will be removed in v2.0. "
"Please configure it via llm_settings instead.",
DeprecationWarning,
stacklevel=2
)
# Store for backward compatibility
self._context['language'] = language
self._operators.append(
KeywordExtract(
text=text,
extract_template=extract_template,
**kwargs
)
)
return self
```
#### 3. Regex Injection Prevention
In `textrank_word_extract.py`, the rules list should be immutable and
validated:
**Current vulnerable code**:
```python
self.rules = [r"https?://\\S+|www\\.\\S+",
r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
r"\\b\\w+(?:[-'\\']\\w+)+\\b",
r"\\b\\d+[,.]\\d+\\b"]
```
**Suggested secure implementation**:
```python
import re
class MultiLingualTextRank:
# Make rules immutable
_SAFE_RULES = tuple([
r"https?://\\S+|www\\.\\S+",
r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
r"\\b\\w+(?:[-'\\']\\w+)+\\b",
r"\\b\\d+[,.]\\d+\\b"
])
def __init__(self, keyword_num: int = 5, window_size: int = 3):
# Validate and compile patterns at initialization
try:
self._compiled_rules = re.compile('|'.join(self._SAFE_RULES),
re.IGNORECASE)
except re.error as e:
raise ValueError(f"Invalid regex pattern: {e}")
```
#### 4. NLTK Error Handling Enhancement
In `nltk_helper.py`, add more comprehensive error handling:
```python
def check_nltk_data(self):
"""Check and download required NLTK data packages."""
import socket
# Set timeout for network operations
original_timeout = socket.getdefaulttimeout()
socket.setdefaulttimeout(30)
try:
# ... existing code ...
for package in required_packages:
try:
# ... existing check ...
except LookupError:
max_retries = 3
for attempt in range(max_retries):
try:
log.info("Downloading NLTK package %s (attempt
%d/%d)",
package, attempt + 1, max_retries)
nltk.download(package, download_dir=nltk_data_dir,
quiet=False)
break
except (URLError, HTTPError, PermissionError,
socket.timeout) as e:
if attempt == max_retries - 1:
log.error("Failed to download %s after %d
attempts: %s",
package, max_retries, e)
else:
time.sleep(2 ** attempt) # Exponential backoff
finally:
socket.setdefaulttimeout(original_timeout)
```
### Testing Requirements
The PR is missing test coverage for:
1. **TextRank algorithm tests**:
- Test with empty text
- Test with single word
- Test with non-ASCII characters
- Test window size edge cases
- Test PageRank score normalization
2. **Hybrid extraction tests**:
- Test weight balancing
- Test fallback when one method fails
- Test score combination logic
3. **Migration tests**:
- Test deprecated parameter warnings
- Test backward compatibility
Example test structure:
```python
def test_textrank_empty_input():
extractor = MultiLingualTextRank()
result = extractor.extract_keywords("")
assert result == {}
def test_textrank_division_by_zero():
extractor = MultiLingualTextRank()
# Create scenario where all PageRank scores are 0
result = extractor.extract_keywords("a a a")
assert all(score >= 0 for score in result.values())
def test_deprecated_parameters():
with warnings.catch_warnings(record=True) as w:
pipeline = RAGPipeline()
pipeline.extract_keywords("test", max_keywords=5)
assert len(w) == 1
assert "deprecated" in str(w[0].message).lower()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]