LRriver opened a new issue, #345: URL: https://github.com/apache/hugegraph-ai/issues/345
### Search before asking

- [x] I have searched the [feature](https://github.com/apache/hugegraph-ai/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature request.

### Feature Description

The demo upload path supports `.txt` and `.docx`, but PDF files currently raise an error. Please add PDF text extraction support for uploaded documents used by both vector index building and graph extraction.

## Current verification

- `read_documents()` handles `.txt` and `.docx`.
- `.pdf` currently hits a TODO and raises `gr.Error("PDF will be supported later! Try to upload text/docx now")` in `hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py:48`.
- The same `read_documents()` helper is used by both `build_vector_index()` and `extract_graph()`.
- Current project dependencies include `python-docx`, but no PDF parser dependency was found in the checked files.

## Suggested scope

- Choose a maintained PDF text extraction dependency and add it through the existing `uv` dependency setup.
- Extract text page by page and join it in a stable order.
- Return a clear error for encrypted, scanned-image-only, or unreadable PDFs.
- Keep `.txt` and `.docx` behavior unchanged.
- Update UI copy that currently says uploads should be TXT or DOCX only.

## Mermaid reference

```mermaid
flowchart LR
    Upload[Uploaded files] --> Detect{Extension}
    Detect -->|.txt| Txt[Read UTF-8 text]
    Detect -->|.docx| Docx[Read paragraphs]
    Detect -->|.pdf| Pdf[Extract page text]
    Pdf --> Validate{Text found?}
    Validate -->|yes| Texts[Append document text]
    Validate -->|no| Error[Readable Gradio error]
    Txt --> Texts
    Docx --> Texts
    Texts --> Consumers[Vector index and graph extraction]
```

## Acceptance criteria

- Uploading a text-based PDF no longer raises the current TODO error.
- Extracted PDF text can be used by both vector index import and graph extraction.
- Unreadable PDFs produce a user-facing error that explains the limitation.
- Existing TXT and DOCX upload tests still pass.

## Suggested tests

- Unit test for `read_documents()` with a small text-based PDF fixture.
- Unit test for unsupported or unreadable PDF behavior.
- Regression test that TXT and DOCX paths still work.

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
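
The page-by-page extraction and the "Text found?" check described in the issue could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not the project's implementation: it assumes `pypdf` as the parser (the issue does not name a library), and `UnreadablePdfError`, `join_page_texts`, and `extract_pdf_pages` are hypothetical names. The caller inside `read_documents()` would presumably convert the error into a `gr.Error` for the UI.

```python
from typing import Iterable, List, Optional


class UnreadablePdfError(ValueError):
    """Hypothetical error type; the demo could re-raise it as gr.Error."""


def join_page_texts(pages: Iterable[Optional[str]], name: str = "document.pdf") -> str:
    """Join per-page text in page order; reject PDFs with no extractable text."""
    cleaned = [(text or "").strip() for text in pages]
    non_empty = [text for text in cleaned if text]
    if not non_empty:
        # Typical for scanned-image-only PDFs, which carry no text layer.
        raise UnreadablePdfError(
            f"No extractable text found in {name}. "
            "Scanned-image-only PDFs are not supported."
        )
    return "\n".join(non_empty)


def extract_pdf_pages(path: str) -> List[Optional[str]]:
    """Return the text of each page in order. Assumes pypdf is installed."""
    from pypdf import PdfReader  # assumed dependency, imported lazily

    try:
        reader = PdfReader(path)
    except Exception as exc:  # malformed or truncated file
        raise UnreadablePdfError(f"Could not read {path}: {exc}") from exc
    if reader.is_encrypted:
        raise UnreadablePdfError(f"{path} is encrypted and cannot be read.")
    return [page.extract_text() for page in reader.pages]
```

Keeping the join-and-validate step separate from the parser call means the "no text found" behavior can be unit tested with plain strings, and only one test needs a real PDF fixture. A caller would combine the two as `join_page_texts(extract_pdf_pages(path), name=path)`.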
