LRriver opened a new issue, #345:
URL: https://github.com/apache/hugegraph-ai/issues/345

   ### Search before asking
   
   - [x] I have searched the [feature](https://github.com/apache/hugegraph-ai/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature request.
   
   
   ### Feature Description
   
   ## Feature Description
   
   The demo upload path supports `.txt` and `.docx`, but PDF files currently
   raise an error. Please add PDF text extraction support for uploaded
   documents used by both vector index building and graph extraction.
   
   ## Current verification
   
   - `read_documents()` handles `.txt` and `.docx`.
   - `.pdf` currently hits a TODO and raises
     `gr.Error("PDF will be supported later! Try to upload text/docx now")` in
     `hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py:48`.
   - The same `read_documents()` helper is used by both `build_vector_index()`
     and `extract_graph()`.
   - Current project dependencies include `python-docx`, but no PDF parser
     dependency was found in the checked files.
   
   ## Suggested scope
   
   - Choose a maintained PDF text extraction dependency and add it through the
     existing `uv` dependency setup.
   - Extract text page by page and join it in a stable order.
   - Return a clear error for encrypted, scanned-image-only, or unreadable PDFs.
   - Keep `.txt` and `.docx` behavior unchanged.
   - Update UI copy that currently says uploads should be TXT or DOCX only.
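
The scope above could be sketched roughly as follows. This is only an illustration under stated assumptions: `pypdf` is one possible dependency (the issue does not pick a library), and `UnreadablePdfError`, `join_page_texts`, and `read_pdf_text` are hypothetical names. In the real helper the error would presumably be raised as `gr.Error` so that it reaches the UI.

```python
from typing import Iterable, Optional


class UnreadablePdfError(ValueError):
    """Hypothetical stand-in for the gr.Error the demo UI would raise."""


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text in page order, skipping pages with no text."""
    pages = [text.strip() for text in page_texts if text and text.strip()]
    if not pages:
        # Scanned or image-only PDFs typically yield no extractable text.
        raise UnreadablePdfError(
            "No extractable text found in PDF (it may be scanned or image-only)."
        )
    return "\n\n".join(pages)


def read_pdf_text(path: str) -> str:
    """Extract text page by page, assuming pypdf is the chosen dependency."""
    # Imported lazily so the TXT/DOCX paths do not depend on pypdf.
    from pypdf import PdfReader
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(path)
        if reader.is_encrypted:
            raise UnreadablePdfError("Encrypted PDFs are not supported.")
        # reader.pages iterates in document order, which keeps output stable.
        return join_page_texts(page.extract_text() for page in reader.pages)
    except PdfReadError as exc:
        raise UnreadablePdfError(f"Could not read PDF: {exc}") from exc
```

Keeping the join and the empty-text check in a separate function means the "no text found" behavior can be unit tested without a PDF fixture or the parser installed.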
   
   ## Mermaid reference
   
   ```mermaid
   flowchart LR
       Upload[Uploaded files] --> Detect{Extension}
       Detect -->|.txt| Txt[Read UTF-8 text]
       Detect -->|.docx| Docx[Read paragraphs]
       Detect -->|.pdf| Pdf[Extract page text]
       Pdf --> Validate{Text found?}
       Validate -->|yes| Texts[Append document text]
       Validate -->|no| Error[Readable Gradio error]
       Txt --> Texts
       Docx --> Texts
       Texts --> Consumers[Vector index and graph extraction]
   ```
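
The flowchart above can be read as an extension-keyed dispatch. A minimal sketch, assuming hypothetical names (`read_documents_sketch`, `read_txt`, and a `readers` mapping) that do not exist in the current code, with `ValueError` standing in for the Gradio error:

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def read_txt(path: str) -> str:
    # Mirrors the "Read UTF-8 text" node in the flowchart.
    return Path(path).read_text(encoding="utf-8")


def read_documents_sketch(
    paths: Iterable[str], readers: Dict[str, Callable[[str], str]]
) -> List[str]:
    """Dispatch each file to a reader by extension (hypothetical helper)."""
    texts = []
    for path in paths:
        suffix = Path(path).suffix.lower()
        reader = readers.get(suffix)
        if reader is None:
            raise ValueError(f"Unsupported file type: {suffix or path}")
        text = reader(path)
        # The "Text found?" check applies to PDFs only, so that the
        # existing TXT and DOCX behavior stays unchanged.
        if suffix == ".pdf" and not text.strip():
            raise ValueError(f"No readable text found in {Path(path).name}")
        texts.append(text)
    return texts
```

A caller would pass something like `{".txt": read_txt, ".docx": ..., ".pdf": ...}`, so the PDF reader can be swapped or stubbed in tests without touching the other branches.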
   
   ## Acceptance criteria
   
   - Uploading a text-based PDF no longer raises the current TODO error.
   - Extracted PDF text can be used by both vector index import and graph
     extraction.
   - Unreadable PDFs produce a user-facing error that explains the limitation.
   - Existing TXT and DOCX upload tests still pass.
   
   ## Suggested tests
   
   - Unit test for `read_documents()` with a small text-based PDF fixture.
   - Unit test for unsupported or unreadable PDF behavior.
   - Regression test that TXT and DOCX paths still work.
   
   
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
