[PR] feat(indexing): add markdown document indexing [solr-mcp]

via GitHub Thu, 11 Jun 2026 13:44:34 -0700


shahzadarain opened a new pull request, #144:
URL: https://github.com/apache/solr-mcp/pull/144


   ## Summary
   
   Adds an `index-markdown-documents` MCP tool that indexes markdown content 
into Solr, alongside the existing JSON/CSV/XML tools — extracting searchable 
structure rather than a flat text blob.
   
   Closes #69
   
   ## Design
   
   A new `MarkdownDocumentCreator` follows the existing strategy pattern in 
`indexing/documentcreator/`, parsing with **CommonMark-Java 0.28.0** plus its 
YAML front matter extension. The parser is lightweight and reflection-free, so 
it is GraalVM native-image safe and needs no extra hints.
   
   Field extraction:
   
   | Field | Source |
   |---|---|
   | front matter entries | each entry becomes a field (names sanitized via `FieldNameSanitizer`; block- and flow-style lists become multi-valued fields) |
   | `id` | front matter `id` if present, otherwise SHA-256 of the input |
   | `title` | front matter `title`, else first level-1 heading |
   | `headings` | multi-valued field with every heading text (the document outline) |
   | `content` | plain text body, front matter excluded |
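
   The mapping in the table above can be sketched in plain Java. This is an 
illustration only, not the code in this PR: it assumes the markdown has already 
been parsed into a front matter map, a heading list, and a plain-text body. The 
class `MarkdownFieldMappingSketch`, the `Heading` record, and the `sanitize` 
method are hypothetical names; `sanitize` is a stand-in for the project's 
`FieldNameSanitizer`, whose actual rules are not shown here. The `id` field is 
left out of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class MarkdownFieldMappingSketch {

    // Hypothetical holder for a parsed heading: its level (1-6) and its text.
    record Heading(int level, String text) {}

    // Assumption: a stand-in for FieldNameSanitizer. The real sanitization rules may differ.
    static String sanitize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9_]+", "_");
    }

    static Map<String, Object> toFields(Map<String, List<String>> frontMatter,
                                        List<Heading> headings,
                                        String plainTextBody) {
        Map<String, Object> fields = new LinkedHashMap<>();

        // Each front matter entry becomes a field; entries with several values stay multi-valued.
        frontMatter.forEach((key, values) -> {
            if (values.isEmpty()) {
                return; // nothing to index for this key
            }
            fields.put(sanitize(key), values.size() == 1 ? values.get(0) : List.copyOf(values));
        });

        // Title: the front matter value wins, otherwise fall back to the first level-1 heading.
        if (!fields.containsKey("title")) {
            headings.stream()
                    .filter(h -> h.level() == 1)
                    .findFirst()
                    .ifPresent(h -> fields.put("title", h.text()));
        }

        // Headings: every heading text, in document order (the outline).
        List<String> outline = headings.stream().map(Heading::text).toList();
        if (!outline.isEmpty()) {
            fields.put("headings", outline);
        }

        // Content: the plain-text body, with front matter already excluded by the caller.
        fields.put("content", plainTextBody);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> frontMatter = new LinkedHashMap<>();
        frontMatter.put("Tags", List.of("solr", "mcp"));
        List<Heading> headings = List.of(new Heading(1, "Guide"), new Heading(2, "Setup"));
        System.out.println(toFields(frontMatter, headings, "Guide Setup ..."));
    }
}
```

   In this sketch a single-valued entry is stored as a plain string and a list 
as a `List<String>`, which mirrors the single-valued versus multi-valued 
distinction described in the table.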
   
   The `index-data` MCP prompt also accepts `markdown`/`md` formats.
   
   ## Maintainer feedback from #69, addressed
   
   **Stable id / idempotency:** when front matter supplies no `id`, the 
document id is derived deterministically from a SHA-256 hash of the input, so 
re-indexing identical markdown overwrites the same document — matching the 
tool's `idempotentHint`. Since LLM-driven conversion of other formats to 
markdown is non-deterministic, the tool description also instructs clients to 
supply a stable front matter `id` when indexing converted content.
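
   A minimal sketch of the id fallback described above, using only the JDK. 
The class and method names (`MarkdownIdSketch`, `resolveId`, `sha256Hex`) are 
hypothetical, and the UTF-8 encoding and lowercase hex rendering of the digest 
are assumptions; the PR only states that the id is a SHA-256 hash of the input.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

final class MarkdownIdSketch {

    // Prefer the front matter id; otherwise derive the id from the raw markdown input.
    static String resolveId(Optional<String> frontMatterId, String markdown) {
        return frontMatterId
                .filter(id -> !id.isBlank())
                .orElseGet(() -> sha256Hex(markdown));
    }

    // Assumption: hash the UTF-8 bytes of the input and render the digest as lowercase hex.
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String markdown = "# Release notes\n\nFirst entry.";
        // Identical input yields the same id, so a re-index overwrites the same document.
        System.out.println(resolveId(Optional.empty(), markdown));
        System.out.println(resolveId(Optional.empty(), markdown));
        // A front matter id takes precedence over the content hash.
        System.out.println(resolveId(Optional.of("release-notes"), markdown));
    }
}
```

   Because the fallback id depends on every byte of the input, any change in 
the markdown, including whitespace, produces a different id. That is why a 
stable front matter `id` matters for content whose markdown form can vary 
between runs.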
   
   **Tool steering:** the tool description now reads: *"Do NOT use for 
JSON/CSV/XML input; use index-json-documents, index-csv-documents, or 
index-xml-documents instead. Only convert source content to markdown when there 
is no dedicated tool for the source format, and supply a stable 'id' in the 
YAML front matter when doing so."*
   
   ## Testing
   
   - `MarkdownIndexingTest` — 10 unit tests: front matter extraction, title 
resolution (front matter wins over H1), heading collection, plain-text body 
with formatting stripped, field name sanitization, flow- and block-style YAML 
lists, front matter id, content-hash id stability, empty-input rejection
   - Prompt-path tests for `markdown` and `md` formats in `IndexingServiceTest`
   - MCP round-trip added to `McpClientIntegrationTestBase`: index markdown via 
the tool, then find it by front matter id (runs across all transport × runtime 
combinations)
   - Tool registered in `listToolsReturnsExpectedTools` and behavior hints 
asserted in `toolsExposeBehaviorHints`
   - `./gradlew spotlessApply` clean; unit tests green locally on JDK 25
   
   Docs updated: README tool table, AGENTS.md format/creator lists.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

