shahzadarain opened a new pull request, #144: URL: https://github.com/apache/solr-mcp/pull/144
## Summary

Adds an `index-markdown-documents` MCP tool that indexes markdown content into Solr, alongside the existing JSON/CSV/XML tools, extracting searchable structure rather than a flat text blob.

Closes #69

## Design

A new `MarkdownDocumentCreator` follows the existing strategy pattern in `indexing/documentcreator/`, parsing with **CommonMark-Java 0.28.0** plus its YAML front matter extension (lightweight and reflection-free, so GraalVM native-image safe; no extra hints needed).

Field extraction:

| Field | Source |
|---|---|
| front matter entries | each entry becomes a field (names sanitized via `FieldNameSanitizer`; block- and flow-style lists become multi-valued fields) |
| `id` | front matter `id` if present, otherwise SHA-256 of the input |
| `title` | front matter `title`, else first level-1 heading |
| `headings` | multi-valued field with every heading text (the document outline) |
| `content` | plain text body, front matter excluded |

The `index-data` MCP prompt also accepts `markdown`/`md` formats.

## Maintainer feedback from #69, addressed

**Stable id / idempotency:** when front matter supplies no `id`, the document id is derived deterministically from a SHA-256 hash of the input, so re-indexing identical markdown overwrites the same document, matching the tool's `idempotentHint`. Since LLM-driven conversion of other formats to markdown is non-deterministic, the tool description also instructs clients to supply a stable front matter `id` when indexing converted content.

**Tool steering:** the tool description now reads: *"Do NOT use for JSON/CSV/XML input; use index-json-documents, index-csv-documents, or index-xml-documents instead. Only convert source content to markdown when there is no dedicated tool for the source format, and supply a stable 'id' in the YAML front matter when doing so."*

## Testing

- `MarkdownIndexingTest`: 10 unit tests covering front matter extraction, title resolution (front matter wins over H1), heading collection, plain-text body with formatting stripped, field name sanitization, flow- and block-style YAML lists, front matter id, content-hash id stability, and empty-input rejection
- Prompt-path tests for `markdown` and `md` formats in `IndexingServiceTest`
- MCP round-trip added to `McpClientIntegrationTestBase`: index markdown via the tool, then find it by front matter id (runs across all transport × runtime combinations)
- Tool registered in `listToolsReturnsExpectedTools` and behavior hints asserted in `toolsExposeBehaviorHints`
- `./gradlew spotlessApply` clean; unit tests green locally on JDK 25

Docs updated: README tool table, AGENTS.md format/creator lists.
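
For reviewers, the `id` and `title` fallbacks described under Design can be sketched roughly as below. This is an illustration only, not the code in this PR: the class `MarkdownFieldFallbacks` and its method names are hypothetical, and it assumes that a blank front matter value counts as absent, that the hash is taken over the UTF-8 bytes of the raw input, and that the digest is rendered as lowercase hex. The actual `MarkdownDocumentCreator` may differ on any of these points.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

/** Hypothetical helper illustrating the id/title fallbacks; not the PR's code. */
final class MarkdownFieldFallbacks {

    private MarkdownFieldFallbacks() {}

    /** Front matter id wins; otherwise derive a deterministic id from the input. */
    public static String resolveId(String frontMatterId, String markdown) {
        // Assumption: a blank id is treated the same as a missing one.
        if (frontMatterId != null && !frontMatterId.isBlank()) {
            return frontMatterId;
        }
        return sha256Hex(markdown);
    }

    /** Front matter title wins; otherwise use the first level-1 heading, if any. */
    public static String resolveTitle(String frontMatterTitle, List<String> levelOneHeadings) {
        if (frontMatterTitle != null && !frontMatterTitle.isBlank()) {
            return frontMatterTitle;
        }
        return levelOneHeadings.isEmpty() ? null : levelOneHeadings.get(0);
    }

    /** SHA-256 over the UTF-8 bytes, as lowercase hex (encoding is an assumption). */
    public static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JDK ships SHA-256, so this should be unreachable.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Because the fallback id depends only on the input bytes, indexing the same markdown twice yields the same id and the second write overwrites the first. Any change to the text, even whitespace, produces a different id, which is why the tool description asks clients to supply a stable front matter `id` for converted content.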
