This is an automated email from the ASF dual-hosted git repository.
acosentino pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/camel-jbang-examples.git
The following commit(s) were added to refs/heads/main by this push:
new 01885bb Added an example of usage of docling-serve, ollama and langchain4j (#44)
01885bb is described below
commit 01885bb915e4091c377326fc2e254e0378640cf1
Author: Andrea Cosentino <[email protected]>
AuthorDate: Tue Oct 14 14:48:44 2025 +0200
Added an example of usage of docling-serve, ollama and langchain4j (#44)
Signed-off-by: Andrea Cosentino <[email protected]>
---
docling-langchain4j-rag/README.adoc | 651 +++++++++++++++++++++
docling-langchain4j-rag/application.properties | 39 ++
docling-langchain4j-rag/compose.yaml | 45 ++
.../docling-langchain4j-rag.yaml | 378 ++++++++++++
docling-langchain4j-rag/run.sh | 27 +
docling-langchain4j-rag/sample.md | 65 ++
6 files changed, 1205 insertions(+)
diff --git a/docling-langchain4j-rag/README.adoc b/docling-langchain4j-rag/README.adoc
new file mode 100644
index 0000000..326b535
--- /dev/null
+++ b/docling-langchain4j-rag/README.adoc
@@ -0,0 +1,651 @@
+= Document Analysis with Docling and LangChain4j RAG
+
+This example demonstrates a complete RAG (Retrieval Augmented Generation) workflow using Apache Camel, combining:
+
+* **Docling** - AI-powered document conversion (PDF, Word, PowerPoint → Markdown/JSON)
+* **LangChain4j** - Integration with Large Language Models
+* **Ollama** - Local LLM inference
+
+== Overview
+
+This application provides intelligent document processing capabilities:
+
+* **Automatic Document Conversion** - Convert various document formats to Markdown using Docling
+* **AI-Powered Analysis** - Analyze documents using LLMs via LangChain4j
+* **Interactive Q&A** - Ask questions about your documents through REST API
+* **Batch Processing** - Summarize multiple documents automatically
+* **Structured Data Extraction** - Extract tables and structured information from documents
+
+== Architecture
+
+=== Components
+
+[source,text]
+----
+Documents → Docling (Convert) → Markdown → LangChain4j → Ollama (LLM) → Analysis
+----
+
+**Docling-Serve**: Python-based document conversion service running in Docker
+
+**Ollama**: Local LLM server running models like Llama 3.2
+
+**Camel Routes**: Orchestrate the workflow between components
+
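+A condensed sketch of that orchestration in the Camel YAML DSL (the full route
+in `docling-langchain4j-rag.yaml` adds logging, report formatting and cleanup;
+endpoint parameters such as `doclingServeUrl` and `chatModel` are omitted):
+
+[source,yaml]
+----
+- route:
+    id: rag-pipeline-sketch
+    from:
+      uri: file:documents                 # watch for incoming documents
+    steps:
+      - to: docling:CONVERT_TO_MARKDOWN   # Docling: document -> Markdown
+      - setBody:
+          simple: "Summarize this document:\n${body}"
+      - to: langchain4j-chat:analysis     # LangChain4j -> Ollama
+      - to: file:output                   # write the analysis report
+----
+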
+=== Features
+
+* **Document Format Support**: PDF, DOCX, PPTX, HTML, Markdown
+* **Multiple Operations**: Analysis, Q&A, Summarization, Data Extraction
+* **Docker-based**: All services run in containers
+* **REST API**: HTTP endpoints for interaction
+* **Automatic Processing**: File watcher for automatic document processing
+
+== Prerequisites
+
+* JBang installed (https://www.jbang.dev)
+* Java 17 or later (required by Camel 4)
+* Docker and Docker Compose
+
+== Project Structure
+
+[source,text]
+----
+docling-langchain4j-rag/
+├── docling-langchain4j-rag.yaml # Main YAML configuration
+├── application.properties # Configuration settings
+├── compose.yaml # Docker Compose for services
+├── run.sh # Convenience run script
+├── sample.md # Sample document (copy to documents/ for testing)
+├── README.adoc # This file
+├── documents/ # Input directory (files auto-deleted after processing)
+└── output/ # Analysis reports output
+----
+
+== Setup
+
+=== Step 1: Start Required Services
+
+You have three options for running the required services:
+
+==== Option A: Using Docker Compose (Recommended)
+
+Start both Docling and Ollama services:
+
+[source,sh]
+----
+$ docker compose up -d
+----
+
+Pull the Ollama model (first time only):
+
+[source,sh]
+----
+$ docker exec -it ollama ollama pull orca-mini
+----
+
+Verify services are running:
+
+[source,sh]
+----
+$ curl http://localhost:5001/ # Docling
+$ curl http://localhost:11434/ # Ollama
+----
+
+==== Option B: Using Camel Infra Commands (If Available)
+
+[source,sh]
+----
+# Start Docling (if camel infra supports it)
+$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel infra run docling
+
+# Start Ollama (if camel infra supports it)
+$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel infra run ollama
+----
+
+==== Option C: Manual Docker Commands
+
+[source,sh]
+----
+# Start Docling-Serve
+$ docker run -d -p 5001:5001 --name docling-serve ghcr.io/docling-project/docling-serve:latest
+
+# Start Ollama
+$ docker run -d -p 11434:11434 --name ollama ollama/ollama:latest
+
+# Pull Ollama model
+$ docker exec -it ollama ollama pull orca-mini
+----
+
+=== Step 2: Create Required Directories
+
+The `documents/` and `output/` directories will be created automatically when needed, but you can create them manually:
+
+[source,sh]
+----
+$ mkdir -p documents output
+----
+
+**Note:** Files placed in `documents/` will be automatically processed and then **deleted** after analysis is complete.
+
+=== Step 3: Run the Camel Application
+
+[source,sh]
+----
+$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel run \
+ --fresh \
+ --dep=camel:docling \
+ --dep=camel:langchain4j-chat \
+ --dep=camel:platform-http \
+ --dep=dev.langchain4j:langchain4j:1.6.0 \
+ --dep=dev.langchain4j:langchain4j-ollama:1.6.0 \
+ --properties=application.properties \
+ docling-langchain4j-rag.yaml
+----
+
+The application will start and listen on port 8080.
+
+== Usage
+
+=== 1. Automatic Document Analysis
+
+Copy a document to the `documents/` directory for processing:
+
+[source,sh]
+----
+# Using the provided sample
+$ cp sample.md documents/
+
+# Or use your own document
+$ cp /path/to/your/document.pdf documents/
+----
+
+The system will:
+
+1. Detect the new file
+2. Convert it to Markdown using Docling
+3. Analyze it with the LLM
+4. Generate a comprehensive analysis report in `output/`
+5. **Automatically delete the source file** from `documents/` after processing
+
+**Example Output** (`output/sample.md_analysis.md`):
+
+[source,markdown]
+----
+# Document Analysis Report
+
+**File:** sample.md
+**Date:** 2025-10-14 12:30:45
+
+---
+
+## AI Analysis
+
+**Summary:** This document discusses the implementation of RAG systems...
+
+**Key Topics:**
+- Document processing pipelines
+- LLM integration patterns
+- Vector embeddings and similarity search
+
+**Important Findings:**
+- RAG improves LLM accuracy by 40%
+- Hybrid search outperforms pure vector search
+...
+
+---
+
+## Full Document Content (Markdown)
+
+[Full converted markdown content here]
+----
+
+=== 2. Interactive Q&A
+
+Ask questions about your documents via HTTP API:
+
+[source,sh]
+----
+$ curl -X POST http://localhost:8080/api/ask \
+ -H "Content-Type: text/plain" \
+ -d "What are the main topics discussed in the document?"
+----
+
+**Response:**
+
+[source,text]
+----
+The document discusses three main topics:
+1. RAG (Retrieval Augmented Generation) architecture
+2. Document processing with Docling
+3. Integration with LangChain4j for LLM orchestration
+----
+
+=== 3. Structured Data Extraction
+
+Extract tables and structured data:
+
+[source,sh]
+----
+$ curl -X POST http://localhost:8080/api/extract \
+ -H "Content-Type: application/octet-stream" \
+ --data-binary "@documents/report.pdf"
+----
+
+**Response:**
+
+[source,text]
+----
+**Document Type:** Financial Report
+
+**Key Data Fields:**
+- Revenue: $1.2M (Table 1, Row 3)
+- Expenses: $800K (Table 1, Row 5)
+- Net Profit: $400K (calculated)
+
+**Tables Identified:**
+1. Quarterly Financial Summary (5 rows, 4 columns)
+2. Department Breakdown (8 rows, 3 columns)
+...
+----
+
+=== 4. Health Check
+
+Check system status:
+
+[source,sh]
+----
+$ curl http://localhost:8080/api/health
+----
+
+**Response:**
+
+[source,json]
+----
+{
+ "status": "healthy",
+ "components": {
+ "docling": {
+ "url": "http://localhost:5001",
+ "status": "configured"
+ },
+ "ollama": {
+ "url": "http://localhost:11434",
+ "model": "llama3.2",
+ "status": "configured"
+ }
+ },
+ "directories": {
+ "documents": "documents",
+ "output": "output"
+ }
+}
+----
+
+== Configuration
+
+=== application.properties
+
+[source,properties]
+----
+# Directories
+documents.directory=documents
+output.directory=output
+
+# Docling-Serve URL
+docling.serve.url=http://localhost:5001
+
+# Ollama Configuration
+ollama.base.url=http://localhost:11434
+ollama.model.name=orca-mini
+
+# Server Port
+camel.server.port=8080
+----
+
+=== Using Different Ollama Models
+
+Available models:
+
+* **orca-mini** (configured default) - Small model this example uses out of the box
+* **llama3.2** - Recent Llama model, good balance of speed and quality
+* **llama3.2:1b** - Smaller, faster model
+* **mistral** - Alternative high-quality model
+* **phi3** - Microsoft's efficient model
+* **gemma2** - Google's Gemma model
+
+To use a different model:
+
+1. Pull the model:
+
+[source,sh]
+----
+$ docker exec -it ollama ollama pull mistral
+----
+
+2. Update `application.properties`:
+
+[source,properties]
+----
+ollama.model.name=mistral
+----
+
+3. Restart the Camel application
+
+=== Using Remote Ollama Instance
+
+To use Ollama running on a different machine:
+
+[source,properties]
+----
+ollama.base.url=http://remote-server:11434
+----
+
+== Routes Explanation
+
+=== Route 1: document-analysis-workflow
+
+**Trigger:** New file in `documents/` directory
+
+**Flow:**
+
+1. Detect new document
+2. Convert to Markdown via Docling
+3. Send to LLM for analysis
+4. Generate comprehensive report
+5. Save to `output/` directory
+
+**Supported Formats:** PDF, DOCX, PPTX, HTML, MD
+
+=== Route 2: document-qa-api
+
+**Endpoint:** `POST /api/ask`
+
+**Description:** Answer questions about the most recent document
+
+**Input:** Plain text question
+
+**Output:** AI-generated answer based on document content
+
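+The heart of this route is prompt construction: the converted document and the
+user's question are stuffed into a single prompt. A condensed sketch (the real
+route also locates the latest document, handles the no-document case, and sets
+the Docling and chat model endpoint parameters):
+
+[source,yaml]
+----
+- route:
+    id: qa-sketch
+    from:
+      uri: platform-http:/api/ask
+    steps:
+      - setProperty:
+          name: userQuestion
+          simple: "${body}"
+      - to: docling:CONVERT_TO_MARKDOWN   # body becomes the Markdown content
+      - setBody:
+          simple: |
+            Document content:
+            ${body}
+
+            Question: ${exchangeProperty.userQuestion}
+      - to: langchain4j-chat:qa
+----
+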
+=== Route 3: batch-summarization
+
+**Trigger:** Timer (configurable)
+
+**Description:** Process all documents in batch and generate summaries
+
+**Configuration:** Set `batch.delay` in application.properties (the shipped configuration uses 10000 ms; set it to -1 to disable)
+
+=== Route 4: health-check
+
+**Endpoint:** `GET /api/health`
+
+**Description:** System health and configuration status
+
+=== Route 5: extract-structured-data
+
+**Endpoint:** `POST /api/extract`
+
+**Description:** Extract tables and structured data from uploaded documents
+
+**Input:** Binary document data
+
+**Output:** AI analysis of extracted structured data
+
+== Advanced Usage
+
+=== Batch Processing
+
+Enable automatic batch summarization:
+
+[source,properties]
+----
+# Run every 1 hour (3600000 ms)
+batch.delay=3600000
+----
+
+All documents in the `documents/` directory will be summarized periodically.
+
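+If you prefer a fixed schedule to a startup delay, the timer endpoint can be
+swapped for a cron trigger. A minimal sketch, assuming `--dep=camel:quartz` is
+added to the run command:
+
+[source,yaml]
+----
+- route:
+    id: batch-summarization-cron
+    from:
+      uri: quartz:batchSummarize
+      parameters:
+        cron: "0 0 * * * ?"   # top of every hour (Quartz cron syntax)
+    steps:
+      - log: "Starting scheduled batch summarization..."
+      # ... same steps as the batch-summarization route ...
+----
+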
+=== Custom Document Processing
+
+You can extend the routes to add custom processing logic:
+
+[source,yaml]
+----
+- route:
+ id: custom-processing
+ from:
+ uri: file:documents
+ parameters:
+ include: ".*\\.pdf"
+ steps:
+ # Your custom processing here
+ - to: docling:CONVERT_TO_HTML
+ - to: langchain4j-chat:custom
+----
+
+=== Integration with Vector Stores
+
+For production RAG, consider adding vector embeddings:
+
+[source,yaml]
+----
+# Add after document conversion
+- to: langchain4j-embeddings:embed
+- to: your-vector-store
+----
+
+== Troubleshooting
+
+=== Docling Not Responding
+
+**Check Docling service:**
+
+[source,sh]
+----
+$ docker logs docling-serve
+$ curl http://localhost:5001/
+----
+
+**Restart service:**
+
+[source,sh]
+----
+$ docker restart docling-serve
+----
+
+=== Ollama Model Not Found
+
+**Pull the model:**
+
+[source,sh]
+----
+$ docker exec -it ollama ollama pull orca-mini
+----
+
+**Check available models:**
+
+[source,sh]
+----
+$ docker exec -it ollama ollama list
+----
+
+=== Slow Document Processing
+
+**Causes:**
+
+* Large documents (>100 pages)
+* Complex layouts with many images
+* Limited CPU/memory
+
+**Solutions:**
+
+* Increase the model timeout in the `chatModel` bean (`docling-langchain4j-rag.yaml`):
+
+[source,groovy]
+----
+.timeout(ofSeconds(300))   // this example ships with ofSeconds(120)
+----
+
+* Use a smaller/faster model (llama3.2:1b)
+* Process smaller documents first
+
+=== Out of Memory
+
+**Increase Docker memory:**
+
+[source,sh]
+----
+# In Docker Desktop: Settings → Resources → Memory
+# Recommended: 8GB or more for LLMs
+----
+
+== Performance Considerations
+
+=== Document Conversion
+
+* **PDF**: 1-5 seconds per page (depends on complexity)
+* **DOCX**: 0.5-2 seconds per page
+* **OCR-required**: 5-10 seconds per page (scanned PDFs)
+
+=== LLM Inference
+
+* **llama3.2 (3B)**: 5-15 seconds per response
+* **llama3.2:1b**: 2-5 seconds per response
+* **Speed depends on**: Prompt length, context size, hardware
+
+=== Recommended Hardware
+
+* **Minimum**: 8GB RAM, 4 CPU cores
+* **Recommended**: 16GB RAM, 8 CPU cores, GPU (optional)
+
+== Security Considerations
+
+=== Current Implementation
+
+* **Development Setup** - Not production-ready
+* **No Authentication** - Open HTTP endpoints
+* **Local Processing** - Data stays on your machine
+
+=== Production Recommendations
+
+**1. Authentication & Authorization**
+
+[source,yaml]
+----
+# Add to routes that call secured downstream services
+- setHeader:
+ name: Authorization
+ constant: "Bearer ${env:API_TOKEN}"
+----
+
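+The snippet above attaches a token when calling secured downstream services.
+For the incoming endpoints themselves, a token gate can be added at the top of
+each route. A minimal sketch, assuming the expected value comes from a
+hypothetical `API_TOKEN` environment variable:
+
+[source,yaml]
+----
+- choice:
+    when:
+      # {{env:API_TOKEN}} is resolved from the environment at route build time
+      - simple: "${header.Authorization} != 'Bearer {{env:API_TOKEN}}'"
+        steps:
+          - setHeader:
+              name: CamelHttpResponseCode
+              constant: "401"
+          - setBody:
+              constant: "Unauthorized"
+          - stop: {}
+----
+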
+**2. Input Validation**
+
+* Validate file sizes
+* Check file types
+* Scan for malware
+
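+A minimal sketch of a size gate for the `/api/extract` upload endpoint,
+assuming a 10 MB limit (a similar `when` branch can check `Content-Type`):
+
+[source,yaml]
+----
+- choice:
+    when:
+      # reject uploads larger than 10 MB
+      - simple: "${header.Content-Length} > 10485760"
+        steps:
+          - setHeader:
+              name: CamelHttpResponseCode
+              constant: "413"
+          - setBody:
+              constant: "Payload too large"
+          - stop: {}
+----
+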
+**3. Rate Limiting**
+
+* Implement request throttling
+* Add queue management
+
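+Camel's Throttle EIP can enforce this directly in a route. A minimal sketch
+capping the Q&A endpoint at 10 requests per minute:
+
+[source,yaml]
+----
+- route:
+    id: throttled-qa
+    from:
+      uri: platform-http:/api/ask
+    steps:
+      - throttle:
+          expression:
+            constant: "10"
+          timePeriodMillis: 60000
+      # ... existing Q&A steps ...
+----
+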
+**4. Data Privacy**
+
+* Encrypt sensitive documents
+* Secure API endpoints with TLS
+* Implement access logging
+
+== Production Deployment
+
+=== Using Kubernetes
+
+[source,yaml]
+----
+# Illustrative sketch only - no k8s-deployment.yaml ships with this example
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: docling-langchain4j-rag
+spec:
+ replicas: 3
+ ...
+----
+
+=== Scaling Considerations
+
+* **Horizontal**: Multiple Camel instances with load balancer
+* **Vertical**: Increase memory/CPU for Ollama container
+* **Caching**: Cache frequent document conversions
+
+== Cleanup
+
+Stop all services:
+
+[source,sh]
+----
+# Docker Compose
+$ docker compose down
+
+# Or manual cleanup
+$ docker stop docling-serve ollama
+$ docker rm docling-serve ollama
+----
+
+Remove volumes (optional):
+
+[source,sh]
+----
+$ docker volume rm docling-langchain4j-rag_ollama_data
+----
+
+== Alternative Configurations
+
+=== Using OpenAI Instead of Ollama
+
+[source,properties]
+----
+# application.properties
+openai.api.key=sk-your-api-key-here
+----
+
+[source,yaml]
+----
+# Update bean configuration
+- name: chatModel
+ type: dev.langchain4j.model.chat.ChatModel
+ scriptLanguage: groovy
+ script: |
+ import dev.langchain4j.model.openai.OpenAiChatModel
+
+ return OpenAiChatModel.builder()
+ .apiKey("{{openai.api.key}}")
+ .modelName("gpt-4")
+ .temperature(0.3)
+ .build()
+----
+
+=== Using Cloud Docling Service
+
+If you have a cloud-hosted Docling service:
+
+[source,properties]
+----
+docling.serve.url=https://your-docling-service.com
+docling.auth.token=your-auth-token
+----
+
+== References
+
+* **Docling**: https://github.com/docling-project/docling
+* **LangChain4j**: https://github.com/langchain4j/langchain4j
+* **Ollama**: https://ollama.ai
+* **Apache Camel**: https://camel.apache.org
+* **Camel Docling Component**: https://camel.apache.org/components/ (camel-docling)
+* **Camel LangChain4j Components**: https://camel.apache.org/components/ (camel-langchain4j-chat and related)
+
+== Help and Contributions
+
+If you hit any problem using Camel or have some feedback, then please https://camel.apache.org/community/support/[let us know].
+
+We also love contributors, so https://camel.apache.org/community/contributing/[get involved] :-)
+
+The Camel riders!
diff --git a/docling-langchain4j-rag/application.properties b/docling-langchain4j-rag/application.properties
new file mode 100644
index 0000000..1c416dd
--- /dev/null
+++ b/docling-langchain4j-rag/application.properties
@@ -0,0 +1,39 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Application Configuration
+camel.main.name = DoclingLangChain4jRAG
+
+# Directory Configuration
+documents.directory=documents
+output.directory=output
+
+# Docling-Serve Configuration
+# Can be run via: docker run -p 5001:5001 ghcr.io/docling-project/docling-serve:latest
+# Or via: camel infra run docling (if available)
+docling.serve.url=http://localhost:5001
+
+# Ollama Configuration
+ollama.base.url=http://localhost:11434
+ollama.model.name=orca-mini
+
+# Batch Processing Configuration
+# Set to -1 to disable batch processing
+batch.delay=10000
+
+# HTTP Server Configuration
+camel.server.port=8080
diff --git a/docling-langchain4j-rag/compose.yaml b/docling-langchain4j-rag/compose.yaml
new file mode 100644
index 0000000..56f4d91
--- /dev/null
+++ b/docling-langchain4j-rag/compose.yaml
@@ -0,0 +1,45 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+services:
+ docling-serve:
+ image: ghcr.io/docling-project/docling-serve:latest
+ container_name: docling-serve
+ ports:
+ - "5001:5001"
+ restart: unless-stopped
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://localhost:5001/"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+
+ ollama:
+ image: ollama/ollama:latest
+ container_name: ollama
+ ports:
+ - "11435:11434"
+ volumes:
+ - ollama_data:/root/.ollama
+ restart: unless-stopped
+ healthcheck:
+ test: ["CMD", "ollama", "list"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+
+volumes:
+ ollama_data:
+ driver: local
diff --git a/docling-langchain4j-rag/docling-langchain4j-rag.yaml b/docling-langchain4j-rag/docling-langchain4j-rag.yaml
new file mode 100644
index 0000000..8004956
--- /dev/null
+++ b/docling-langchain4j-rag/docling-langchain4j-rag.yaml
@@ -0,0 +1,378 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Document Processing with Docling and AI Analysis with LangChain4j
+# This example demonstrates RAG (Retrieval Augmented Generation) using:
+# - Docling for document conversion (PDF, Word, etc. to Markdown)
+# - LangChain4j with Ollama for AI-powered document analysis
+
+# Bean Definitions
+- beans:
+ # Configure Ollama Chat Model
+ - name: chatModel
+ type: "#class:dev.langchain4j.model.ollama.OllamaChatModel"
+ scriptLanguage: groovy
+ script: |
+ import dev.langchain4j.model.ollama.OllamaChatModel
+ import static java.time.Duration.ofSeconds
+
+ return OllamaChatModel.builder()
+ .baseUrl("{{ollama.base.url}}")
+ .modelName("{{ollama.model.name}}")
+ .temperature(0.3)
+ .timeout(ofSeconds(120))
+ .build()
+
+# Route Definitions
+
+# Route 1: Main RAG workflow - Convert document and analyze with AI
+- route:
+ id: document-analysis-workflow
+ from:
+ uri: file:{{documents.directory}}
+ parameters:
+ include: ".*\\.(pdf|docx|pptx|html|md)"
+ noop: true
+ idempotent: true
+ steps:
+ - log: "Processing document: ${header.CamelFileName}"
+ - setProperty:
+ name: originalFileName
+ simple: "${header.CamelFileName}"
+
+ # Convert GenericFile to file path
+ - setBody:
+ simple: "${body.file.absolutePath}"
+
+ # Step 1: Convert document to Markdown using Docling
+ - log: "Converting document to Markdown with Docling..."
+ - to:
+ uri: docling:CONVERT_TO_MARKDOWN
+ parameters:
+ useDoclingServe: true
+ doclingServeUrl: "{{docling.serve.url}}"
+ contentInBody: true
+ - log: "Document converted to Markdown successfully"
+
+ # Save the file path for cleanup
+ - setProperty:
+ name: sourceFilePath
+ simple: "${exchangeProperty.originalFileName}"
+
+ # Step 2: Store converted content
+ - setProperty:
+ name: convertedMarkdown
+ simple: "${body}"
+
+ # Step 3: Log the converted content (first 500 chars)
+ - script:
+ groovy: |
+ def markdown = exchange.getProperty("convertedMarkdown", String.class)
+ def preview = markdown.length() > 500 ? markdown.substring(0, 500) + "..." : markdown
+ log.info("Converted Markdown preview:\n{}", preview)
+
+ # Step 4: Prepare AI prompt for document analysis
+ - setBody:
+ simple: |
+ You are a helpful document analysis assistant. Please analyze the following document and provide:
+ 1. A brief summary (2-3 sentences)
+ 2. Key topics and main points
+ 3. Any important findings or conclusions
+
+ Document content:
+ ${exchangeProperty.convertedMarkdown}
+
+ # Step 5: Send to LangChain4j Chat for AI analysis
+ - log: "Analyzing document with AI model..."
+ - to:
+ uri: langchain4j-chat:analysis
+ parameters:
+ chatModel: "#chatModel"
+
+ # Step 6: Store AI analysis result
+ - setProperty:
+ name: aiAnalysis
+ simple: "${body}"
+ - log: "AI analysis completed"
+
+ # Step 7: Create combined result (markdown + analysis)
+ - script:
+ groovy: |
+ def fileName = exchange.getProperty("originalFileName")
+ def markdown = exchange.getProperty("convertedMarkdown", String.class)
+ def analysis = exchange.getProperty("aiAnalysis", String.class)
+ def dateStr = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new java.util.Date())
+
+ def result = "# Document Analysis Report\n\n" +
+ "**File:** ${fileName}\n" +
+ "**Date:** ${dateStr}\n\n" +
+ "---\n\n" +
+ "## AI Analysis\n\n" +
+ "${analysis}\n\n" +
+ "---\n\n" +
+ "## Full Document Content (Markdown)\n\n" +
+ "${markdown}"
+
+ exchange.message.setBody(result)
+
+ # Step 8: Save combined result
+ - setHeader:
+ name: CamelFileName
+ simple: "${exchangeProperty.originalFileName}_analysis.md"
+ - to:
+ uri: file:{{output.directory}}
+ - log: "Analysis report saved: ${header.CamelFileName}"
+
+ # Step 9: Clean up - delete the processed file from documents directory
+ - script:
+ groovy: |
+ import java.nio.file.Files
+ import java.nio.file.Paths
+
+ def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}")
+ def fileName = exchange.getProperty("sourceFilePath")
+ def filePath = Paths.get(docDir, fileName)
+
+ if (Files.exists(filePath)) {
+ Files.delete(filePath)
+ log.info("Cleaned up source file: {}", filePath)
+ }
+ - log: "Processing complete for: ${exchangeProperty.originalFileName}"
+
+# Route 2: Interactive Q&A about documents
+- route:
+ id: document-qa-api
+ from:
+ uri: platform-http:/api/ask
+ steps:
+ - log: "Received question: ${body}"
+ - setProperty:
+ name: userQuestion
+ simple: "${body}"
+
+ # Read the most recent document from documents directory
+ - script:
+ groovy: |
+ import java.nio.file.Files
+ import java.nio.file.Paths
+ import java.util.stream.Collectors
+
+ def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}")
+ def docPath = Paths.get(docDir)
+
+ if (Files.exists(docPath)) {
+ def latestFile = Files.list(docPath)
+ .filter { f -> f.toString().matches(".*\\.(pdf|docx|pptx|html|md)") }
+ .max { f1, f2 -> Files.getLastModifiedTime(f1).compareTo(Files.getLastModifiedTime(f2)) }
+ .orElse(null)
+
+ if (latestFile != null) {
+ exchange.message.setBody(latestFile.toFile())
+ exchange.setProperty("documentFound", true)
+ } else {
+ exchange.setProperty("documentFound", false)
+ exchange.message.setBody("No documents found in directory")
+ }
+ } else {
+ exchange.setProperty("documentFound", false)
+ exchange.message.setBody("Documents directory does not exist")
+ }
+
+ # Convert document to markdown if found
+ - choice:
+ when:
+ - simple: "${exchangeProperty.documentFound} == true"
+ steps:
+ - log: "Converting document for Q&A..."
+ - to:
+ uri: docling:CONVERT_TO_MARKDOWN
+ parameters:
+ useDoclingServe: true
+ doclingServeUrl: "{{docling.serve.url}}"
+ contentInBody: true
+
+ - setProperty:
+ name: documentContent
+ simple: "${body}"
+
+ # Prepare RAG prompt
+ - setBody:
+ simple: |
+ You are a helpful assistant answering questions about documents.
+
+ Document content:
+ ${exchangeProperty.documentContent}
+
+ Question: ${exchangeProperty.userQuestion}
+
+ Please provide a clear and concise answer based on the document content above.
+
+ - to:
+ uri: langchain4j-chat:qa
+ parameters:
+ chatModel: "#chatModel"
+
+ - setHeader:
+ name: Content-Type
+ constant: "text/plain"
+ otherwise:
+ steps:
+ - setBody:
+ simple: "Error: ${body}"
+ - setHeader:
+ name: Content-Type
+ constant: "text/plain"
+
+# Route 3: Batch document summarization
+- route:
+ id: batch-summarization
+ from:
+ uri: timer:batchSummarize
+ parameters:
+ delay: "{{batch.delay}}"
+ repeatCount: 0
+ steps:
+ - log: "Starting batch document summarization..."
+ - script:
+ groovy: |
+ import java.nio.file.Files
+ import java.nio.file.Paths
+
+ def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}")
+ def docPath = Paths.get(docDir)
+
+ if (Files.exists(docPath)) {
+ def files = Files.list(docPath)
+ .filter { f -> f.toString().matches(".*\\.(pdf|docx|pptx|html|md)") }
+ .collect { it.toFile() }
+
+ exchange.message.setBody(files)
+ } else {
+ exchange.message.setBody([])
+ }
+
+ - split:
+ simple: "${body}"
+ steps:
+ - log: "Summarizing: ${body}"
+ - setProperty:
+ name: currentFile
+ simple: "${body}"
+
+ # Convert to markdown
+ - to:
+ uri: docling:CONVERT_TO_MARKDOWN
+ parameters:
+ useDoclingServe: true
+ doclingServeUrl: "{{docling.serve.url}}"
+ contentInBody: true
+
+ # Generate summary
+ - setBody:
+ simple: |
+ Please provide a concise 3-sentence summary of the following document:
+
+ ${body}
+
+ - to:
+ uri: langchain4j-chat:summary
+ parameters:
+ chatModel: "#chatModel"
+
+ - log: "Summary: ${body}"
+
+# Route 4: Health check endpoint
+- route:
+ id: health-check
+ from:
+ uri: platform-http:/api/health
+ steps:
+ - log: "Health check requested"
+ - script:
+ groovy: |
+ import groovy.json.JsonOutput
+
+ def doclingUrl = camelContext.resolvePropertyPlaceholders("{{docling.serve.url}}")
+ def ollamaUrl = camelContext.resolvePropertyPlaceholders("{{ollama.base.url}}")
+
+ def health = [
+ status: "healthy",
+ components: [
+ docling: [
+ url: doclingUrl,
+ status: "configured"
+ ],
+ ollama: [
+ url: ollamaUrl,
+ model: camelContext.resolvePropertyPlaceholders("{{ollama.model.name}}"),
+ status: "configured"
+ ]
+ ],
+ directories: [
+ documents: camelContext.resolvePropertyPlaceholders("{{documents.directory}}"),
+ output: camelContext.resolvePropertyPlaceholders("{{output.directory}}")
+ ]
+ ]
+
+ exchange.message.setBody(JsonOutput.toJson(health))
+ - setHeader:
+ name: Content-Type
+ constant: "application/json"
+
+# Route 5: Extract structured data from documents
+- route:
+ id: extract-structured-data
+ from:
+ uri: platform-http:/api/extract
+ parameters:
+ httpMethodRestrict: "POST"
+ steps:
+ - log: "Extracting structured data from uploaded document"
+ - setProperty:
+ name: uploadedContent
+ simple: "${body}"
+
+ # Extract as JSON with Docling
+ - to:
+ uri: docling:EXTRACT_STRUCTURED_DATA
+ parameters:
+ useDoclingServe: true
+ doclingServeUrl: "{{docling.serve.url}}"
+ outputFormat: "json"
+ contentInBody: true
+
+ - setProperty:
+ name: structuredData
+ simple: "${body}"
+
+ # Ask AI to analyze the structured data
+ - setBody:
+ simple: |
+ Please analyze this structured document data and identify:
+ 1. Document type and structure
+ 2. Key data fields and their values
+ 3. Any tables or structured information
+
+ Structured data:
+ ${exchangeProperty.structuredData}
+
+ - to:
+ uri: langchain4j-chat:extract
+ parameters:
+ chatModel: "#chatModel"
+
+ - setHeader:
+ name: Content-Type
+ constant: "text/plain"
diff --git a/docling-langchain4j-rag/run.sh b/docling-langchain4j-rag/run.sh
new file mode 100755
index 0000000..1c43df8
--- /dev/null
+++ b/docling-langchain4j-rag/run.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Run the Docling + LangChain4j RAG example
+
+jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel run \
+ --fresh \
+ --dep=camel:docling \
+ --dep=camel:langchain4j-chat \
+ --dep=camel:platform-http \
+ --dep=dev.langchain4j:langchain4j:1.6.0 \
+ --dep=dev.langchain4j:langchain4j-ollama:1.6.0 \
+ --properties=application.properties \
+ docling-langchain4j-rag.yaml
diff --git a/docling-langchain4j-rag/sample.md b/docling-langchain4j-rag/sample.md
new file mode 100644
index 0000000..5ad4111
--- /dev/null
+++ b/docling-langchain4j-rag/sample.md
@@ -0,0 +1,65 @@
+# Sample Document for RAG Analysis
+
+## Introduction to Apache Camel
+
+Apache Camel is an open-source integration framework based on known Enterprise Integration Patterns (EIPs). It provides a rule-based routing and mediation engine that allows developers to define routing and mediation rules in various domain-specific languages.
+
+## Key Features
+
+### 1. Routing and Mediation Engine
+
+Camel supports routing and mediation rules in various DSLs including:
+
+- Java DSL
+- XML Configuration
+- YAML DSL
+- Groovy DSL
+
+### 2. Extensive Component Library
+
+Camel provides over 300 components for integrating with:
+
+- Messaging systems (JMS, Kafka, AMQP)
+- Databases (JDBC, MongoDB, Cassandra)
+- Cloud services (AWS, Azure, Google Cloud)
+- APIs (REST, SOAP, GraphQL)
+
+### 3. Enterprise Integration Patterns
+
+Camel implements all EIPs from the famous book by Gregor Hohpe and Bobby Woolf:
+
+- Content-Based Router
+- Message Filter
+- Splitter and Aggregator
+- Dead Letter Channel
+- Wire Tap
+
+## AI Integration
+
+Camel now includes AI components for modern integration needs:
+
+### LangChain4j Components
+
+- **langchain4j-chat**: Integrate with Large Language Models
+- **langchain4j-embeddings**: Generate vector embeddings
+- **langchain4j-tools**: Create AI tools and agents
+
+### Docling Component
+
+The Docling component enables document processing:
+
+- Convert PDF, Word, PowerPoint to Markdown
+- Extract structured data from documents
+- Support for OCR and table extraction
+- Integration with AI models for document analysis
+
+## Use Cases
+
+1. **Document Processing Pipeline**: Convert documents and analyze with AI
+2. **RAG Systems**: Retrieval Augmented Generation with vector stores
+3. **Intelligent Routing**: Use LLMs to make routing decisions
+4. **Data Extraction**: Extract and transform unstructured data
+
+## Conclusion
+
+Apache Camel continues to evolve, now bridging traditional integration patterns with modern AI capabilities, making it an ideal choice for building intelligent integration solutions.