This is an automated email from the ASF dual-hosted git repository. acosentino pushed a commit to branch rag-docling-serve in repository https://gitbox.apache.org/repos/asf/camel-jbang-examples.git
commit 15b4930be49ecaaa8f2cc247757c7682ba3a7a25 Author: Andrea Cosentino <[email protected]> AuthorDate: Tue Oct 14 14:35:52 2025 +0200 Added an example of usage of docling-serve, ollama and langchain4j Signed-off-by: Andrea Cosentino <[email protected]> --- docling-langchain4j-rag/.gitignore | 11 + docling-langchain4j-rag/README.adoc | 651 +++++++++++++++++++++ docling-langchain4j-rag/application.properties | 39 ++ docling-langchain4j-rag/compose.yaml | 45 ++ .../docling-langchain4j-rag.yaml | 378 ++++++++++++ docling-langchain4j-rag/run.sh | 27 + docling-langchain4j-rag/sample.md | 65 ++ 7 files changed, 1216 insertions(+) diff --git a/docling-langchain4j-rag/.gitignore b/docling-langchain4j-rag/.gitignore new file mode 100644 index 0000000..408ca46 --- /dev/null +++ b/docling-langchain4j-rag/.gitignore @@ -0,0 +1,11 @@ +# Camel JBang working directory +.camel-jbang/ + +# Output directories +output/ + +# All documents (users copy sample.md manually for testing) +documents/* + +# Logs +*.log diff --git a/docling-langchain4j-rag/README.adoc b/docling-langchain4j-rag/README.adoc new file mode 100644 index 0000000..326b535 --- /dev/null +++ b/docling-langchain4j-rag/README.adoc @@ -0,0 +1,651 @@ += Document Analysis with Docling and LangChain4j RAG + +This example demonstrates a complete RAG (Retrieval Augmented Generation) workflow using Apache Camel, combining: + +* **Docling** - AI-powered document conversion (PDF, Word, PowerPoint → Markdown/JSON) +* **LangChain4j** - Integration with Large Language Models +* **Ollama** - Local LLM inference + +== Overview + +This application provides intelligent document processing capabilities: + +* **Automatic Document Conversion** - Convert various document formats to Markdown using Docling +* **AI-Powered Analysis** - Analyze documents using LLMs via LangChain4j +* **Interactive Q&A** - Ask questions about your documents through REST API +* **Batch Processing** - Summarize multiple documents automatically +* **Structured 
Data Extraction** - Extract tables and structured information from documents + +== Architecture + +=== Components + +[source,text] +---- +Documents → Docling (Convert) → Markdown → LangChain4j → Ollama (LLM) → Analysis +---- + +**Docling-Serve**: Python-based document conversion service running in Docker + +**Ollama**: Local LLM server running models like Llama 3.2 + +**Camel Routes**: Orchestrate the workflow between components + +=== Features + +* **Document Format Support**: PDF, DOCX, PPTX, HTML, Markdown +* **Multiple Operations**: Analysis, Q&A, Summarization, Data Extraction +* **Docker-based**: All services run in containers +* **REST API**: HTTP endpoints for interaction +* **Automatic Processing**: File watcher for automatic document processing + +== Prerequisites + +* JBang installed (https://www.jbang.dev) +* Java 17 or later (required by Apache Camel 4) +* Docker and Docker Compose + +== Project Structure + +[source,text] +---- +docling-langchain4j-rag/ +├── docling-langchain4j-rag.yaml # Main YAML configuration +├── application.properties # Configuration settings +├── compose.yaml # Docker Compose for services +├── run.sh # Convenience run script +├── sample.md # Sample document (copy to documents/ for testing) +├── README.adoc # This file +├── documents/ # Input directory (files auto-deleted after processing) +└── output/ # Analysis reports output +---- + +== Setup + +=== Step 1: Start Required Services + +You have three options for running the required services: + +==== Option A: Using Docker Compose (Recommended) + +Start both Docling and Ollama services: + +[source,sh] +---- +$ docker compose up -d +---- + +Pull the Ollama model (first time only): + +[source,sh] +---- +$ docker exec -it ollama ollama pull orca-mini +---- + +Verify services are running: + +[source,sh] +---- +$ curl http://localhost:5001/ # Docling +$ curl http://localhost:11434/ # Ollama +---- + +==== Option B: Using Camel Infra Commands (If Available) + +[source,sh] +---- +# Start Docling (if camel infra
supports it) +$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel infra run docling + +# Start Ollama (if camel infra supports it) +$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel infra run ollama +---- + +==== Option C: Manual Docker Commands + +[source,sh] +---- +# Start Docling-Serve +$ docker run -d -p 5001:5001 --name docling-serve ghcr.io/docling-project/docling-serve:latest + +# Start Ollama +$ docker run -d -p 11434:11434 --name ollama ollama/ollama:latest + +# Pull Ollama model +$ docker exec -it ollama ollama pull orca-mini +---- + +=== Step 2: Create Required Directories + +The `documents/` and `output/` directories will be created automatically when needed, but you can create them manually: + +[source,sh] +---- +$ mkdir -p documents output +---- + +**Note:** Files placed in `documents/` will be automatically processed and then **deleted** after analysis is complete. + +=== Step 3: Run the Camel Application + +[source,sh] +---- +$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel run \ + --fresh \ + --dep=camel:docling \ + --dep=camel:langchain4j-chat \ + --dep=camel:platform-http \ + --dep=dev.langchain4j:langchain4j:1.6.0 \ + --dep=dev.langchain4j:langchain4j-ollama:1.6.0 \ + --properties=application.properties \ + docling-langchain4j-rag.yaml +---- + +The application will start and listen on port 8080. + +== Usage + +=== 1. Automatic Document Analysis + +Copy a document to the `documents/` directory for processing: + +[source,sh] +---- +# Using the provided sample +$ cp sample.md documents/ + +# Or use your own document +$ cp /path/to/your/document.pdf documents/ +---- + +The system will: + +1. Detect the new file +2. Convert it to Markdown using Docling +3. Analyze it with the LLM +4. Generate a comprehensive analysis report in `output/` +5. 
**Automatically delete the source file** from `documents/` after processing + +**Example Output** (`output/sample.md_analysis.md`): + +[source,markdown] +---- +# Document Analysis Report + +**File:** sample.md +**Date:** 2025-10-14 12:30:45 + +--- + +## AI Analysis + +**Summary:** This document discusses the implementation of RAG systems... + +**Key Topics:** +- Document processing pipelines +- LLM integration patterns +- Vector embeddings and similarity search + +**Important Findings:** +- RAG improves LLM accuracy by 40% +- Hybrid search outperforms pure vector search +... + +--- + +## Full Document Content (Markdown) + +[Full converted markdown content here] +---- + +=== 2. Interactive Q&A + +Ask questions about your documents via HTTP API: + +[source,sh] +---- +$ curl -X POST http://localhost:8080/api/ask \ + -H "Content-Type: text/plain" \ + -d "What are the main topics discussed in the document?" +---- + +**Response:** + +[source,text] +---- +The document discusses three main topics: +1. RAG (Retrieval Augmented Generation) architecture +2. Document processing with Docling +3. Integration with LangChain4j for LLM orchestration +---- + +=== 3. Structured Data Extraction + +Extract tables and structured data: + +[source,sh] +---- +$ curl -X POST http://localhost:8080/api/extract \ + -H "Content-Type: application/octet-stream" \ + --data-binary "@documents/report.pdf" +---- + +**Response:** + +[source,text] +---- +**Document Type:** Financial Report + +**Key Data Fields:** +- Revenue: $1.2M (Table 1, Row 3) +- Expenses: $800K (Table 1, Row 5) +- Net Profit: $400K (calculated) + +**Tables Identified:** +1. Quarterly Financial Summary (5 rows, 4 columns) +2. Department Breakdown (8 rows, 3 columns) +... +---- + +=== 4. Health Check + +Check system status: + +[source,sh] +---- +$ curl http://localhost:8080/api/health +---- + +**Response:** + +[source,json] +---- +{ + "status": "healthy", + "components": { + "docling": { + "url": "http://localhost:5001", + "status": "configured" + }, + "ollama": { + "url": "http://localhost:11434", + "model": "orca-mini", + "status": "configured" + } + }, + "directories": { + "documents": "documents", + "output": "output" + } +} +---- + +== Configuration + +=== application.properties + +[source,properties] +---- +# Directories +documents.directory=documents +output.directory=output + +# Docling-Serve URL +docling.serve.url=http://localhost:5001 + +# Ollama Configuration +ollama.base.url=http://localhost:11434 +ollama.model.name=orca-mini + +# Server Port +camel.server.port=8080 +---- + +=== Using Different Ollama Models + +Available models: + +* **orca-mini** (this example's default) - Small, fast model, well suited to local testing +* **llama3.2** - Latest Llama model, good balance of speed and quality +* **llama3.2:1b** - Smaller, faster model +* **mistral** - Alternative high-quality model +* **phi3** - Microsoft's efficient model +* **gemma2** - Google's Gemma model + +To use a different model: + +1. Pull the model: + +[source,sh] +---- +$ docker exec -it ollama ollama pull mistral +---- + +2. Update `application.properties`: + +[source,properties] +---- +ollama.model.name=mistral +---- + +3. Restart the Camel application + +=== Using Remote Ollama Instance + +To use Ollama running on a different machine: + +[source,properties] +---- +ollama.base.url=http://remote-server:11434 +---- + +== Routes Explanation + +=== Route 1: document-analysis-workflow + +**Trigger:** New file in `documents/` directory + +**Flow:** + +1. Detect new document +2. Convert to Markdown via Docling +3. Send to LLM for analysis +4. Generate comprehensive report +5. Save to `output/` directory + +**Supported Formats:** PDF, DOCX, PPTX, HTML, MD + +=== Route 2: document-qa-api + +**Endpoint:** `POST /api/ask` + +**Description:** Answer questions about the most recent document + +**Input:** Plain text question + +**Output:** AI-generated answer based on document content + +=== Route 3: batch-summarization + +**Trigger:** Timer (configurable) + +**Description:** Process all documents in batch and generate summaries + +**Configuration:** Set `batch.delay` in application.properties (this example ships with `batch.delay=10000`; set it to -1 to disable) + +=== Route 4: health-check + +**Endpoint:** `GET /api/health` + +**Description:** System health and configuration status + +=== Route 5: extract-structured-data + +**Endpoint:** `POST /api/extract` + +**Description:** Extract tables and structured data from uploaded documents + +**Input:** Binary document data + +**Output:** AI analysis of extracted structured data + +== Advanced Usage + +=== Batch Processing + +Enable automatic batch summarization: + +[source,properties] +---- +# Run every 1 hour (3600000 ms) +batch.delay=3600000 +---- + +All documents in the `documents/` directory will be summarized periodically.
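The document selection used by the batch route (and by the Q&A route's "latest document" lookup) boils down to filtering the `documents/` directory by extension. Purely as an illustration, here is a Python sketch of that behaviour with hypothetical helper names; in the example itself this logic lives in the Groovy scripts inside `docling-langchain4j-rag.yaml`:

```python
import re
from pathlib import Path

# Extension filter mirroring the routes' include pattern: .*\.(pdf|docx|pptx|html|md)
SUPPORTED = re.compile(r".*\.(pdf|docx|pptx|html|md)$")

def find_batch_candidates(doc_dir: str) -> list:
    """Return supported documents in doc_dir, oldest first (hypothetical helper)."""
    root = Path(doc_dir)
    if not root.exists():
        return []
    files = [p for p in root.iterdir() if p.is_file() and SUPPORTED.match(p.name)]
    # the Q&A route picks the newest by modification time; the batch route takes them all
    return sorted(files, key=lambda p: p.stat().st_mtime)

def summary_prompt(markdown: str) -> str:
    """Build the same style of summarization prompt the batch route sends to the LLM."""
    return ("Please provide a concise 3-sentence summary of the following document:\n\n"
            + markdown)
```

The last entry returned by `find_batch_candidates(...)` corresponds to the document the `/api/ask` route would answer questions about.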
+ +=== Custom Document Processing + +You can extend the routes to add custom processing logic: + +[source,yaml] +---- +- route: + id: custom-processing + from: + uri: file:documents + parameters: + include: ".*\\.pdf" + steps: + # Your custom processing here + - to: docling:CONVERT_TO_HTML + - to: langchain4j-chat:custom +---- + +=== Integration with Vector Stores + +For production RAG, consider adding vector embeddings: + +[source,yaml] +---- +# Add after document conversion +- to: langchain4j-embeddings:embed +- to: your-vector-store +---- + +== Troubleshooting + +=== Docling Not Responding + +**Check Docling service:** + +[source,sh] +---- +$ docker logs docling-serve +$ curl http://localhost:5001/ +---- + +**Restart service:** + +[source,sh] +---- +$ docker restart docling-serve +---- + +=== Ollama Model Not Found + +**Pull the model:** + +[source,sh] +---- +$ docker exec -it ollama ollama pull orca-mini +---- + +**Check available models:** + +[source,sh] +---- +$ docker exec -it ollama ollama list +---- + +=== Slow Document Processing + +**Causes:** + +* Large documents (>100 pages) +* Complex layouts with many images +* Limited CPU/memory + +**Solutions:** + +* Increase the model timeout in the `chatModel` bean in `docling-langchain4j-rag.yaml` (the example hardcodes `timeout(ofSeconds(120))`) +* Use a smaller/faster model (llama3.2:1b) +* Process smaller documents first + +=== Out of Memory + +**Increase Docker memory:** + +[source,sh] +---- +# In Docker Desktop: Settings → Resources → Memory +# Recommended: 8GB or more for LLMs +---- + +== Performance Considerations + +=== Document Conversion + +* **PDF**: 1-5 seconds per page (depends on complexity) +* **DOCX**: 0.5-2 seconds per page +* **OCR-required**: 5-10 seconds per page (scanned PDFs) + +=== LLM Inference + +* **llama3.2 (3B)**: 5-15 seconds per response +* **llama3.2:1b**: 2-5 seconds per response +* **Speed depends on**: Prompt length, context size, hardware + +=== Recommended Hardware + +* **Minimum**: 8GB RAM, 4 CPU cores +* **Recommended**: 16GB RAM, 8 CPU cores, GPU (optional) + +== Security Considerations + +=== Current Implementation + +* **Development Setup** - Not production-ready +* **No Authentication** - Open HTTP endpoints +* **Local Processing** - Data stays on your machine + +=== Production Recommendations + +**1. Authentication & Authorization** + +[source,yaml] +---- +# Add to routes (use the simple language so the env placeholder is resolved) +- setHeader: + name: Authorization + simple: "Bearer ${env:API_TOKEN}" +---- + +**2. Input Validation** + +* Validate file sizes +* Check file types +* Scan for malware + +**3. Rate Limiting** + +* Implement request throttling +* Add queue management + +**4. Data Privacy** + +* Encrypt sensitive documents +* Secure API endpoints with TLS +* Implement access logging + +== Production Deployment + +=== Using Kubernetes + +[source,yaml] +---- +# See k8s-deployment.yaml (example) +apiVersion: apps/v1 +kind: Deployment +metadata: + name: docling-langchain4j-rag +spec: + replicas: 3 + ... +---- + +=== Scaling Considerations + +* **Horizontal**: Multiple Camel instances with load balancer +* **Vertical**: Increase memory/CPU for Ollama container +* **Caching**: Cache frequent document conversions + +== Cleanup + +Stop all services: + +[source,sh] +---- +# Docker Compose +$ docker compose down + +# Or manual cleanup +$ docker stop docling-serve ollama +$ docker rm docling-serve ollama +---- + +Remove volumes (optional): + +[source,sh] +---- +$ docker volume rm docling-langchain4j-rag_ollama_data +---- + +== Alternative Configurations + +=== Using OpenAI Instead of Ollama + +[source,properties] +---- +# application.properties +openai.api.key=sk-your-api-key-here +---- + +[source,yaml] +---- +# Update bean configuration (also add --dep=dev.langchain4j:langchain4j-open-ai:1.6.0) +- name: chatModel + type: "#class:dev.langchain4j.model.openai.OpenAiChatModel" + scriptLanguage: groovy + script: | + import dev.langchain4j.model.openai.OpenAiChatModel + + return OpenAiChatModel.builder() + .apiKey(context.resolvePropertyPlaceholders("{{openai.api.key}}")) + .modelName("gpt-4") + .temperature(0.3) + .build() +---- + +=== Using Cloud Docling Service + +If you have a cloud-hosted Docling service: + +[source,properties] +---- +docling.serve.url=https://your-docling-service.com +docling.auth.token=your-auth-token +---- + +== References + +* **Docling**: https://github.com/docling-project/docling +* **LangChain4j**: https://github.com/langchain4j/langchain4j +* **Ollama**: https://ollama.ai +* **Apache Camel**: https://camel.apache.org +* **Camel Components Reference**: https://camel.apache.org/components/ + +== Help and Contributions + +If you hit any problem using Camel or have some feedback, then please +https://camel.apache.org/community/support/[let us know]. + +We also love contributors, so +https://camel.apache.org/community/contributing/[get involved] :-) + +The Camel riders! diff --git a/docling-langchain4j-rag/application.properties b/docling-langchain4j-rag/application.properties new file mode 100644 index 0000000..1c416dd --- /dev/null +++ b/docling-langchain4j-rag/application.properties @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License.
+ +# Application Configuration +camel.main.name = DoclingLangChain4jRAG + +# Directory Configuration +documents.directory=documents +output.directory=output + +# Docling-Serve Configuration +# Can be run via: docker run -p 5001:5001 ghcr.io/docling-project/docling-serve:latest +# Or via: camel infra run docling (if available) +docling.serve.url=http://localhost:5001 + +# Ollama Configuration +ollama.base.url=http://localhost:11434 +ollama.model.name=orca-mini + +# Batch Processing Configuration +# Set to -1 to disable batch processing +batch.delay=10000 + +# HTTP Server Configuration +camel.server.port=8080 diff --git a/docling-langchain4j-rag/compose.yaml b/docling-langchain4j-rag/compose.yaml new file mode 100644 index 0000000..56f4d91 --- /dev/null +++ b/docling-langchain4j-rag/compose.yaml @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +services: + docling-serve: + image: ghcr.io/docling-project/docling-serve:latest + container_name: docling-serve + ports: + - "5001:5001" + restart: unless-stopped + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:5001/"] + interval: 30s + timeout: 10s + retries: 3 + + ollama: + image: ollama/ollama:latest + # named "ollama" and published on 11434 to match the README commands and application.properties + container_name: ollama + ports: + - "11434:11434" + volumes: + - ollama_data:/root/.ollama + restart: unless-stopped + healthcheck: + # the ollama image does not ship curl, so probe with the ollama CLI instead + test: ["CMD", "ollama", "list"] + interval: 30s + timeout: 10s + retries: 3 + +volumes: + ollama_data: + driver: local diff --git a/docling-langchain4j-rag/docling-langchain4j-rag.yaml b/docling-langchain4j-rag/docling-langchain4j-rag.yaml new file mode 100644 index 0000000..8004956 --- /dev/null +++ b/docling-langchain4j-rag/docling-langchain4j-rag.yaml @@ -0,0 +1,378 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Document Processing with Docling and AI Analysis with LangChain4j +# This example demonstrates RAG (Retrieval Augmented Generation) using: +# - Docling for document conversion (PDF, Word, etc.
to Markdown) +# - LangChain4j with Ollama for AI-powered document analysis + +# Bean Definitions +- beans: + # Configure Ollama Chat Model + - name: chatModel + type: "#class:dev.langchain4j.model.ollama.OllamaChatModel" + scriptLanguage: groovy + script: | + import dev.langchain4j.model.ollama.OllamaChatModel + import static java.time.Duration.ofSeconds + + return OllamaChatModel.builder() + .baseUrl("{{ollama.base.url}}") + .modelName("{{ollama.model.name}}") + .temperature(0.3) + .timeout(ofSeconds(120)) + .build() + +# Route Definitions + +# Route 1: Main RAG workflow - Convert document and analyze with AI +- route: + id: document-analysis-workflow + from: + uri: file:{{documents.directory}} + parameters: + include: ".*\\.(pdf|docx|pptx|html|md)" + noop: true + idempotent: true + steps: + - log: "Processing document: ${header.CamelFileName}" + - setProperty: + name: originalFileName + simple: "${header.CamelFileName}" + + # Convert GenericFile to file path + - setBody: + simple: "${body.file.absolutePath}" + + # Step 1: Convert document to Markdown using Docling + - log: "Converting document to Markdown with Docling..." + - to: + uri: docling:CONVERT_TO_MARKDOWN + parameters: + useDoclingServe: true + doclingServeUrl: "{{docling.serve.url}}" + contentInBody: true + - log: "Document converted to Markdown successfully" + + # Save the file path for cleanup + - setProperty: + name: sourceFilePath + simple: "${exchangeProperty.originalFileName}" + + # Step 2: Store converted content + - setProperty: + name: convertedMarkdown + simple: "${body}" + + # Step 3: Log the converted content (first 500 chars) + - script: + groovy: | + def markdown = exchange.getProperty("convertedMarkdown", String.class) + def preview = markdown.length() > 500 ? markdown.substring(0, 500) + "..." : markdown + log.info("Converted Markdown preview:\n{}", preview) + + # Step 4: Prepare AI prompt for document analysis + - setBody: + simple: | + You are a helpful document analysis assistant. 
Please analyze the following document and provide: + 1. A brief summary (2-3 sentences) + 2. Key topics and main points + 3. Any important findings or conclusions + + Document content: + ${exchangeProperty.convertedMarkdown} + + # Step 5: Send to LangChain4j Chat for AI analysis + - log: "Analyzing document with AI model..." + - to: + uri: langchain4j-chat:analysis + parameters: + chatModel: "#chatModel" + + # Step 6: Store AI analysis result + - setProperty: + name: aiAnalysis + simple: "${body}" + - log: "AI analysis completed" + + # Step 7: Create combined result (markdown + analysis) + - script: + groovy: | + def fileName = exchange.getProperty("originalFileName") + def markdown = exchange.getProperty("convertedMarkdown", String.class) + def analysis = exchange.getProperty("aiAnalysis", String.class) + def dateStr = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new java.util.Date()) + + def result = "# Document Analysis Report\n\n" + + "**File:** ${fileName}\n" + + "**Date:** ${dateStr}\n\n" + + "---\n\n" + + "## AI Analysis\n\n" + + "${analysis}\n\n" + + "---\n\n" + + "## Full Document Content (Markdown)\n\n" + + "${markdown}" + + exchange.message.setBody(result) + + # Step 8: Save combined result + - setHeader: + name: CamelFileName + simple: "${exchangeProperty.originalFileName}_analysis.md" + - to: + uri: file:{{output.directory}} + - log: "Analysis report saved: ${header.CamelFileName}" + + # Step 9: Clean up - delete the processed file from documents directory + - script: + groovy: | + import java.nio.file.Files + import java.nio.file.Paths + + def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}") + def fileName = exchange.getProperty("sourceFilePath") + def filePath = Paths.get(docDir, fileName) + + if (Files.exists(filePath)) { + Files.delete(filePath) + log.info("Cleaned up source file: {}", filePath) + } + - log: "Processing complete for: ${exchangeProperty.originalFileName}" + +# Route 2: Interactive Q&A 
about documents +- route: + id: document-qa-api + from: + uri: platform-http:/api/ask + steps: + - log: "Received question: ${body}" + - setProperty: + name: userQuestion + simple: "${body}" + + # Read the most recent document from documents directory + - script: + groovy: | + import java.nio.file.Files + import java.nio.file.Paths + import java.util.stream.Collectors + + def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}") + def docPath = Paths.get(docDir) + + if (Files.exists(docPath)) { + def latestFile = Files.list(docPath) + .filter { f -> f.toString().matches(".*\\.(pdf|docx|pptx|html|md)") } + .max { f1, f2 -> Files.getLastModifiedTime(f1).compareTo(Files.getLastModifiedTime(f2)) } + .orElse(null) + + if (latestFile != null) { + exchange.message.setBody(latestFile.toFile()) + exchange.setProperty("documentFound", true) + } else { + exchange.setProperty("documentFound", false) + exchange.message.setBody("No documents found in directory") + } + } else { + exchange.setProperty("documentFound", false) + exchange.message.setBody("Documents directory does not exist") + } + + # Convert document to markdown if found + - choice: + when: + - simple: "${exchangeProperty.documentFound} == true" + steps: + - log: "Converting document for Q&A..." + - to: + uri: docling:CONVERT_TO_MARKDOWN + parameters: + useDoclingServe: true + doclingServeUrl: "{{docling.serve.url}}" + contentInBody: true + + - setProperty: + name: documentContent + simple: "${body}" + + # Prepare RAG prompt + - setBody: + simple: | + You are a helpful assistant answering questions about documents. + + Document content: + ${exchangeProperty.documentContent} + + Question: ${exchangeProperty.userQuestion} + + Please provide a clear and concise answer based on the document content above. 
+ + - to: + uri: langchain4j-chat:qa + parameters: + chatModel: "#chatModel" + + - setHeader: + name: Content-Type + constant: "text/plain" + otherwise: + steps: + - setBody: + simple: "Error: ${body}" + - setHeader: + name: Content-Type + constant: "text/plain" + +# Route 3: Batch document summarization +- route: + id: batch-summarization + from: + uri: timer:batchSummarize + parameters: + delay: "{{batch.delay}}" + repeatCount: 0 + steps: + - log: "Starting batch document summarization..." + - script: + groovy: | + import java.nio.file.Files + import java.nio.file.Paths + + def docDir = camelContext.resolvePropertyPlaceholders("{{documents.directory}}") + def docPath = Paths.get(docDir) + + if (Files.exists(docPath)) { + def files = Files.list(docPath) + .filter { f -> f.toString().matches(".*\\.(pdf|docx|pptx|html|md)") } + .collect { it.toFile() } + + exchange.message.setBody(files) + } else { + exchange.message.setBody([]) + } + + - split: + simple: "${body}" + steps: + - log: "Summarizing: ${body}" + - setProperty: + name: currentFile + simple: "${body}" + + # Convert to markdown + - to: + uri: docling:CONVERT_TO_MARKDOWN + parameters: + useDoclingServe: true + doclingServeUrl: "{{docling.serve.url}}" + contentInBody: true + + # Generate summary + - setBody: + simple: | + Please provide a concise 3-sentence summary of the following document: + + ${body} + + - to: + uri: langchain4j-chat:summary + parameters: + chatModel: "#chatModel" + + - log: "Summary: ${body}" + +# Route 4: Health check endpoint +- route: + id: health-check + from: + uri: platform-http:/api/health + steps: + - log: "Health check requested" + - script: + groovy: | + import groovy.json.JsonOutput + + def doclingUrl = camelContext.resolvePropertyPlaceholders("{{docling.serve.url}}") + def ollamaUrl = camelContext.resolvePropertyPlaceholders("{{ollama.base.url}}") + + def health = [ + status: "healthy", + components: [ + docling: [ + url: doclingUrl, + status: "configured" + ], + ollama: [ + 
url: ollamaUrl, + model: camelContext.resolvePropertyPlaceholders("{{ollama.model.name}}"), + status: "configured" + ] + ], + directories: [ + documents: camelContext.resolvePropertyPlaceholders("{{documents.directory}}"), + output: camelContext.resolvePropertyPlaceholders("{{output.directory}}") + ] + ] + + exchange.message.setBody(JsonOutput.toJson(health)) + - setHeader: + name: Content-Type + constant: "application/json" + +# Route 5: Extract structured data from documents +- route: + id: extract-structured-data + from: + uri: platform-http:/api/extract + parameters: + httpMethodRestrict: "POST" + steps: + - log: "Extracting structured data from uploaded document" + - setProperty: + name: uploadedContent + simple: "${body}" + + # Extract as JSON with Docling + - to: + uri: docling:EXTRACT_STRUCTURED_DATA + parameters: + useDoclingServe: true + doclingServeUrl: "{{docling.serve.url}}" + outputFormat: "json" + contentInBody: true + + - setProperty: + name: structuredData + simple: "${body}" + + # Ask AI to analyze the structured data + - setBody: + simple: | + Please analyze this structured document data and identify: + 1. Document type and structure + 2. Key data fields and their values + 3. Any tables or structured information + + Structured data: + ${exchangeProperty.structuredData} + + - to: + uri: langchain4j-chat:extract + parameters: + chatModel: "#chatModel" + + - setHeader: + name: Content-Type + constant: "text/plain" diff --git a/docling-langchain4j-rag/run.sh b/docling-langchain4j-rag/run.sh new file mode 100755 index 0000000..1c43df8 --- /dev/null +++ b/docling-langchain4j-rag/run.sh @@ -0,0 +1,27 @@ +#!/bin/bash +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Run the Docling + LangChain4j RAG example + +jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel run \ + --fresh \ + --dep=camel:docling \ + --dep=camel:langchain4j-chat \ + --dep=camel:platform-http \ + --dep=dev.langchain4j:langchain4j:1.6.0 \ + --dep=dev.langchain4j:langchain4j-ollama:1.6.0 \ + --properties=application.properties \ + docling-langchain4j-rag.yaml diff --git a/docling-langchain4j-rag/sample.md b/docling-langchain4j-rag/sample.md new file mode 100644 index 0000000..5ad4111 --- /dev/null +++ b/docling-langchain4j-rag/sample.md @@ -0,0 +1,65 @@ +# Sample Document for RAG Analysis + +## Introduction to Apache Camel + +Apache Camel is an open-source integration framework based on known Enterprise Integration Patterns (EIPs). It provides a rule-based routing and mediation engine that allows developers to define routing and mediation rules in various domain-specific languages. + +## Key Features + +### 1. Routing and Mediation Engine + +Camel supports routing and mediation rules in various DSLs including: + +- Java DSL +- XML Configuration +- YAML DSL +- Groovy DSL + +### 2. Extensive Component Library + +Camel provides over 300 components for integrating with: + +- Messaging systems (JMS, Kafka, AMQP) +- Databases (JDBC, MongoDB, Cassandra) +- Cloud services (AWS, Azure, Google Cloud) +- APIs (REST, SOAP, GraphQL) + +### 3. 
Enterprise Integration Patterns + +Camel implements all EIPs from the famous book by Gregor Hohpe and Bobby Woolf: + +- Content-Based Router +- Message Filter +- Splitter and Aggregator +- Dead Letter Channel +- Wire Tap + +## AI Integration + +Camel now includes AI components for modern integration needs: + +### LangChain4j Components + +- **langchain4j-chat**: Integrate with Large Language Models +- **langchain4j-embeddings**: Generate vector embeddings +- **langchain4j-tools**: Create AI tools and agents + +### Docling Component + +The Docling component enables document processing: + +- Convert PDF, Word, PowerPoint to Markdown +- Extract structured data from documents +- Support for OCR and table extraction +- Integration with AI models for document analysis + +## Use Cases + +1. **Document Processing Pipeline**: Convert documents and analyze with AI +2. **RAG Systems**: Retrieval Augmented Generation with vector stores +3. **Intelligent Routing**: Use LLMs to make routing decisions +4. **Data Extraction**: Extract and transform unstructured data + +## Conclusion + +Apache Camel continues to evolve, now bridging traditional integration patterns with modern AI capabilities, making it an ideal choice for building intelligent integration solutions.
