squakez commented on code in PR #1423:
URL: https://github.com/apache/camel-website/pull/1423#discussion_r2434711516
##########
content/blog/2025/10/camel-docling/index.md:
##########
@@ -0,0 +1,227 @@
+---
+title: "Building Intelligent Document Processing with Apache Camel: Docling meets LangChain4j"
+date: 2025-10-15
+draft: false
+authors: [ oscerd ]
+categories: ["Camel", "AI"]
+preview: "The new camel-docling component meets camel-langchain4j"
+---
+
+In the rapidly evolving landscape of AI-powered applications, the ability to process and understand documents has become increasingly crucial. Whether you're dealing with PDFs, Word documents, or PowerPoint presentations, extracting meaningful insights from unstructured data is a challenge many developers face daily.
+
+In this post, we'll explore how Apache Camel's new AI components enable developers to build sophisticated RAG (Retrieval Augmented Generation) pipelines with minimal code. We'll combine the power of Docling for document conversion with LangChain4j for AI orchestration, all orchestrated through Camel's YAML DSL.
+
+## The Challenge: Document Intelligence at Scale
+
+Companies are drowning in documents. Legal firms process contracts, healthcare providers manage medical records, and financial institutions analyze reports. The traditional approach of manual document review simply doesn't scale.
+
+So this a possible space where we could apply RAG and Apache Camel. The steps:
+
+* Convert documents from any format to structured text
+* Extract key insights and summaries
+* Answer questions about document content
+* Process documents in real-time as they arrive
+
+This is where the combination of Docling and LangChain4j shines, and Apache Camel provides the perfect integration layer to bring them together.
+
+## Meet the Components
+
+### Camel-Docling: Enterprise Document Conversion
+
+The `camel-docling` component integrates IBM's Docling library, an AI-powered document parser that can handle various formats including PDF, Word, PowerPoint, and more. What makes Docling special is its ability to preserve document structure while converting to clean Markdown, HTML, or JSON.
+
+Key features:
+
+* **Multiple Operations**: Convert to Markdown, HTML, JSON, or extract structured data
+* **Flexible Deployment**: Works with both CLI and API (docling-serve) modes
+* **Content Control**: Return content directly in the message body or as file paths
+* **OCR Support**: Handle scanned documents with optical character recognition
+
+### Camel-LangChain4j: AI Orchestration Made Simple
+
+The `camel-langchain4j-chat` component provides seamless integration with Large Language Models through the LangChain4j framework. It supports various LLM providers including OpenAI, Ollama, and more.
+
+Perfect for:
+
+* Document analysis and summarization
+* Question-answering systems
+* Content generation
+* RAG implementations
+
+## Building a RAG Pipeline with YAML
+
+Let's walk through a complete example that demonstrates the power of combining these components. Our goal is to create a system that automatically processes documents, analyzes them with AI, and generates comprehensive reports: a classic example.
+
+### Architecture Overview
+
+The flow is straightforward:
+
+1. Watch a directory for new documents
+2. Convert documents to Markdown using Docling
+3. Send the converted content to an LLM for analysis
+4. Generate a comprehensive analysis report
+5. Clean up processed files
+
+All of this is defined declaratively in YAML, making it easy to understand and modify.
+
+### Setting Up the Infrastructure
+
+First, we need our services running. Thanks to camel infra command, this is pretty simple:
+
+```shell
+# Start Docling (if camel infra supports it)
+$ jbang -Dcamel.jbang.version=4.16.0-SNAPSHOT camel@apache/camel infra run docling

Review Comment:
   @gansheer fyi

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
