Hello Airflow Community, I am starting this discussion to propose the addition of a new community provider for *Docling*.
Docling is an open-source service that specializes in high-performance document conversion. This provider aims to integrate Docling's capabilities directly into Airflow DAGs, making it easy for users to build powerful data processing and RAG (Retrieval-Augmented Generation) pipelines.

The initial Pull Request with the implementation, tests, and documentation can be found here:
https://github.com/apache/airflow/pull/55780

Provider Details:

The initial version of the provider includes the following components:

- *DoclingConvertOperator:* An operator to convert a document from a local file path.
- *DoclingConvertSourceOperator:* An operator to convert a document from a public URL.
- *DoclingHook:* A hook to manage the connection with the Docling webserver.

Justification:

As AI-native workflows become more common, the initial preparation of data is a critical step. This provider will enable data engineers and ML practitioners to:

- Seamlessly convert raw documents (like PDFs, DOCX, etc.) into clean text as a preprocessing step in their ETL/ELT processes.
- Build and automate RAG pipelines that require an initial document cleaning and text extraction stage before embedding generation (e.g., using the *Voyage AI* provider) and storage (e.g., using the *Weaviate* provider).
- Leverage Airflow's orchestration capabilities for complex document workflows without writing extensive boilerplate code.

I am looking forward to hearing your feedback, thoughts, and suggestions on this proposal.

Thank you!
--
*Arthur Raulino Kretzer*
Software Development | Sw CDM | Centro de Convergência Digital e Mecatrônica
Fundação CERTI
www.certi.org.br
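
P.S. To make the RAG use case from the justification a bit more concrete, here is a rough sketch of the convert, embed, store flow in plain Python. It does not use the provider's actual API: every name in it (`convert_document`, `embed`, `InMemoryVectorStore`, `run_pipeline`) is a hypothetical stand-in, and the conversion and embedding steps are toy placeholders for what Docling and an embedding provider would do.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database (e.g. Weaviate)."""
    rows: list = field(default_factory=list)

    def add(self, text: str, vector: list) -> None:
        self.rows.append({"text": text, "vector": vector})


def convert_document(raw: str) -> str:
    """Placeholder for the conversion stage: here it only collapses whitespace."""
    return " ".join(raw.split())


def embed(text: str) -> list:
    """Toy embedding (character count, word count); not a real embedding model."""
    return [float(len(text)), float(len(text.split()))]


def run_pipeline(raw_documents: list, store: InMemoryVectorStore) -> int:
    """Run convert -> embed -> store for each document; return how many were processed."""
    processed = 0
    for raw in raw_documents:
        clean = convert_document(raw)
        store.add(clean, embed(clean))
        processed += 1
    return processed


store = InMemoryVectorStore()
count = run_pipeline(["Quarterly   report\n\n2024", "  Design\tnotes "], store)
```

In a real DAG, each of these functions would presumably become its own task, with the conversion stage handled by one of the proposed operators rather than by a local function.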
