Hello Airflow Community,

I am starting this discussion to propose the addition of a new community
provider for *Docling*.

Docling is an open-source service that specializes in high-performance
document conversion. This provider aims to integrate Docling's capabilities
directly into Airflow DAGs, making it easy for users to build powerful data
processing and RAG (Retrieval-Augmented Generation) pipelines.

The initial Pull Request with the implementation, tests, and documentation
can be found here: https://github.com/apache/airflow/pull/55780

Provider Details:

The initial version of the provider includes the following components:

   - *DoclingConvertOperator:* An operator to convert a document from a local
     file path.
   - *DoclingConvertSourceOperator:* An operator to convert a document from a
     public URL.
   - *DoclingHook:* A hook to manage the connection with the Docling
     webserver.
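
To make the division of labour between the hook and the operators concrete,
here is a minimal sketch in plain Python, without Airflow imports. It is not
the code from the PR. The class name DoclingClientSketch, the endpoint path,
the payload shape, and the port in the usage note are all assumptions made
for illustration, so please check them against the PR and your Docling
server's API documentation.

```python
import json
from typing import Callable, Optional
from urllib import request

# Assumed endpoint path, for illustration only; verify against your server.
CONVERT_SOURCE_PATH = "/v1/convert/source"


def http_post_json(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


class DoclingClientSketch:
    """Hypothetical stand-in for the role a hook plays: it owns the server
    address and the transport, so task code only passes in a document URL."""

    def __init__(
        self,
        base_url: str,
        post: Optional[Callable[[str, dict], dict]] = None,
    ) -> None:
        self.base_url = base_url.rstrip("/")
        # The transport is injectable so the class can be tested offline.
        self.post = post or http_post_json

    def convert_source(self, document_url: str) -> dict:
        """Ask the server to convert a document reachable at a public URL."""
        # Assumed payload shape, for illustration only.
        payload = {"sources": [{"kind": "http", "url": document_url}]}
        return self.post(self.base_url + CONVERT_SOURCE_PATH, payload)
```

An operator such as DoclingConvertSourceOperator would then, roughly, wrap a
call like DoclingClientSketch("http://localhost:5001").convert_source(url)
inside its execute method and return the result for downstream tasks.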

Justification:

As AI-native workflows become more common, the initial preparation of data
is a critical step. This provider will enable data engineers and ML
practitioners to:

   - Convert raw documents (such as PDF and DOCX files) into clean text as a
     preprocessing step in their ETL/ELT processes.
   - Build and automate RAG pipelines that require an initial document
     cleaning and text extraction stage before embedding generation (e.g., using
     the *Voyage AI* provider) and storage (e.g., using the *Weaviate*
     provider).
   - Leverage Airflow's orchestration capabilities for complex document
     workflows without writing extensive boilerplate code.

I am looking forward to hearing your feedback, thoughts, and suggestions on
this proposal.

Thank you!
-- 
*Arthur Raulino Kretzer*
Software Development
CDM | Centro de Convergência Digital e Mecatrônica
Fundação CERTI