Hi Arthur,

Similar to https://lists.apache.org/thread/48qvz77cm9nrwnyf9921qo0v2wm05dx3 , we would appreciate it if you could give us a little time to get back to you.
We are certainly interested in this contribution, but we would like some time to get Airflow 3.1.0 released and then discuss the newly proposed Provider Governance model <https://lists.apache.org/thread/qrv0j4dxp2yg09gds40vh49dhkbrj5q9>.

Regards,
Kaxil

On Wed, 17 Sept 2025 at 15:29, Arthur Raulino Kretzer <[email protected]> wrote:

> Hello Airflow Community,
>
> I am starting this discussion to propose the addition of a new community
> provider for *Docling*.
>
> Docling is an open-source service that specializes in high-performance
> document conversion. This provider aims to integrate Docling's capabilities
> directly into Airflow DAGs, making it easy for users to build powerful data
> processing and RAG (Retrieval-Augmented Generation) pipelines.
>
> The initial Pull Request with the implementation, tests, and documentation
> can be found here: https://github.com/apache/airflow/pull/55780
>
> Provider Details:
>
> The initial version of the provider includes the following components:
>
> - *DoclingConvertOperator:* An operator to convert a document from a local
>   file path.
> - *DoclingConvertSourceOperator:* An operator to convert a document from a
>   public URL.
> - *DoclingHook:* A hook to manage the connection with the Docling
>   webserver.
>
> Justification:
>
> As AI-native workflows become more common, the initial preparation of data
> is a critical step. This provider will enable data engineers and ML
> practitioners to:
>
> - Seamlessly convert raw documents (like PDFs, DOCX, etc.) into clean text
>   as a preprocessing step in their ETL/ELT processes.
> - Build and automate RAG pipelines that require an initial document
>   cleaning and text extraction stage before embedding generation (e.g.,
>   using the *Voyage AI* provider) and storage (e.g., using the *Weaviate*
>   provider).
> - Leverage Airflow's orchestration capabilities for complex document
>   workflows without writing extensive boilerplate code.
>
> I am looking forward to hearing your feedback, thoughts, and suggestions on
> this proposal.
>
> Thank you!
> --
> *Arthur Raulino Kretzer*
> Software Development | Sw
> CDM | Centro de Convergência Digital e Mecatrônica
> Fundação CERTI
> [email protected]
> (48) 9926-3500
> www.certi.org.br <https://www.certi.org.br/>
