The GitHub Actions job "Tests (AMD)" on airflow.git/aip99-doc-loader has failed. Run started by GitHub user vikramkoka (triggered by vikramkoka).
Head commit for run:
5e1c8abea8ed7b75f0adcc5cc56bff90820e570e / Vikram Koka <[email protected]>

Add DocumentLoaderOperator to common.ai provider

- Adds DocumentLoaderOperator, a framework-agnostic file parser that bridges Airflow's connectivity layer (hooks returning bytes/files) and the AI embedding layer (operators needing list[dict(text, metadata)]). No LlamaIndex, LangChain, or other AI framework dependency.
- Built-in parsers for .txt, .md, .csv, .json with zero extra deps. PDF (via pypdf, BSD) and DOCX (via python-docx, MIT) available as optional extras: pip install apache-airflow-providers-common-ai[pdf] / [docx].
- Supports two input modes: source_path (local file, directory, or glob pattern) and source_bytes (raw bytes from XCom). Output is list[dict(text, metadata)], the same shape consumed by downstream embedding operators.

Motivation

File parsing is the highest-volume gap in Airflow's AI story. Every RAG pipeline on Airflow currently requires custom parsing code. This operator makes it a single line in a Dag.
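
For readers unfamiliar with the list[dict(text, metadata)] shape described above, here is a minimal standard-library sketch of the source_bytes mode. It is not the operator's code: the function name parse_bytes, its parameters, the way a CSV row or JSON item is rendered as text, and the "row"/"index" metadata keys are all assumptions made for illustration. Only the overall output shape, the one-document-per-CSV-row behavior, and the single-object/array JSON handling come from the commit message.

```python
import csv
import io
import json


def parse_bytes(source_bytes, file_type, file_name="inline", extra_metadata=None):
    """Hypothetical helper: turn raw bytes into a list of {"text", "metadata"} dicts."""
    text = source_bytes.decode("utf-8")
    # Metadata shared by every document produced from this input.
    base = {"file_name": file_name, **(extra_metadata or {})}

    if file_type in ("txt", "md"):
        # Plain text and Markdown: the whole file becomes one document.
        return [{"text": text, "metadata": dict(base)}]

    if file_type == "csv":
        # One document per data row; the "key: value" rendering is an assumption.
        rows = csv.DictReader(io.StringIO(text))
        return [
            {
                "text": ", ".join(f"{k}: {v}" for k, v in row.items()),
                "metadata": {**base, "row": i},
            }
            for i, row in enumerate(rows)
        ]

    if file_type == "json":
        # A single object yields one document; an array yields one per element.
        data = json.loads(text)
        items = data if isinstance(data, list) else [data]
        return [
            {"text": json.dumps(item), "metadata": {**base, "index": i}}
            for i, item in enumerate(items)
        ]

    raise ValueError(f"Unsupported file_type: {file_type!r}")
```

Calling parse_bytes(b"name,age\nAda,36\n", "csv", file_name="people.csv") under these assumptions returns one dict per data row, each carrying a "text" string and a "metadata" dict that includes file_name.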

What's included

┌────────────────────────────────────┬───────────────────────────────────────────┐
│ File                               │ Purpose                                   │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ operators/document_loader.py       │ Operator (~270 lines)                     │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ tests/.../test_document_loader.py  │ 26 unit tests                             │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ docs/operators/document_loader.rst │ Usage docs                                │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ provider.yaml                      │ Operator registration + how-to-guide link │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ pyproject.toml                     │ [pdf] and [docx] optional dependencies    │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ docs/operators/index.rst           │ Chooser table row                         │
└────────────────────────────────────┴───────────────────────────────────────────┘

Test plan

- uv run --project providers/common/ai pytest providers/common/ai/tests/unit/common/ai/operators/test_document_loader.py -xvs (26 tests)
- Built-in parsers: txt, md, csv (one doc per row), json (single object and array)
- PDF/DOCX parsers: mocked via sys.modules injection (packages not installed in test env)
- ImportError guidance when optional packages are missing
- Init validation: mutual exclusion of source_path/source_bytes, file_type required with source_bytes
- File discovery: glob patterns, extension filtering, empty directories
- Output shape: every item has text and metadata, file_name/file_path in metadata, custom metadata_fields merged

Report URL: https://github.com/apache/airflow/actions/runs/26041307998

With regards,
GitHub Actions via GitBox

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
