[GH] (airflow/main): Workflow run "Update constraints on push for main (only when uv.lock changes)" is working again!

GitBox Wed, 20 May 2026 13:16:03 -0700


The GitHub Actions job "Update constraints on push for main (only when uv.lock 
changes)" on airflow.git/main has succeeded.
Run started by GitHub user kaxil (triggered by kaxil).


Head commit for run:
eec2f75e556d9339e4fb059c07d3f356fe8b4984 / Vikram Koka <[email protected]>
Add `DocumentLoaderOperator` to `common.ai` provider (#67120)

* Add DocumentLoaderOperator to common.ai provider

  - Adds DocumentLoaderOperator, a framework-agnostic file parser that bridges
    Airflow's connectivity layer (hooks returning bytes/files) and the AI
    embedding layer (operators needing list[dict(text, metadata)]). No
    LlamaIndex, LangChain, or other AI framework dependency.
  - Built-in parsers for .txt, .md, .csv, .json with zero extra deps. PDF (via
    pypdf, BSD) and DOCX (via python-docx, MIT) available as optional extras:
    pip install apache-airflow-providers-common-ai[pdf] / [docx].
  - Supports two input modes: source_path (local file, directory, or glob
    pattern) and source_bytes (raw bytes from XCom). Output is
    list[dict(text, metadata)], the same shape consumed by downstream embedding
    operators.
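
To make the output shape concrete, here is a minimal sketch of a CSV parser that emits one document per row as list[dict(text, metadata)]. The helper name `parse_csv_bytes` and the exact metadata keys are assumptions for illustration; the "k: v, k: v" flattening and the empty-cell skip follow the behaviour described later in this message, but this is not the operator's actual code.

```python
import csv
import io


def parse_csv_bytes(data: bytes, file_name: str, encoding: str = "utf-8") -> list[dict]:
    """Hypothetical helper: one document per CSV row, empty cells skipped."""
    reader = csv.DictReader(io.StringIO(data.decode(encoding)))
    documents = []
    for row_index, row in enumerate(reader):
        # Flatten the row to "k: v, k: v"; drop empty cells and overflow columns.
        text = ", ".join(
            f"{key}: {value}" for key, value in row.items() if key is not None and value
        )
        if not text:
            continue  # nothing to embed for a fully empty row
        documents.append(
            {"text": text, "metadata": {"file_name": file_name, "row_index": row_index}}
        )
    return documents
```

Each item carries a `text` string and a `metadata` dict, which is the shape the downstream embedding operators are said to consume.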

  Motivation

  File parsing is the highest-volume gap in Airflow's AI story.
  Every RAG pipeline on Airflow currently requires custom parsing code. This
  operator makes it a single line in a Dag.

  What's included

  
  ┌────────────────────────────────────┬───────────────────────────────────────────┐
  │                File                │                  Purpose                  │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ operators/document_loader.py       │ Operator (~270 lines)                     │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ tests/.../test_document_loader.py  │ 26 unit tests                             │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ docs/operators/document_loader.rst │ Usage docs                                │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ provider.yaml                      │ Operator registration + how-to-guide link │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ pyproject.toml                     │ [pdf] and [docx] optional dependencies    │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ docs/operators/index.rst           │ Chooser table row                         │
  └────────────────────────────────────┴───────────────────────────────────────────┘

  Test plan

  - uv run --project providers/common/ai pytest
    providers/common/ai/tests/unit/common/ai/operators/test_document_loader.py -xvs
    (26 tests)
  - Built-in parsers: txt, md, csv (one doc per row), json (single object and
    array)
  - PDF/DOCX parsers: mocked via sys.modules injection (packages not installed
    in test env)
  - ImportError guidance when optional packages are missing
  - Init validation: mutual exclusion of source_path/source_bytes, file_type
    required with source_bytes
  - File discovery: glob patterns, extension filtering, empty directories
  - Output shape: every item has text and metadata, file_name/file_path in
    metadata, custom metadata_fields merged
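
The sys.modules injection mentioned in the test plan can be sketched roughly as follows. The function names `extract_pdf_text` and `run_with_fake_pypdf` are hypothetical, the error raised on a missing package is a plain `RuntimeError` rather than the Airflow exception the commit uses, and the `io.BytesIO` hand-off reflects the later refactor described further down. Only `pypdf.PdfReader`, `reader.pages`, and `page.extract_text()` are real pypdf names.

```python
import io
import sys
from unittest import mock


def extract_pdf_text(data: bytes) -> list[str]:
    """Hypothetical helper: import pypdf lazily and return the text of each page."""
    try:
        import pypdf
    except ImportError as exc:
        raise RuntimeError(
            "PDF support needs: pip install apache-airflow-providers-common-ai[pdf]"
        ) from exc
    reader = pypdf.PdfReader(io.BytesIO(data))
    return [page.extract_text() for page in reader.pages]


def run_with_fake_pypdf(data: bytes):
    """Run the helper against a fake pypdf module injected into sys.modules."""
    fake_page = mock.MagicMock()
    fake_page.extract_text.return_value = "hello"
    fake_pypdf = mock.MagicMock()
    fake_pypdf.PdfReader.return_value.pages = [fake_page]
    # patch.dict restores sys.modules on exit, so no fake leaks into other tests.
    with mock.patch.dict(sys.modules, {"pypdf": fake_pypdf}):
        texts = extract_pdf_text(data)
    return texts, fake_pypdf
```

Because the `import pypdf` statement consults `sys.modules` first, the fake is picked up whether or not the real package is installed in the test environment.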

* Addressed PR feedback

Addressed Kaxil's feedback on the PR. Thank you, Kaxil.

- Remove source_bytes from template_fields (Jinja breaks bytes)
- Use `is not None` validation instead of truthiness checks
- Raise FileNotFoundError when no files match source_path
- Normalize file_extensions filter to case-insensitive
- Fix temp file leak when write fails before try block
- Return unquoted text for JSON string primitives
- Use AirflowOptionalProviderFeatureException for missing extras
- Document DOCX paragraph-only extraction limitation
- Rewrite XCom docs example to @task pattern for source_bytes
- Update tests for all behavioral changes (30 tests pass)

* Refactor DocumentLoaderOperator: streams, encoding, JSON shape, skip rules

Rebases onto main to recover the 0.3.0 release entries that were rolled
back on the original branch, and applies the feedback surfaced by the
user-side review.

Operator
- Replace the temp-file dance for PDF/DOCX bytes with in-memory streams.
  ``pypdf.PdfReader`` and ``docx.Document`` both accept binary streams, so
  ``source_bytes`` now goes through ``io.BytesIO`` directly. No more
  ``NamedTemporaryFile(delete=False)`` + ``os.unlink``.
- Add ``encoding`` and ``encoding_errors`` parameters for non-UTF-8 input
  (Windows-1252 CSVs, files with a leading byte-order mark, ...). Failed
  decodes raise a ``ValueError`` that includes the offending file path so
  directory-mode runs are diagnosable.
- Add ``json_text_field``: when set, the named key on each JSON item
  becomes the embedding text and every other key lands in ``metadata``.
  When unset, JSON dicts are flattened to ``"k: v, k: v"`` (matches the
  CSV parser) instead of being dumped back to JSON syntax tokens.
- Directory-mode ``source_path`` now silently ignores files whose name
  starts with ``.`` (``.DS_Store``, editor swap files, ``.gitkeep``) and
  skips unknown-extension files with a warning rather than crashing on
  the first stray file.
- ``glob.glob(source_path, recursive=True)`` so ``**`` patterns walk
  subdirectories (the docs already advertised this).
- Auto-extracted metadata (``file_name``, ``file_path``, ``row_index``,
  ``item_index``, ``page_number``) now takes precedence over
  ``metadata_fields`` with the same key (via ``setdefault``).
- Expanded ``template_fields`` to include ``file_type``,
  ``file_extensions``, ``parser`` so they can be driven from Jinja.
- Hoisted ``AirflowOptionalProviderFeatureException`` import to the
  module top so the lazy ``pypdf`` / ``docx`` blocks are 2 lines each.
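
As an illustration of the JSON rules and the metadata precedence listed above, here is a minimal sketch. The helper name `parse_json_bytes` is hypothetical, and details such as whether a single object gets an `item_index`, or how non-text JSON keys interact with auto-extracted keys, are assumptions rather than the operator's confirmed behaviour.

```python
import json


def parse_json_bytes(data, file_name, json_text_field=None, metadata_fields=None):
    """Hypothetical helper mirroring the JSON rules described above."""
    payload = json.loads(data)
    items = payload if isinstance(payload, list) else [payload]
    documents = []
    for item_index, item in enumerate(items):
        # Auto-extracted metadata is written first so later setdefault calls cannot replace it.
        metadata = {"file_name": file_name, "item_index": item_index}
        if isinstance(item, dict) and json_text_field is not None and json_text_field in item:
            # The named key becomes the text; every other key lands in metadata.
            text = str(item[json_text_field])
            for key, value in item.items():
                if key != json_text_field:
                    metadata.setdefault(key, value)
        elif isinstance(item, dict):
            # No text field chosen: flatten to "k: v, k: v" like the CSV parser.
            text = ", ".join(f"{key}: {value}" for key, value in item.items())
        elif isinstance(item, str):
            text = item  # string primitives are returned unquoted
        else:
            text = json.dumps(item)
        # User-supplied metadata_fields only fill keys that are not already set.
        for key, value in (metadata_fields or {}).items():
            metadata.setdefault(key, value)
        documents.append({"text": text, "metadata": metadata})
    return documents
```

Using `setdefault` rather than `dict.update` is what makes the auto-extracted keys win when a user passes a colliding key such as `file_name`.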

Docs
- Switched all inline ``code-block:: python`` snippets to
  ``exampleinclude::`` directives pointing at a new
  ``example_document_loader.py`` (basic, directory, bytes,
  ``json_text_field`` patterns), matching the convention every other
  operator in this provider uses.
- New sections documenting encoding handling, metadata precedence, and
  the directory-mode skip rules (files whose name starts with a ``.`` /
  unknown-extension warn-and-skip).
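
The directory-mode skip rules could look roughly like the sketch below. The function name `discover_files`, the `SUPPORTED_EXTENSIONS` set, and the non-recursive walk are assumptions made for illustration, not the operator's actual implementation.

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)

# Assumed set of known extensions, taken from the formats named in this message.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".pdf", ".docx"}


def discover_files(directory: Path) -> list[Path]:
    """Hypothetical helper: pick parseable files from a directory."""
    selected = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        if path.name.startswith("."):
            continue  # .DS_Store, .gitkeep, swap files: ignored without a log line
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            # Unknown extension: warn and move on instead of failing the whole run.
            log.warning("Skipping %s: unsupported extension %r", path, path.suffix)
            continue
        selected.append(path)
    return selected
```

Lower-casing the suffix before the lookup keeps the filter case-insensitive, in line with the earlier review fix.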

Tests
- Dropped the tautological ``test_template_fields`` that just round-
  tripped the class attribute; replaced with a behavioural check
  confirming the templated fields are actually in the templated set.
- New coverage for: dot-prefixed-name skip, unknown-extension warn +
  skip, ``encoding`` / ``encoding_errors``, ``json_text_field``, JSON
  dict flatten, CSV empty-cell skip, ``metadata_fields`` precedence
  (auto wins), recursive ``**`` glob.
- PDF/DOCX bytes tests assert the library was called with a
  ``BytesIO``, locking in the no-temp-file behaviour.
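
For the ``encoding`` / ``encoding_errors`` coverage, a minimal sketch of the decode step is shown here. The function name `read_text` and the wording of the error message are assumptions; the point is only that the raised ``ValueError`` names the offending file.

```python
from pathlib import Path


def read_text(path: Path, encoding: str = "utf-8", encoding_errors: str = "strict") -> str:
    """Hypothetical helper: decode a file and report its path when decoding fails."""
    raw = path.read_bytes()
    try:
        return raw.decode(encoding, errors=encoding_errors)
    except UnicodeDecodeError as exc:
        # Re-raise with the path so a failure inside a directory run can be traced.
        raise ValueError(f"Could not decode {path} as {encoding}: {exc}") from exc
```

A Windows-1252 file would be read with ``encoding="cp1252"``, and a file with a leading byte-order mark with ``encoding="utf-8-sig"``; both are standard Python codec names.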

* Add cloud-storage URI support, no-chunking note, and format roadmap

Addresses three follow-ups from the post-rewrite review (after #67120's
initial refactor landed in 8f3aee40f0):

1. Cloud storage URIs via ObjectStoragePath
- ``source_path`` now accepts any URI ObjectStoragePath resolves through
  fsspec (``s3://``, ``gs://``, ``azure://``, ``file://``). Falls back to
  the existing ``pathlib`` + ``glob`` code path for bare local paths so no
  existing behaviour changes.
- New ``source_conn_id`` parameter to point at the Airflow connection
  that holds the cloud credentials (``aws_default``, ``google_cloud_default``,
  ...). Templated so it can be set per-DAG-run.
- Parsers stay polymorphic over ``Path`` / ``ObjectStoragePath`` -- both
  expose ``read_bytes``, ``open``, ``name``, ``suffix`` so the existing
  read paths work unchanged.
- Cross-directory globs in cloud URIs are explicitly not supported in
  this version; ``source_path`` accepts a single object or a directory.
  Documented.

2. Loader-not-chunker made explicit
- Operator docstring and new "No chunking" docs section make it clear
  the operator parses files into documents but never splits them. The
  right chunking strategy depends on the embedding model, so it stays
  in the downstream operator's hands (LlamaIndex EmbeddingOperator,
  LangChain text splitters, ...).

3. Format coverage roadmap
- New docs section enumerates the formats deferred to follow-ups
  (.pptx, .epub, .xlsx, .html, image OCR, audio transcription), each
  behind its own optional extra, so reviewers and users see the scope
  choice explicitly rather than guessing what's missing.

Tests
- New ``TestCloudUriDispatch`` class covering: single-object URI returns
  one document, directory URI iterates children, neither-file-nor-dir
  URI raises with a clear error. ObjectStoragePath is mocked so the
  tests don't touch real cloud storage.
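
The idea that parsers stay polymorphic over ``Path`` / ``ObjectStoragePath`` can be sketched with a structural type. Everything here is illustrative: `ReadablePath`, `FakeRemotePath`, and `load_document` are hypothetical names, and `FakeRemotePath` is a stand-in for a cloud object, not ``ObjectStoragePath`` itself.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class ReadablePath(Protocol):
    """The subset of the path interface a parser in this sketch relies on."""

    @property
    def name(self) -> str: ...

    @property
    def suffix(self) -> str: ...

    def read_bytes(self) -> bytes: ...


@dataclass
class FakeRemotePath:
    """Stand-in for a remote object, used so no real cloud storage is touched."""

    uri: str
    payload: bytes

    @property
    def name(self) -> str:
        return self.uri.rsplit("/", 1)[-1]

    @property
    def suffix(self) -> str:
        return Path(self.name).suffix

    def read_bytes(self) -> bytes:
        return self.payload


def load_document(path: ReadablePath) -> dict:
    """Hypothetical parser: works for any object exposing name and read_bytes."""
    text = path.read_bytes().decode("utf-8")
    return {"text": text, "metadata": {"file_name": path.name}}
```

Because the parser only touches attributes both path kinds expose, the same function serves a local ``pathlib.Path`` and a remote stand-in without branching.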

Other ecosystems compared (LangChain BaseLoader + per-format classes;
LlamaIndex BaseReader / SimpleDirectoryReader with fsspec; OpenAI /
Anthropic / pydantic-ai don't have document-loader abstractions and
delegate parsing to the model) -- this commit closes the remaining gap
vs LlamaIndex on cloud storage and matches the LangChain naming /
output-shape convention.

* Drop test-only ``__str__`` overrides that tripped CI mypy

CI's MyPy providers job flagged the `mock.__str__ = lambda ...` and
`mock.__str__.return_value = ...` patterns in TestCloudUriDispatch with
``[method-assign]`` -- mypy treats `__str__` as a real method that
shouldn't be reassigned at the instance level, even on a MagicMock.

The tests only assert on `file_name`, the dispatched call args, and text
content; they never check `metadata.file_path` (which is what `str(path)`
would feed). Removing the overrides keeps the assertions intact and
lets mypy pass.
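
A minimal sketch of a path mock that needs no ``__str__`` override is shown below. The factory name `make_fake_object_path` and the choice of configured attributes are assumptions, not the code in TestCloudUriDispatch.

```python
from unittest import mock


def make_fake_object_path(file_name: str, payload: bytes) -> mock.MagicMock:
    """Hypothetical factory for a path-like mock used in dispatch tests."""
    fake = mock.MagicMock()
    # Assign after construction: MagicMock(name=...) would name the mock itself
    # instead of setting a .name attribute.
    fake.name = file_name
    fake.is_file.return_value = True
    fake.read_bytes.return_value = payload
    return fake
```

``MagicMock`` already provides a default ``__str__``, so ``str(fake)`` still returns a string; tests that only assert on the file name, call arguments, and text content do not need to reassign it.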

---------

Co-authored-by: Kaxil Naik <[email protected]>

Report URL: https://github.com/apache/airflow/actions/runs/26185936062

With regards,
GitHub Actions via GitBox


