The GitHub Actions job "Publish Docs to S3" on airflow.git/main has failed.
Run started by GitHub user vatsrahul1001 (triggered by vatsrahul1001).

Head commit for run:
b64c302b2e9ab8d6810494551862b3b423b2b6d3 / Vikram Koka <[email protected]>
Add LlamaIndex operators to common.ai provider (#67121)

* Add LlamaIndex operators to common.ai provider

  - Adds LlamaIndexHook to bridge Airflow connections to LlamaIndex's Settings
    singleton. Reuses the pydanticai connection type, supports separate
    embedding and LLM connections.
  - Adds EmbeddingOperator to chunk documents and produce embedding vectors via
    LlamaIndex's SentenceSplitter. Input is list[dict(text, metadata)]
    (same shape as DocumentLoaderOperator output), output includes chunks with
    vectors ready for downstream vector store ingest operators (pgvector,
    Pinecone, Weaviate).
  - Adds RetrievalOperator to load a persisted LlamaIndex index and perform
    similarity search. Output is scored chunks ready for synthesis via
    LLMOperator.

  Design notes

  All LlamaIndex imports are lazy (inside execute() / method bodies), so
  modules parse without llama-index installed. The hook currently hardcodes
  OpenAI embedding/LLM providers; a follow-up PR will refactor to use
  BaseAIHook for provider-agnostic model resolution when it lands.
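
  The lazy-import pattern described above can be sketched roughly as follows.
  This is a toy illustration, not the provider's code: OptionalFeatureError,
  LazyEmbeddingTask and the module name are hypothetical, and a real operator
  would use a plain "from llama_index.core import ..." statement inside
  execute() rather than importlib.

```python
import importlib


class OptionalFeatureError(RuntimeError):
    """Hypothetical stand-in for an 'optional dependency missing' exception."""


class LazyEmbeddingTask:
    """Toy operator: the optional dependency is imported only inside execute()."""

    # Hypothetical module name, chosen so the import fails in this sketch.
    optional_module = "optional_vendor_sdk_not_installed"

    def execute(self) -> str:
        try:
            # The import happens at call time, not at module parse time.
            sdk = importlib.import_module(self.optional_module)
        except ImportError as exc:
            raise OptionalFeatureError(
                f"{self.optional_module} is required; install the matching extra"
            ) from exc
        return sdk.__name__


task = LazyEmbeddingTask()  # defining and constructing never touches the SDK
try:
    task.execute()
except OptionalFeatureError as err:
    print(f"clean failure: {err}")
```

  Because the import sits inside execute(), the module that defines the class
  can still be parsed when the optional package is absent, and the failure
  only surfaces when the task actually runs.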

  What's included

  
  ┌─────────────────────────────────────────┬──────────────────────────────────────────┐
  │                  File                   │                 Purpose                  │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ hooks/llamaindex.py                     │ Hook (~110 lines)                        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ operators/llamaindex_embedding.py       │ EmbeddingOperator (~110 lines)           │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ operators/llamaindex_retrieval.py       │ RetrievalOperator (~90 lines)            │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex.py            │ 12 hook tests                            │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex_embedding.py  │ 10 operator tests                        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex_retrieval.py  │ 8 operator tests                         │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/hooks/llamaindex.rst               │ Hook docs                                │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/llamaindex_embedding.rst │ EmbeddingOperator docs                   │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/llamaindex_retrieval.rst │ RetrievalOperator docs                   │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ provider.yaml                           │ Integration, hook, operator registration │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/index.rst                          │ LlamaIndex Hook in Guides toctree        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/index.rst                │ Chooser table rows                       │
  └─────────────────────────────────────────┴──────────────────────────────────────────┘

  Test plan

  - uv run --project providers/common/ai pytest
    providers/common/ai/tests/unit/common/ai/hooks/test_llamaindex.py -xvs
    (12 tests)
  - uv run --project providers/common/ai pytest
    providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_embedding.py
    providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_retrieval.py
    -xvs (18 tests)
  - Hook: init defaults, separate embed_conn_id, connection kwargs extraction,
    embedding model, LLM, Settings configuration
  - EmbeddingOperator: output shape, chunking, index persistence, vector
    inclusion/omission, splitter params
  - RetrievalOperator: output shape, chunk keys, top_k forwarding, multiple
    results, storage context

  ---
  Was generative AI tooling used to co-author this PR?

  - Yes — Claude Code (Opus 4.6)

  Generated-by: Claude Code (Opus 4.6) following
  https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions

* Refactor LlamaIndex hook + operators: no Settings mutation, BYO models, cloud URIs

Same playbook as #67192 (LangChain) and #67120 (DocumentLoader) plus
three LlamaIndex-specific architectural fixes:

Critical fixes
- Stop mutating LlamaIndex's global ``Settings`` singleton. The previous
  ``LlamaIndexHook.configure_settings()`` wrote ``Settings.embed_model``
  / ``Settings.llm`` process-wide, which leaks across concurrent tasks
  in the same worker. Replaced with per-call ``embed_model=`` /
  ``llm=`` parameters on ``VectorStoreIndex(...)`` and
  ``load_index_from_storage(...)``.
- Own ``llamaindex`` connection type instead of squatting on
  ``pydanticai``. Mirrors the LangChain / CrewAI fix.
- Remove ``documents`` from ``EmbeddingOperator.template_fields``.
  ``list[dict]`` doesn't survive Jinja stringification, and worse, a
  user document containing literal ``{{ var.value.api_key }}`` would
  leak secrets into the embedding store. Bind via ``loader.output``
  instead.
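
The leak described in the first fix can be illustrated with a toy model. This
is only a sketch under stated assumptions: ``GlobalSettings`` and both
``build_index_*`` functions are hypothetical stand-ins, not LlamaIndex APIs,
and the two "tasks" are simulated sequentially rather than run concurrently.

```python
class GlobalSettings:
    """Hypothetical stand-in for a process-wide settings singleton."""

    embed_model = None


def build_index_global(docs):
    # Reads whatever the singleton holds at call time.
    return {"docs": docs, "embed_model": GlobalSettings.embed_model}


def build_index_explicit(docs, *, embed_model):
    # The model travels with the call, so no shared state is involved.
    return {"docs": docs, "embed_model": embed_model}


# Task A configures the singleton, then task B (same worker process)
# overwrites it before A gets to build its index.
GlobalSettings.embed_model = "model-for-task-a"
GlobalSettings.embed_model = "model-for-task-b"
leaked = build_index_global(["doc from task A"])

# With a per-call parameter, task A gets the model it asked for.
isolated = build_index_explicit(["doc from task A"], embed_model="model-for-task-a")

print(leaked["embed_model"])    # model-for-task-b
print(isolated["embed_model"])  # model-for-task-a
```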

BYO embedding/LLM for non-OpenAI vendors
- LlamaIndex doesn't ship an ``init_chat_model`` / ``init_embedding_model``
  equivalent (verified in
  ``llama_index.core.embeddings.utils.resolve_embed_model``
  -- only ``"default"`` / ``"local"`` / ``"clip:"`` dispatch). The hook
  therefore covers OpenAI (matching LlamaIndex's own
  ``resolve_embed_model("default")`` behaviour) and operators accept a
  pre-built ``BaseEmbedding`` / ``LLM`` instance to bypass the hook for
  Cohere / Bedrock / Vertex / HuggingFace / etc.

Cloud-URI persistence
- ``EmbeddingOperator.persist_dir`` and
  ``RetrievalOperator.index_persist_dir`` accept storage URIs
  (``s3://``, ``gs://``, ``azure://``) resolved via
  ``ObjectStoragePath`` and fsspec, matching the merged
  ``DocumentLoaderOperator`` pattern.

Hook plumbing playbook (mirrors LangChain / CrewAI / DocumentLoader)
- ``conn_type = "llamaindex"`` + new ``connection-types`` entry in
  ``provider.yaml`` with ``embed_model`` / ``llm_model`` conn-fields.
- ``default_conn_name`` resolves at runtime via
  ``llm_conn_id: str | None = None``.
- ``_resolve_model`` honours ``conn.extra_dejson`` for parity with the
  sibling hooks (swallows ``JSONDecodeError``, applies secret masking).
- ``get_ui_field_behaviour`` added.
- ``[llamaindex]`` extra in ``pyproject.toml`` pinning
  ``llama-index-core``, ``llama-index-embeddings-openai``,
  ``llama-index-llms-openai`` (enough to back the hook's default
  OpenAI return values). Same in the ``dev`` group.
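
A minimal sketch of the runtime ``default_conn_name`` resolution mentioned
above. ``ExampleHook`` and ``CustomHook`` are hypothetical; the fallback of
``embed_conn_id`` to the LLM connection is an assumption made for
illustration, not something this commit message states.

```python
from typing import Optional


class ExampleHook:
    """Hypothetical hook; only the conn-id resolution is sketched."""

    default_conn_name = "llamaindex_default"

    def __init__(
        self,
        llm_conn_id: Optional[str] = None,
        embed_conn_id: Optional[str] = None,
    ) -> None:
        # None means "decide at runtime", so a subclass that overrides
        # default_conn_name is still honoured.
        self.llm_conn_id = llm_conn_id or self.default_conn_name
        # Assumption: embedding credentials fall back to the LLM connection.
        self.embed_conn_id = embed_conn_id or self.llm_conn_id


class CustomHook(ExampleHook):
    default_conn_name = "my_team_default"  # hypothetical override


print(ExampleHook().llm_conn_id)                    # llamaindex_default
print(CustomHook().llm_conn_id)                     # my_team_default
print(ExampleHook(llm_conn_id="prod").llm_conn_id)  # prod
```

Had the signature hard-coded ``llm_conn_id="llamaindex_default"`` instead of
``None``, the subclass override in this sketch would be silently ignored.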

Misc operator/test fixes
- Wrap lazy ``llama_index`` imports with
  ``AirflowOptionalProviderFeatureException`` so missing extras surface
  cleanly.
- ``RetrievalOperator`` returns ``{"query": ..., "chunks": [...]}``
  (was ``"question"``) and ``chunks[*].node_id`` (was the misleading
  ``"source"`` key).
- ``RetrievalOperator`` raises ``FileNotFoundError`` with a "did you
  run EmbeddingOperator first?" hint when ``index_persist_dir`` is
  missing.
- All three test files get an autouse fixture stubbing
  ``llama_index.*`` in ``sys.modules`` so ``@patch`` resolves without
  ``llama-index-*`` packages installed in CI's non-DB test env
  (mirrors apache/airflow#67237).
- New ``example_llamaindex_hook.py`` with ``[START howto_*]`` markers
  for the docs to ``exampleinclude``.
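
The ``sys.modules`` stubbing mentioned above can be sketched with the
standard library alone. This is an illustration, not the fixture from the
test files: the package name ``fake_vendor_sdk`` and the ``load_index``
function are hypothetical, and ``patch.dict`` stands in for the autouse
pytest fixture.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# Hypothetical package name; it is not installed, so importing it would fail.
PKG = "fake_vendor_sdk"


def make_stub_modules() -> dict:
    """Build a parent/child module pair that can be dropped into sys.modules."""
    parent = types.ModuleType(PKG)
    core = types.ModuleType(f"{PKG}.core")
    core.load_index = MagicMock(name="load_index")
    parent.core = core  # attribute lookup must find the submodule too
    return {PKG: parent, f"{PKG}.core": core}


def run_retrieval() -> str:
    """Code under test: imports lazily, as an operator's execute() would."""
    from fake_vendor_sdk.core import load_index

    return load_index("/tmp/index")


# patch.dict restores sys.modules on exit, like a fixture teardown would.
with patch.dict(sys.modules, make_stub_modules()):
    with patch(f"{PKG}.core.load_index", return_value="stub-index") as fake:
        result = run_retrieval()
        fake.assert_called_once_with("/tmp/index")

print(result)              # stub-index
print(PKG in sys.modules)  # False
```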

* Rename LlamaIndex operators with framework prefix; fold in #67189 RAG examples

Per Kaxil's review r3267387604: ``RetrievalOperator`` / ``EmbeddingOperator``
are too generic in the common.ai namespace -- they risk colliding when
other frameworks add their own embedding/retrieval operators. Renamed
both with the LlamaIndex prefix:

- ``EmbeddingOperator`` -> ``LlamaIndexEmbeddingOperator``
- ``RetrievalOperator`` -> ``LlamaIndexRetrievalOperator``

Renames applied across the two operator modules, three docs RSTs, the
two test files, both example DAGs, and the cross-refs in
``docs/operators/index.rst``, ``docs/hooks/llamaindex.rst``,
``docs/operators/document_loader.rst``, and ``docs/hooks/index.rst``.

Folds in #67189 (``example_llamaindex_rag.py``) which would otherwise
sit blocked waiting for this PR to merge. Rewritten for the new API:

- Uses the renamed classes
- Drops ``documents="{{ ti.xcom_pull(...) }}"`` Jinja templating
  (template_fields removed; bind via ``loader.output`` directly)
- Switches LlamaIndex operators to ``llamaindex_default`` conn (was
  ``pydanticai_default``); the synthesis-step ``LLMOperator`` keeps
  ``pydanticai_default`` because it's pydantic-ai-backed (different
  framework, intentional split documented in the module docstring)
- Adds explicit ``embed_model="text-embedding-3-small"`` to every
  embedding/retrieval call (new operator validation requires it)
- Fixes the string-reference task chains (``load >> "build_index"`` ->
  ``load >> build_index``) which weren't valid task dependencies

Closes #67189.

* Address code-review findings on LlamaIndex operators

- Fix ObjectStoragePath conn_id mangling: pass raw URI to LlamaIndex
  persist_dir= and supply target.fs separately. str(target) returns
  s3://<conn_id>@<bucket>/..., which fsspec misinterprets.
- Add documents / embed_model / embed_conn_id to template_fields so
  XComArg resolution fires. The previous "list[dict] doesn't survive
  stringification" rationale was wrong; Templater unwraps resolvables
  before Jinja.
- Default llm_conn_id to None on both operators; LlamaIndexHook
  resolves to default_conn_name at runtime. Hard-coding
  "llamaindex_default" undid the hook's careful runtime resolution.
- Add embed_conn_id pass-through for separate embedding credentials.
- Replace isinstance(str) duck-typing with hasattr-based BaseEmbedding
  check; raise TypeError with a clear pointer instead of letting an
  unresolved XComArg or random object explode later.
- Hoist 'import os' and 'from pathlib import Path' to module top.
- Pad RST title underlines and refresh docs/tests to match the new
  surface.
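
Why the s3://<conn_id>@<bucket>/... form is a problem can be seen with plain
URL parsing. This sketch uses only urllib from the standard library; the
bucket name, conn_id and the naive_bucket helper are hypothetical, and it
does not claim to reproduce what fsspec does internally.

```python
from urllib.parse import urlsplit

raw_uri = "s3://my-bucket/indexes/docs"              # what the user configured
mangled = "s3://aws_default@my-bucket/indexes/docs"  # conn_id embedded as userinfo


def naive_bucket(uri: str) -> str:
    """Treat everything between '//' and the next '/' as the bucket name."""
    return urlsplit(uri).netloc


print(naive_bucket(raw_uri))  # my-bucket
print(naive_bucket(mangled))  # aws_default@my-bucket
```

A consumer that is unaware of the conn_id convention sees a different bucket
name in the second case, which is why the raw URI is passed through and the
filesystem object is supplied separately.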

* Fix mypy on LlamaIndex embedding operator

- Pass persist_dir as a typed str arg to _persist so the existing
  None-narrowing # type: ignore comments can go away.
- Cast SentenceSplitter nodes to list[TextNode] for the .text access:
  the splitter only ever returns TextNode, but the base
  get_nodes_from_documents signature is typed as list[BaseNode].
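
The cast in the second item can be sketched as below. The BaseNode and
TextNode classes and the get_nodes_from_documents function here are
simplified, hypothetical stand-ins for the LlamaIndex types of the same
names; only the typing pattern is the point.

```python
from typing import cast


class BaseNode:
    """Hypothetical base type: no .text attribute declared."""


class TextNode(BaseNode):
    def __init__(self, text: str) -> None:
        self.text = text


def get_nodes_from_documents(docs: list[str]) -> list[BaseNode]:
    # Declared broadly, although this implementation only ever builds TextNode.
    return [TextNode(d) for d in docs]


# cast() changes nothing at runtime; it only tells the type checker that
# every element is a TextNode, so the .text access below type-checks.
nodes = cast(list[TextNode], get_nodes_from_documents(["alpha", "beta"]))
texts = [n.text for n in nodes]
print(texts)  # ['alpha', 'beta']
```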

* Install llama-index in tests instead of stubbing sys.modules

llama-index-core / -embeddings-openai / -llms-openai were declared in
the common.ai provider's dev dependency group but missing from uv.lock,
so CI never actually installed them. The tests papered over that by
faking out llama_index.* in sys.modules with MagicMocks.

Refresh uv.lock so the packages get installed, then drop the
sys.modules manipulation:

- test_llamaindex.py: remove the autouse _stub_llama_index_modules
  fixture entirely; @patch resolves against the real modules.
- test_llamaindex_embedding.py / test_llamaindex_retrieval.py: replace
  the _stub_li fixture (sys.modules setitem) with a smaller _li fixture
  that uses monkeypatch.setattr against real llama_index.core symbols.

* Apply ruff lint/format fixes

---------

Co-authored-by: Kaxil Naik <[email protected]>

Report URL: https://github.com/apache/airflow/actions/runs/26216508080

With regards,
GitHub Actions via GitBox

