kaxil opened a new pull request, #68491: URL: https://github.com/apache/airflow/pull/68491
Fixes #68416. `LlamaIndexEmbeddingOperator.execute()` documents `chunks[*].vector` as the embedding for each chunk, but every chunk came back with `vector: None`. The operator relied on `VectorStoreIndex(nodes, ...)` setting `.embedding` on the nodes it was given as a side effect -- but [`_get_node_with_embedding()`](https://github.com/run-llama/llama_index/blob/v0.14.22/llama-index-core/llama_index/core/indices/vector_store/base.py#L139-L150) attaches embeddings to `model_copy()` copies, never the originals (same behavior from v0.10 through v0.14.22, so no version pin avoids it). The vectors only ever existed inside the index's vector store. The fix embeds the original nodes explicitly before any index work: - `embed_model.get_text_embedding_batch()` over `node.get_content(metadata_mode=MetadataMode.EMBED)` -- the exact content [`embed_nodes()`](https://github.com/run-llama/llama_index/blob/v0.14.22/llama-index-core/llama_index/core/indices/utils.py#L120-L147) embeds (includes node metadata, respects `excluded_embed_metadata_keys`), so the vectors are identical to what `VectorStoreIndex` produced internally before. - `VectorStoreIndex` is now built only when `persist_dir` is set. `embed_nodes()` skips nodes whose `.embedding` is already set, so persisting reuses the vectors instead of re-calling the embedding API (pinned by a test that counts embed calls through the real llama-index persist path). Without `persist_dir`, the index was built and immediately discarded. ## Design rationale Pre-embedding beats the alternative of reading vectors back from `index.vector_store.data.embedding_dict`: that couples the operator to `SimpleVectorStore` internals (`.data` doesn't exist on other stores) and still builds a throwaway index when not persisting. Pre-embedding only uses the public `BaseEmbedding` API. The duck-type check in `_resolve_embed_model` now requires `get_text_embedding_batch` instead of `get_text_embedding`, since that's what `execute()` calls. It's a concrete method on `BaseEmbedding` (subclasses override `_get_text_embeddings`, not the public method), so every real embedding class passes the check across the supported llama-index range. <img width="1600" height="1000" alt="image" src="https://github.com/user-attachments/assets/931379cd-9cf2-43d0-9c27-e44e384f52bb" /> <img width="1600" height="1000" alt="image" src="https://github.com/user-attachments/assets/372ecb27-10ea-4d29-8c63-ac8fb2c38e74" /> closes https://github.com/apache/airflow/pull/68488 closes https://github.com/apache/airflow/pull/68434 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
