kaxil opened a new pull request, #68491:
URL: https://github.com/apache/airflow/pull/68491

   Fixes #68416.
   
   `LlamaIndexEmbeddingOperator.execute()` documents `chunks[*].vector` as the 
embedding for each chunk, but every chunk came back with `vector: None`. The 
operator relied on `VectorStoreIndex(nodes, ...)` setting `.embedding` on the 
nodes it was given as a side effect -- but 
[`_get_node_with_embedding()`](https://github.com/run-llama/llama_index/blob/v0.14.22/llama-index-core/llama_index/core/indices/vector_store/base.py#L139-L150)
 attaches embeddings to `model_copy()` copies, never the originals (same 
behavior from v0.10 through v0.14.22, so no version pin avoids it). The vectors 
only ever existed inside the index's vector store.
   
   The fix embeds the original nodes explicitly before any index work:
   
   - `embed_model.get_text_embedding_batch()` over 
`node.get_content(metadata_mode=MetadataMode.EMBED)` -- the exact content 
[`embed_nodes()`](https://github.com/run-llama/llama_index/blob/v0.14.22/llama-index-core/llama_index/core/indices/utils.py#L120-L147)
 embeds (includes node metadata, respects `excluded_embed_metadata_keys`), so 
the vectors are identical to what `VectorStoreIndex` produced internally before.
   - `VectorStoreIndex` is now built only when `persist_dir` is set. 
`embed_nodes()` skips nodes whose `.embedding` is already set, so persisting 
reuses the vectors instead of re-calling the embedding API (pinned by a test 
that counts embed calls through the real llama-index persist path). Without 
`persist_dir`, the index was built and immediately discarded.
   
   ## Design rationale
   
   Pre-embedding beats the alternative of reading vectors back from 
`index.vector_store.data.embedding_dict`: that couples the operator to 
`SimpleVectorStore` internals (`.data` doesn't exist on other stores) and still 
builds a throwaway index when not persisting. Pre-embedding only uses the 
public `BaseEmbedding` API.
   
   The duck-type check in `_resolve_embed_model` now requires 
`get_text_embedding_batch` instead of `get_text_embedding`, since that's what 
`execute()` calls. It's a concrete method on `BaseEmbedding` (subclasses 
override `_get_text_embeddings`, not the public method), so every real 
embedding class passes the check across the supported llama-index range.
   
   <img width="1600" height="1000" alt="image" 
src="https://github.com/user-attachments/assets/931379cd-9cf2-43d0-9c27-e44e384f52bb";
 />
   <img width="1600" height="1000" alt="image" 
src="https://github.com/user-attachments/assets/372ecb27-10ea-4d29-8c63-ac8fb2c38e74";
 />
   
   closes https://github.com/apache/airflow/pull/68488
   closes https://github.com/apache/airflow/pull/68434


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to