vojay-dev opened a new issue, #68416:
URL: https://github.com/apache/airflow/issues/68416

   ### Under which category would you file this issue?
   
   Providers
   
   ### Apache Airflow version
   
   3.2.2+astro.1
   
   ### What happened and how to reproduce it?
   
   **Versions (as tested):**
   - apache-airflow-providers-common-ai **0.4.0** (bug also present on `main` 
as of 2026-06-11)
   - llama-index-core **0.14.22**, llama-index-embeddings-openai 0.6.0
   - Apache Airflow 3.2.2 (Astro Runtime 3.2-5), Python 3.13
   
   **Summary:**
   `LlamaIndexEmbeddingOperator.execute()` returns `{"chunks": [{"text", 
"metadata", "vector"}], ...}`, but `vector` is always `None`. Downstream tasks 
consuming the documented chunk output (e.g. inserting vectors into a vector 
table) fail or silently store nulls.
   
   **Root cause:**
   The operator builds the index and then reads embeddings back off its own 
local `nodes` list, relying on a side effect that doesn't exist 
([`llamaindex_embedding.py` lines ~128–149 at tag 
`providers-common-ai/0.4.0`](https://github.com/apache/airflow/blob/providers-common-ai/0.4.0/providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py)):
   
   ```python
   # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
   # side effect of building the index; ...
   index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)
   ...
   chunks = [{"text": node.text, "metadata": node.metadata, "vector": node.embedding} for node in text_nodes]
   ```
   
   But `VectorStoreIndex._get_node_with_embedding()` in llama-index-core 
attaches embeddings to **copies**, never the originals:
   
   ```python
   result = node.model_copy()
   result.embedding = embedding
   ```
   
   I checked llama-index-core tags v0.10.68, v0.11.23, v0.12.52, v0.13.5, and
0.14.22; all of them copy the node (older ones via `node.copy()`). The
side-effect assumption has therefore never held, and no version pin fixes it.
The embeddings end up only inside the index's vector store
(`index.vector_store.data.embedding_dict` for `SimpleVectorStore`, keyed by
`node_id`).
   
   **Minimal reproduction (no API key needed):**
   
   ```python
   from llama_index.core import Document, VectorStoreIndex
   from llama_index.core.node_parser import SentenceSplitter
   from llama_index.core.embeddings.mock_embed_model import MockEmbedding
   
   docs = [Document(text="hello world", metadata={"id": 1})]
   nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
   index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))
   
   print(nodes[0].embedding)                      # None  <-- what the operator returns
   print(index.vector_store.data.embedding_dict)  # {node_id: [0.5, ...]}  <-- where vectors actually are
   ```
   
   Running the operator itself against any real connection shows the same
result: every entry in `result["chunks"]` has `"vector": None`.
   
   ### What you think should happen instead?
   
   **Suggested fixes (either works):**
   1. After building the index, read vectors back from the store: 
`index.vector_store.data.embedding_dict[node.node_id]` (works for the 
`SimpleVectorStore` default; needs a fallback for stores that don't retain 
`data`).
   2. Pre-embed before building: call 
`embed_model.get_text_embedding_batch([...])` and assign `node.embedding` on 
the original nodes first. `llama_index.core.indices.utils.embed_nodes()` skips 
nodes whose `.embedding` is already set, so `VectorStoreIndex` reuses them — no 
duplicate API calls, and the existing return code works unchanged. (I verified 
the skip behavior in 0.14.22.)
   
   **Workaround for users:** set `persist_dir`, then load vectors downstream 
via `StorageContext.from_defaults(persist_dir=...)` → 
`ctx.vector_store.data.embedding_dict` + 
`ctx.docstore.get_node(node_id).metadata`.
   
   ### Operating System
   
   _No response_
   
   ### Deployment
   
   None
   
   ### Apache Airflow Provider(s)
   
   common-ai
   
   ### Versions of Apache Airflow Providers
   
   **Providers** (`pip freeze | grep apache-airflow-providers`):
   ```
   apache-airflow-providers-celery==3.20.0
   apache-airflow-providers-common-ai==0.4.0
   apache-airflow-providers-common-compat==1.15.0
   apache-airflow-providers-common-io==1.7.2
   apache-airflow-providers-common-sql==1.30.2
   apache-airflow-providers-elasticsearch==6.5.4
   apache-airflow-providers-openlineage==2.17.0
   apache-airflow-providers-smtp==3.0.1
   apache-airflow-providers-standard==1.13.1
   ```
   
   Other:
   ```
   llama-index-core==0.14.22
   llama-index-embeddings-openai==0.6.0
   ```
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

