vojay-dev opened a new issue, #68416:
URL: https://github.com/apache/airflow/issues/68416
### Under which category would you file this issue?
Providers
### Apache Airflow version
3.2.2+astro.1
### What happened and how to reproduce it?
**Versions (as tested):**
- apache-airflow-providers-common-ai **0.4.0** (bug also present on `main`
as of 2026-06-11)
- llama-index-core **0.14.22**, llama-index-embeddings-openai 0.6.0
- Apache Airflow 3.2.2 (Astro Runtime 3.2-5), Python 3.13
**Summary:**
`LlamaIndexEmbeddingOperator.execute()` returns `{"chunks": [{"text",
"metadata", "vector"}], ...}`, but `vector` is always `None`. Downstream tasks
consuming the documented chunk output (e.g. inserting vectors into a vector
table) fail or silently store nulls.
**Root cause:**
The operator builds the index and then reads embeddings back off its own
local `nodes` list, relying on a side effect that doesn't exist
([`llamaindex_embedding.py` lines ~128–149 at tag
`providers-common-ai/0.4.0`](https://github.com/apache/airflow/blob/providers-common-ai/0.4.0/providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py)):
```python
# ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
# side effect of building the index; ...
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)
...
chunks = [{"text": node.text, "metadata": node.metadata, "vector":
node.embedding} for node in text_nodes]
```
But `VectorStoreIndex._get_node_with_embedding()` in llama-index-core
attaches embeddings to **copies**, never the originals:
```python
result = node.model_copy()
result.embedding = embedding
```
I checked llama-index-core tags v0.10.68, v0.11.23, v0.12.52, v0.13.5, and
0.14.22; all of them copy (older ones via `node.copy()`). So the side-effect
assumption has never held, and no version pin fixes it. The embeddings end up only inside the
index's vector store (`index.vector_store.data.embedding_dict` for
`SimpleVectorStore`, keyed by node_id).
**Minimal reproduction (no API key needed):**
```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
docs = [Document(text="hello world", metadata={"id": 1})]
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))
print(nodes[0].embedding)  # None <-- what the operator returns
print(index.vector_store.data.embedding_dict)  # {node_id: [0.5, ...]} <-- where vectors actually are
```
Or via the operator itself with any real connection: every entry in
`result["chunks"]` has `"vector": None`.
### What you think should happen instead?
**Suggested fixes (either works):**
1. After building the index, read vectors back from the store:
`index.vector_store.data.embedding_dict[node.node_id]` (works for the
`SimpleVectorStore` default; needs a fallback for stores that don't retain
`data`).
2. Pre-embed before building: call
`embed_model.get_text_embedding_batch([...])` and assign `node.embedding` on
the original nodes first. `llama_index.core.indices.utils.embed_nodes()` skips
nodes whose `.embedding` is already set, so `VectorStoreIndex` reuses them — no
duplicate API calls, and the existing return code works unchanged. (I verified
the skip behavior in 0.14.22.)
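
A minimal sketch of fix 2, written duck-typed so it runs without llama-index installed. `pre_embed_nodes`, `FakeNode`, and `FakeEmbedModel` are hypothetical names used only for illustration; the only real API assumed is `embed_model.get_text_embedding_batch(...)` and the `.text` / `.embedding` attributes on nodes. It embeds `node.text` for simplicity, which may differ from the content string llama-index itself embeds (that can include metadata), so a real patch would need to match that.

```python
from dataclasses import dataclass, field
from typing import Optional


def pre_embed_nodes(nodes, embed_model):
    """Set ``.embedding`` on the ORIGINAL nodes before the index is built.

    Duck-typed: each node needs ``.text`` and ``.embedding``; ``embed_model``
    needs ``get_text_embedding_batch(list[str]) -> list[list[float]]``.
    """
    # Only embed nodes that do not carry a vector yet.
    pending = [node for node in nodes if node.embedding is None]
    if not pending:
        return nodes
    # Assumption: node.text is the content to embed (see the note above).
    vectors = embed_model.get_text_embedding_batch([node.text for node in pending])
    if len(vectors) != len(pending):
        raise ValueError("embedding model returned an unexpected number of vectors")
    for node, vector in zip(pending, vectors):
        node.embedding = vector
    return nodes


# --- Hypothetical stand-ins so the sketch is runnable on its own ---
@dataclass
class FakeNode:
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[list] = None


class FakeEmbedModel:
    def __init__(self, dim):
        self.dim = dim
        self.calls = 0  # counts batch calls, to show nothing is embedded twice

    def get_text_embedding_batch(self, texts):
        self.calls += 1
        return [[0.5] * self.dim for _ in texts]


nodes = [FakeNode("hello world", {"id": 1}), FakeNode("second chunk", {"id": 2})]
model = FakeEmbedModel(dim=8)
pre_embed_nodes(nodes, model)

# The operator's existing return code then sees populated vectors.
chunks = [
    {"text": node.text, "metadata": node.metadata, "vector": node.embedding}
    for node in nodes
]
```

In the operator, the call would go just before `VectorStoreIndex(nodes, embed_model=embed_model, ...)`, passing the same `nodes` list that the chunk output is later built from.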
**Workaround for users:** set `persist_dir`, then load vectors downstream
via `StorageContext.from_defaults(persist_dir=...)` →
`ctx.vector_store.data.embedding_dict` +
`ctx.docstore.get_node(node_id).metadata`.
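
The workaround above can be sketched as a small helper that rebuilds the chunk list from a loaded storage context. `chunks_from_storage` and `FakeDocstore` are hypothetical names; the helper is duck-typed and only touches the attributes named in the workaround (`vector_store.data.embedding_dict`, `docstore.get_node`), plus `.text` on the returned node, which assumes text nodes.

```python
from types import SimpleNamespace


def chunks_from_storage(ctx):
    """Rebuild ``{"text", "metadata", "vector"}`` chunks from a storage context.

    Duck-typed: ``ctx.vector_store.data.embedding_dict`` maps node_id -> vector,
    and ``ctx.docstore.get_node(node_id)`` returns the matching node.
    """
    chunks = []
    for node_id, vector in ctx.vector_store.data.embedding_dict.items():
        node = ctx.docstore.get_node(node_id)
        chunks.append({"text": node.text, "metadata": node.metadata, "vector": vector})
    return chunks


# --- Hypothetical stand-in for a loaded storage context ---
class FakeDocstore:
    def __init__(self, nodes_by_id):
        self._nodes_by_id = nodes_by_id

    def get_node(self, node_id):
        return self._nodes_by_id[node_id]


fake_ctx = SimpleNamespace(
    vector_store=SimpleNamespace(
        data=SimpleNamespace(embedding_dict={"n1": [0.5] * 8})
    ),
    docstore=FakeDocstore(
        {"n1": SimpleNamespace(text="hello world", metadata={"id": 1})}
    ),
)

chunks = chunks_from_storage(fake_ctx)
```

In a downstream task, `ctx` would be the object returned by `StorageContext.from_defaults(persist_dir=...)` instead of the stand-in.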
### Operating System
_No response_
### Deployment
None
### Apache Airflow Provider(s)
common-ai
### Versions of Apache Airflow Providers
**Providers** (`pip freeze | grep apache-airflow-providers`):
```
apache-airflow-providers-celery==3.20.0
apache-airflow-providers-common-ai==0.4.0
apache-airflow-providers-common-compat==1.15.0
apache-airflow-providers-common-io==1.7.2
apache-airflow-providers-common-sql==1.30.2
apache-airflow-providers-elasticsearch==6.5.4
apache-airflow-providers-openlineage==2.17.0
apache-airflow-providers-smtp==3.0.1
apache-airflow-providers-standard==1.13.1
```
Other:
```
llama-index-core==0.14.22
llama-index-embeddings-openai==0.6.0
```
### Official Helm Chart version
Not Applicable
### Kubernetes Version
_No response_
### Helm Chart configuration
_No response_
### Docker Image customizations
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)