Re: [PR] fix: handle empty gremlin examples when building index [hugegraph-ai]

via GitHub Wed, 06 May 2026 06:28:46 -0700


imbajin commented on code in PR #323:
URL: https://github.com/apache/hugegraph-ai/pull/323#discussion_r3195771529



##########
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py:
##########
@@ -38,14 +38,24 @@ def __init__(
         self.vector_index = vector_index
 
     def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
-        # !: We have assumed that self.example is not empty
+        if not self.examples:
+            self.vector_index.clean(self.vector_index_name)
+            context["embed_dim"] = 0
+            return context
+
         queries = [example["query"] for example in self.examples]
         # TODO: refactor function chain async to avoid blocking
         examples_embedding = 
asyncio.run(get_embeddings_parallel(self.embedding, queries))
+
+        if not examples_embedding:
+            self.vector_index.clean(self.vector_index_name)

Review Comment:
   ‼️ **This branch silently hides embedding failures and destroys the existing 
index.**
   
   ### When this branch is actually reached
   
   Control only gets here when `self.examples` is **non-empty** but 
`get_embeddings_parallel(...)` returns `[]`. For a non-empty query list, an 
empty embedding result is almost never a legitimate business outcome — it 
typically indicates:
   
   - The embedding provider is unreachable (network / DNS / timeout),
   - Authentication / quota failure (bad API key, rate limit, exhausted credit),
   - A bug in the embedding client returning an empty list instead of raising.
   
   In other words, this is an **error signal**, not an empty-input case.
   
   ### Why the current handling is wrong
   
   The code currently:
   
   1. Silently calls `clean()`, deleting any previously built index, **and**
   2. Returns `embed_dim=0` as if the operation succeeded.
   
   From the caller's perspective, the workflow looks successful. But on disk, a 
working `gremlin_examples` index has just been wiped, and the user has no way 
to tell something went wrong.
   
   ### Suggested fix
   
   At minimum, do not `clean()`. Prefer one of the two options below:
   
   **Option A — log loudly and skip (preserves existing index):**
   
   ```suggestion
           if not examples_embedding:
               log.error(
                   "Embedding service returned empty result for %d examples; 
skipping index build",
                   len(self.examples),
               )
               context["embed_dim"] = 0
               return context
   ```
   
   **Option B — fail fast, which is what the previous `IndexError` was 
accidentally doing:**
   
   ```python
           if not examples_embedding:
               raise RuntimeError(
                   f"Embedding service returned empty result for 
{len(self.examples)} examples"
               )
   ```
   
   Either is acceptable; the key point is that a silent `clean()` here turns a 
recoverable failure into irrecoverable data loss.



##########
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py:
##########
@@ -38,14 +38,24 @@ def __init__(
         self.vector_index = vector_index
 
     def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
-        # !: We have assumed that self.example is not empty
+        if not self.examples:
+            self.vector_index.clean(self.vector_index_name)
+            context["embed_dim"] = 0
+            return context
+
         queries = [example["query"] for example in self.examples]
         # TODO: refactor function chain async to avoid blocking
         examples_embedding = 
asyncio.run(get_embeddings_parallel(self.embedding, queries))
+
+        if not examples_embedding:
+            self.vector_index.clean(self.vector_index_name)
+            context["embed_dim"] = 0
+            return context
+
         embed_dim = len(examples_embedding[0])
-        if len(self.examples) > 0:
-            vector_index = self.vector_index.from_name(embed_dim, 
self.vector_index_name)
-            vector_index.add(examples_embedding, self.examples)
-            vector_index.save_index_by_name(self.vector_index_name)
+        vector_index = self.vector_index.from_name(embed_dim, 
self.vector_index_name)
+        vector_index.add(examples_embedding, self.examples)
+        vector_index.save_index_by_name(self.vector_index_name)
+
         context["embed_dim"] = embed_dim
-        return context
+        return context

Review Comment:
   ⚠️ **Missing newline at end of file.**
   
   The diff shows `\ No newline at end of file` on the final `return context` 
line. This violates PEP 8 (rule W292) and will likely trip the project's 
pre-commit / lint hooks. Please add a trailing newline.



##########
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py:
##########
@@ -38,14 +38,24 @@ def __init__(
         self.vector_index = vector_index
 
     def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
-        # !: We have assumed that self.example is not empty
+        if not self.examples:
+            self.vector_index.clean(self.vector_index_name)

Review Comment:
   ‼️ **Calling `clean()` on empty examples is a destructive side effect that 
the PR description does not mention.**
   
   ### Why this is a problem
   
   In this repository, `vector_index.clean(name)` is the API used to **destroy 
an existing index on disk**. Look at how it is used elsewhere:
   
   - `utils/graph_index_utils.py:49-50` — used to wipe `graph_vids` and 
`gremlin_examples` when the user explicitly asks for a reset.
   - `utils/vector_index_utils.py:77` — used to wipe the `chunks` index before 
a rebuild.
   
   So the intent of `clean()` is always "tear down", never "skip this step".
   
   ### What this PR actually does
   
   Previously, if a caller accidentally passed `examples=[]`, `run()` would 
raise `IndexError` — noisy, but non-destructive. With this change, the same 
call path will now:
   
   1. Return `embed_dim=0` silently, and
   2. **Delete the existing `gremlin_examples` index on disk.**
   
   A user who has already built a working index and then re-triggers the 
workflow with an empty example list will lose their index without any warning. 
The PR title says *"handle empty gremlin examples when building index"*, which 
readers will reasonably interpret as a no-op guard, not a reset.
   
   ### Suggested fix
   
   Drop the `clean()` call; a pure early return preserves existing state and 
still fixes the `IndexError`:
   
   ```suggestion
           if not self.examples:
               context["embed_dim"] = 0
               return context
   ```
   
   If the intent really is "empty input should reset the index", please:
   1. State this explicitly in the PR description so reviewers and downstream 
users are aware, and
   2. Confirm no upstream caller (e.g. `operators/operator_list.py:69`) can 
reach this path with an unintended empty list.



##########
hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_gremlin_example_index.py:
##########
@@ -34,7 +34,7 @@ def node_init(self):
         # pylint: disable=import-outside-toplevel
         from hugegraph_llm.utils.vector_index_utils import 
get_vector_index_class
 
-        if not self.wk_input.examples:
+        if self.wk_input.examples is None:

Review Comment:
   ‼️ **`not` → `is None` relaxes the check and contradicts the error message.**
   
   ### The behavior change
   
   - **Before:** `if not self.wk_input.examples:` — both `None` and `[]` are 
rejected, caller gets `CStatus(-1, "examples is required...")`.
   - **After:** `if self.wk_input.examples is None:` — only `None` is rejected; 
`[]` is accepted and flows through to the operator.
   
   ### Why this matters (interacts with the operator change)
   
   Combined with the new empty-examples branch in the operator, the effective 
path for `wk_input.examples = []` is now:
   
   1. `node_init` accepts the empty list (no longer returns `CStatus(-1)`).
   2. The operator's `run()` enters its new early-return branch.
   3. That branch calls `self.vector_index.clean("gremlin_examples")` — 
**silently wiping the user's existing index.**
   
   So an empty list, which the error message still claims is invalid 
(*"examples is required in BuildGremlinExampleIndexNode"*), in fact now acts as 
an implicit *reset* command. The validation message and the runtime behavior 
contradict each other.
   
   ### Suggested resolution — pick one
   
   - **Keep the stricter check** (recommended unless there is an explicit 
reason to accept `[]`). Revert this line to `if not self.wk_input.examples:` 
and, per the operator-side comments, remove the `clean()` calls.
   - **Intentionally allow empty list as a reset command.** Then update the 
error message to something like *"examples cannot be None"*, document this 
behavior in the PR description, and keep (some form of) the operator-side reset 
logic — but make the reset explicit and logged, not silent.



##########
hugegraph-llm/src/tests/operators/index_op/test_build_gremlin_example_index.py:
##########
@@ -54,7 +54,10 @@ def test_init(self):
         self.assertEqual(self.index_builder.vector_index_name, 
"gremlin_examples")
 
     @patch('asyncio.run')
-    @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
+    @patch(
+        
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',

Review Comment:
   🧹 **Patch target fix is correct — nice catch.**
   
   Rebinding the patch from 
`hugegraph_llm.utils.embedding_utils.get_embeddings_parallel` to 
`hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel`
 is the right move. The rule of thumb for `unittest.mock.patch` is *"patch 
where the name is looked up, not where it is defined"* — since the operator 
does `from hugegraph_llm.utils.embedding_utils import get_embeddings_parallel`, 
the function is bound to the operator module's namespace at import time, and 
that's the binding the test must replace.
   
   The previous patch target looked plausible but never actually intercepted 
the call, so the test was implicitly relying on `asyncio.run` being mocked to 
avoid real work. Using `new_callable=MagicMock` here is also correct — it 
prevents the mock from being treated as an awaitable and suppresses the 
"coroutine was never awaited" warning. No change needed to this line.



##########
hugegraph-llm/src/tests/operators/index_op/test_build_gremlin_example_index.py:
##########
@@ -78,7 +81,10 @@ def test_run_with_examples(self, 
mock_get_embeddings_parallel, mock_asyncio_run)
         self.assertEqual(context["embed_dim"], 3)
 
     @patch('asyncio.run')
-    @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
+    @patch(
+        
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
+        new_callable=MagicMock,

Review Comment:
   ⚠️ **This test locks in the destructive behavior as a contract; please 
reconsider after the operator changes.**
   
   ### What this test currently asserts
   
   ```python
   mock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
   mock_vector_store_class.from_name.assert_not_called()
   # ...
   ```
   
   This effectively turns "empty examples → delete the on-disk index" into a 
frozen contract enforced by CI. If the operator-side comments are accepted and 
`clean()` is removed, this assertion must flip:
   
   ```suggestion
           mock_vector_store_class.clean.assert_not_called()
   ```
   
   ### Missing coverage
   
   The two early-return branches in `run()` (empty `examples` vs. empty 
`examples_embedding`) are semantically distinct — the second one implies an 
embedding-service failure, not an empty-input case. Please add a dedicated 
test, e.g. `test_run_with_non_empty_examples_but_empty_embeddings`, that:
   
   - Constructs the builder with non-empty `examples`,
   - Makes `asyncio.run(...)` return `[]`,
   - Asserts the chosen error/logging behavior (raise, or log + skip),
   - Asserts `clean()` is **not** called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: handle empty gremlin examples when building index [hugegraph-ai]

Reply via email to