This is an automated email from the ASF dual-hosted git repository.

jin pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-hugegraph-ai.git
The following commit(s) were added to refs/heads/main by this push:
     new aa83ec2  fix(llm): align regex extraction of json to json format of prompt (#211)

aa83ec2 is described below

commit aa83ec2a8596ff86ded04e154f32e38252de0574
Author: John <thesp...@qq.com>
AuthorDate: Tue Apr 22 11:25:02 2025 +0800

    fix(llm): align regex extraction of json to json format of prompt (#211)

    See #210

    Main change of regex: matching `(\[.*])` -> matching `({.*})`.

    tested models:
    - qwen-max
    - qwen-plus
    - deepseek-v3

    ---------

    Co-authored-by: imbajin <j...@apache.org>
---
 .asf.yaml                                      |  1 +
 .github/workflows/hugegraph-python-client.yml  |  2 +-
 hugegraph-llm/README.md                        | 26 ++++-----
 .../operators/llm_op/property_graph_extract.py | 61 ++++++++++++----------
 4 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index e92e30b..9fa40a0 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -58,6 +58,7 @@ github:
     - HJ-Young
     - afterimagex
     - returnToInnocence
+    - Thespica

  # refer https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Notificationsettingsforrepositories
  notifications:
diff --git a/.github/workflows/hugegraph-python-client.yml b/.github/workflows/hugegraph-python-client.yml
index e05708a..60c84dd 100644
--- a/.github/workflows/hugegraph-python-client.yml
+++ b/.github/workflows/hugegraph-python-client.yml
@@ -20,7 +20,7 @@ jobs:
       - name: Prepare HugeGraph Server Environment
         run: |
           docker run -d --name=graph -p 8080:8080 -e PASSWORD=admin hugegraph/hugegraph:1.3.0
-          sleep 5
+          sleep 10

       - uses: actions/checkout@v4

diff --git a/hugegraph-llm/README.md b/hugegraph-llm/README.md
index 1714f08..0251f79 100644
--- a/hugegraph-llm/README.md
+++ b/hugegraph-llm/README.md
@@ -8,12 +8,12 @@ This project includes runnable demos, it can also be used as a third-party libra

 As we know, graph systems can help large models address challenges like timeliness and hallucination,
 while large models can help graph systems with cost-related issues.
-With this project, we aim to reduce the cost of using graph systems, and decrease the complexity of
+With this project, we aim to reduce the cost of using graph systems and decrease the complexity of
 building knowledge graphs. This project will offer more applications and integration solutions for
 graph systems and large language models.
 1. Construct knowledge graph by LLM + HugeGraph
 2. Use natural language to operate graph databases (Gremlin/Cypher)
-3. Knowledge graph supplements answer context (GraphRAG -> Graph Agent)
+3. Knowledge graph supplements answer context (GraphRAG → Graph Agent)

 ## 2. Environment Requirements
 > [!IMPORTANT]
@@ -24,7 +24,7 @@ graph systems and large language models.
 ## 3. Preparation

 1. Start the HugeGraph database, you can run it via [Docker](https://hub.docker.com/r/hugegraph/hugegraph)/[Binary Package](https://hugegraph.apache.org/docs/download/download/).
-   Refer to detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
+   Refer to a detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
 2. Configuring the poetry environment, Use the official installer to install Poetry, See the [poetry documentation](https://poetry.pythonlang.cn/docs/#installing-with-pipx) for other installation methods
    ```bash
    curl -sSL https://install.python-poetry.org | python3 - # install the latest version like 2.0+
    ```
-2. Clone this project
+3. Clone this project
    ```bash
    git clone https://github.com/apache/incubator-hugegraph-ai.git
    ```
-3. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
+4. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
    ```bash
    cd ./incubator-hugegraph-ai/hugegraph-llm
    poetry config --list # List/check the current configuration (Optional)
    ...
    poetry shell # use 'exit' to leave the shell
    ```
    If `poetry install` fails or too slow due to network issues, it is recommended to modify `tool.poetry.source` of `hugegraph-llm/pyproject.toml`
-4. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
+5. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
    ```bash
    cd ./src
    ```
-5. Start the gradio interactive demo of **Graph RAG**, you can run with the following command, and open http://127.0.0.1:8001 after starting
+6. Start the gradio interactive demo of **Graph RAG**, you can run with the following command and open http://127.0.0.1:8001 after starting
    ```bash
    python -m hugegraph_llm.demo.rag_demo.app # same as "poetry run xxx"
    ```
    ...
    python -m hugegraph_llm.demo.rag_demo.app --host 127.0.0.1 --port 18001
    ```
-6. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
+7. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
    You can modify the content on the web page, and it will be automatically saved to the configuration file after the corresponding feature is triggered.
    You can also modify the file directly without restarting the web application; refresh the page to load your latest changes.
    (Optional)To regenerate the config file, you can use `config.generate` with `-u` or `--update`.
    ```bash
    python -m hugegraph_llm.config.generate --update
    ```
    Note: `Litellm` support multi-LLM provider, refer [litellm.ai](https://docs.litellm.ai/docs/providers) to config it
-7. (__Optional__) You could use
+8. (__Optional__) You could use
    [hugegraph-hubble](https://hugegraph.apache.org/docs/quickstart/hugegraph-hubble/#21-use-docker-convenient-for-testdev)
    to visit the graph data, could run it via [Docker/Docker-Compose](https://hub.docker.com/r/hugegraph/hubble)
-   for guidance. (Hubble is a graph-analysis dashboard include data loading/schema management/graph traverser/display).
-8. (__Optional__) offline download NLTK stopwords
+   for guidance. (Hubble is a graph-analysis dashboard that includes data loading/schema management/graph traverser/display).
+9. (__Optional__) offline download NLTK stopwords
    ```bash
    python ./hugegraph_llm/operators/common_op/nltk_helper.py
    ```
 > [!TIP]
-> You can also refer our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧
+> You can also refer to our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧

 ## 4 Examples

@@ -124,7 +124,7 @@ This can be obtained from the `LLMs` class.
    )
    ```

-2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.
+2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema, or an extraction result. The method `print_result` can be chained to print the result.
    ```python
    # Import schema from a HugeGraph instance
    builder.import_schema(from_hugegraph="xxx").print_result()
diff --git a/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
index 945fd30..faff1c6 100644
--- a/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
+++ b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
@@ -26,7 +26,6 @@ from hugegraph_llm.document.chunk_split import ChunkSplitter
 from hugegraph_llm.models.llms.base import BaseLLM
 from hugegraph_llm.utils.log import log

-
 """
 TODO: It is not clear whether there is any other dependence on the SCHEMA_EXAMPLE_PROMPT variable.
       Because the SCHEMA_EXAMPLE_PROMPT variable will no longer change based on
@@ -88,9 +87,9 @@ def filter_item(schema, items) -> List[Dict[str, Any]]:

 class PropertyGraphExtract:
     def __init__(
-        self,
-        llm: BaseLLM,
-        example_prompt: str = prompt.extract_graph_prompt
+            self,
+            llm: BaseLLM,
+            example_prompt: str = prompt.extract_graph_prompt
     ) -> None:
         self.llm = llm
         self.example_prompt = example_prompt
@@ -125,33 +124,41 @@ class PropertyGraphExtract:
         return self.llm.generate(prompt=prompt)

     def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
-        # analyze llm generated text to JSON
-        json_strings = re.findall(r'(\[.*?])', text, re.DOTALL)
-        longest_json = max(json_strings, key=lambda x: len(''.join(x)), default=('', ''))
-
-        longest_json_str = ''.join(longest_json).strip()
+        # Use regex to extract a JSON object with curly braces
+        json_match = re.search(r'({.*})', text, re.DOTALL)
+        if not json_match:
+            log.critical("Invalid property graph! No JSON object found, "
+                         "please check the output format example in prompt.")
+            return []
+        json_str = json_match.group(1).strip()
         items = []
         try:
-            property_graph = json.loads(longest_json_str)
+            property_graph = json.loads(json_str)
+            # Expect property_graph to be a dict with keys "vertices" and "edges"
+            if not (isinstance(property_graph, dict) and "vertices" in property_graph and "edges" in property_graph):
+                log.critical("Invalid property graph format; expecting 'vertices' and 'edges'.")
+                return items
+
+            # Create sets for valid vertex and edge labels based on the schema
             vertex_label_set = {vertex["name"] for vertex in schema["vertexlabels"]}
             edge_label_set = {edge["name"] for edge in schema["edgelabels"]}
-            for item in property_graph:
-                if not isinstance(item, dict):
-                    log.warning("Invalid property graph item type '%s'.", type(item))
-                    continue
-                if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
-                    log.warning("Invalid item keys '%s'.", item.keys())
-                    continue
-                if item["type"] == "vertex" or item["type"] == "edge":
-                    if (item["label"] not in vertex_label_set
-                            and item["label"] not in edge_label_set):
-                        log.warning("Invalid '%s' label '%s' has been ignored.", item["type"], item["label"])
-                    else:
-                        items.append(item)
-                else:
-                    log.warning("Invalid item type '%s' has been ignored.", item["type"])
-        except json.JSONDecodeError:
-            log.critical("Invalid property graph! Please check the extracted JSON data carefully")
+
+            def process_items(item_list, valid_labels, item_type):
+                for item in item_list:
+                    if not isinstance(item, dict):
+                        log.warning("Invalid property graph item type '%s'.", type(item))
+                        continue
+                    if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
+                        log.warning("Invalid item keys '%s'.", item.keys())
+                        continue
+                    if item["label"] not in valid_labels:
+                        log.warning("Invalid %s label '%s' has been ignored.", item_type, item["label"])
+                        continue
+                    items.append(item)
+
+            process_items(property_graph["vertices"], vertex_label_set, "vertex")
+            process_items(property_graph["edges"], edge_label_set, "edge")
+        except json.JSONDecodeError:
+            log.critical("Invalid property graph JSON! Please check the extracted JSON data carefully")
         return items
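
The core change above, `re.findall(r'(\[.*?])', ...)` replaced by `re.search(r'({.*})', text, re.DOTALL)`, can be sketched outside the class as a small standalone function. The helper name `extract_property_graph` and the sample schema/LLM output below are hypothetical illustrations, not code from the repository; only the regex and the vertices/edges filtering idea come from the commit:

```python
import json
import re

def extract_property_graph(text, schema):
    # Greedy match from the first '{' to the last '}'; DOTALL lets '.' span
    # newlines, so prose around the JSON object is tolerated.
    match = re.search(r'({.*})', text, re.DOTALL)
    if not match:
        return []
    try:
        graph = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return []
    # The prompt format expects an object with "vertices" and "edges" keys.
    if not (isinstance(graph, dict) and "vertices" in graph and "edges" in graph):
        return []
    vertex_labels = {v["name"] for v in schema["vertexlabels"]}
    edge_labels = {e["name"] for e in schema["edgelabels"]}
    items = []
    # Keep only items whose label is declared in the schema.
    for item in graph["vertices"]:
        if isinstance(item, dict) and item.get("label") in vertex_labels:
            items.append(item)
    for item in graph["edges"]:
        if isinstance(item, dict) and item.get("label") in edge_labels:
            items.append(item)
    return items

# Hypothetical LLM output and schema for illustration:
llm_output = ('Here is the result:\n'
              '{"vertices": [{"label": "person", "type": "vertex"}], '
              '"edges": [{"label": "knows", "type": "edge"}, '
              '{"label": "bogus", "type": "edge"}]}')
schema = {"vertexlabels": [{"name": "person"}], "edgelabels": [{"name": "knows"}]}
print(len(extract_property_graph(llm_output, schema)))  # prints 2: "bogus" is filtered
```

Note the design trade-off in the greedy `({.*})`: it assumes the reply contains one JSON object of interest, so stray braces after the real object would extend the match and break `json.loads`, whereas the old `(\[.*?])` could never match the `{...}` object format the prompt now asks for.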