liuxiaocs7 commented on code in PR #29:
URL: https://github.com/apache/incubator-hugegraph-ai/pull/29#discussion_r1475611968

##########
hugegraph-llm/README.md:
##########
@@ -19,8 +19,64 @@ graph systems and large language models.
 ## Examples (knowledge graph construction by llm)
-1. Start the HugeGraph database, you can do it via Docker. Refer to this [link](https://hub.docker.com/r/hugegraph/hugegraph) for guidance
-2. Run example like `python hugegraph-llm/examples/build_kg_test.py`
+> 1. Start the HugeGraph database, you can do it via Docker. Refer to this [link](https://hub.docker.com/r/hugegraph/hugegraph) for guidance
+> 2. Run example like `python hugegraph-llm/examples/build_kg_test.py`
+>
+> Note: If you need a proxy to access OpenAI's API, please set your HTTP proxy in `build_kg_test.py`.
-Note: If you need a proxy to access OpenAI's API, please set your HTTP proxy in `build_kg_test.py`.
+The `KgBuilder` class is used to construct a knowledge graph. Here is a brief usage guide:
+1. **Initialization**: The `KgBuilder` class is initialized with an instance of a language model. This can be obtained from the `LLMs` class.
+
+```python
+from hugegraph_llm.llms.init_llm import LLMs
+from hugegraph_llm.operators.kg_construction_task import KgBuilder
+
+TEXT = ""
+builder = KgBuilder(LLMs().get_llm())
+(
+    builder
+    .import_schema(from_hugegraph="talent_graph").print_result()
+    .extract_triples(TEXT).print_result()
+    .disambiguate_word_sense().print_result()
+    .commit_to_hugegraph()
+    .run()
+)
+```
+
+2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance,a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.

Review Comment:
```suggestion
2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.
```

##########
hugegraph-llm/examples/build_kg_test.py:
##########
@@ -31,47 +32,29 @@ " their distinctive digital presence through their respective webpages, showcasing their"
         " varied interests and experiences."
     )
-    builder = KgBuilder(default_llm)
-    # spo triple extract
-    builder.extract_spo_triple(TEXT).print_result().commit_to_hugegraph(spo=True).run()
-    # build kg with only text
-    builder.extract_nodes_relationships(TEXT).disambiguate_word_sense().commit_to_hugegraph().run()
-    # build kg with text and schemas
-    nodes_schemas = [
-        {
-            "label": "Person",
-            "primary_key": "name",
-            "properties": {
-                "age": "int",
-                "name": "text",
-                "occupation": "text",
-            },
-        },
-        {
-            "label": "Webpage",
-            "primary_key": "name",
-            "properties": {"name": "text", "url": "text"},
-        },
-    ]
-    relationships_schemas = [
-        {
-            "start": "Person",
-            "end": "Person",
-            "type": "roommate",
-            "properties": {"start": "int"},
-        },
-        {
-            "start": "Person",
-            "end": "Webpage",
-            "type": "owns",
-            "properties": {},
-        },
-    ]
+    schema = {
+        "vertices": [
+            {"vertex_label": "person", "properties": ["name", "age", "occupation"]},
+            {"vertex_label": "webpage", "properties": ["name", "url"]},
+        ],
+        "edges": [
+            {
+                "edge_label": "roommate",
+                "source_vertex_label": "person",
+                "target_vertex_label": "person",
+                "properties": {},
+            }
+        ],
+    }
     (
-        builder.parse_text_to_data_with_schemas(TEXT, nodes_schemas, relationships_schemas)
-        .disambiguate_data_with_schemas()
-        .commit_data_to_kg()
+        builder
+        .import_schema(from_hugegraph="talent_graph").print_result()
+        # .import_schema(from_extraction="fefe").print_result().run()
+        # .import_schema(from_input=schema).print_result()

Review Comment:
Why does the comment here not match what is described in the documentation?

```python
# Import schema from a HugeGraph instance
import_schema(from_hugegraph="talent_graph").print_result()
# Import schema from an extraction result
import_schema(from_extraction="xxx").print_result()
# Import schema from user-defined schema
import_schema(from_user_defined="xxx").print_result()
```
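For comparison, here is roughly how the user-defined path could read once the naming settles. This is a minimal sketch, not the final API: the schema dict is copied from this example file, and the keyword is exactly the open question (the README documents `from_user_defined`, while this file passes `from_input`).

```python
from hugegraph_llm.llms.init_llm import LLMs
from hugegraph_llm.operators.kg_construction_task import KgBuilder

# Schema dict copied from build_kg_test.py in this PR.
schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "age", "occupation"]},
        {"vertex_label": "webpage", "properties": ["name", "url"]},
    ],
    "edges": [
        {
            "edge_label": "roommate",
            "source_vertex_label": "person",
            "target_vertex_label": "person",
            "properties": {},
        }
    ],
}

builder = KgBuilder(LLMs().get_llm())
# Keyword name is an assumption: the README says `from_user_defined`,
# while this example file currently uses `from_input`.
builder.import_schema(from_input=schema).print_result().run()
```
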
##########
hugegraph-llm/src/tests/operators/llm_op/test_disambiguate_data.py:
##########
@@ -0,0 +1,83 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+from hugegraph_llm.llms.init_llm import LLMs
+from hugegraph_llm.operators.llm_op.disambiguate_data import DisambiguateData
+
+
+class TestDisambiguateData(unittest.TestCase):
+    def setUp(self):
+        self.triples = {
+            "triples": [
+                (' "Alice "', ' "Age "', ' "25 "'),
+                (' "Alice "', ' "Profession "', ' "lawyer "'),
+                (' "Bob "', ' "Job "', ' "journalist "'),
+                (' "Alice "', ' "Roommate of "', ' "Bob "'),
+                (' "lucy "', "roommate", ' "Bob "'),
+                (' "Alice "', ' "is the ownner of "', ' "http://www.alice.com "'),
+                (' "Bob "', ' "Owns "', ' "http://www.bob.com "'),
+            ]
+        }
+
+        self.triples_with_schema = {
+            "vertices": [
+                {
+                    "name": "Alice",
+                    "label": "person",
+                    "properties": {"name": "Alice", "age": "25", "occupation": "lawyer"},
+                },
+                {
+                    "name": "Bob",
+                    "label": "person",
+                    "properties": {"name": "Bob", "occupation": "journalist"},
+                },
+                {
+                    "name": "www.alice.com",
+                    "label": "webpage",
+                    "properties": {"name": "www.alice.com", "url": "www.alice.com"},
+                },
+                {
+                    "name": "www.bob.com",
+                    "label": "webpage",
+                    "properties": {"name": "www.bob.com", "url": "www.bob.com"},
+                },
+            ],
+            "edges": [{"start": "Alice", "end": "Bob", "type": "roommate", "properties": {}}],
+            "schema": {
+                "vertices": [
+                    {"vertex_label": "person", "properties": ["name", "age", "occupation"]},
+                    {"vertex_label": "webpage", "properties": ["name", "url"]},
+                ],
+                "edges": [
+                    {
+                        "edge_label": "roommate",
+                        "source_vertex_label": "person",
+                        "target_vertex_label": "person",
+                        "properties": [],
+                    }
+                ],
+            },
+        }
+        self.llm = LLMs().get_llm()
+        # self.llm = None

Review Comment:
remove?

##########
hugegraph-llm/src/tests/operators/llm_op/test_disambiguate_data.py:
##########
@@ -0,0 +1,83 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+from hugegraph_llm.llms.init_llm import LLMs
+from hugegraph_llm.operators.llm_op.disambiguate_data import DisambiguateData
+
+
+class TestDisambiguateData(unittest.TestCase):
+    def setUp(self):
+        self.triples = {
+            "triples": [
+                (' "Alice "', ' "Age "', ' "25 "'),

Review Comment:
Why do we need extra quotes and spaces? Can't `('Alice', 'Age', '25')` represent a SPO?

##########
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py:
##########
@@ -165,110 +65,62 @@ def split_string_to_fit_token_space(
     return combined_chunks
-def get_spo_from_result(result):
-    res = []
-    for row in result:
-        row = row.replace("\\n", "").replace("\\", "")
-        pattern = r'\("(.*?)", "(.*?)", "(.*?)"\)'
-        res += re.findall(pattern, row)
-    return res
+def extract_by_regex(text, triples):

Review Comment:
```suggestion
def extract_triples_by_regex(text, triples):
```
`v-n-adv` phrase maybe better?
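For context on the naming: the new helper only pulls `(subject, predicate, object)` tuples out of the model output with a regex, so a verb-plus-noun name like `extract_triples_by_regex` reads naturally. A tiny sketch using the pattern from this PR (the sample output string is made up):

```python
import re

# Pattern copied from the new extract_by_regex in this PR; sample output is illustrative.
llm_output = "(Alice, Age, 25) (Alice, Roommate of, Bob)"
triples = {"triples": []}

pattern = r"\((.*?), (.*?), (.*?)\)"
triples["triples"] += re.findall(pattern, llm_output)

print(triples["triples"])
# [('Alice', 'Age', '25'), ('Alice', 'Roommate of', 'Bob')]
```
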
##########
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py:
##########
@@ -220,9 +104,9 @@ def run(self, data: dict):
             ).secondary().ifNotExist().create()
         for item in data:
-            s = item[0]
-            p = item[1]
-            o = item[2]
+            s = item[0].strip()
+            p = item[1].strip()
+            o = item[2].strip()

Review Comment:
```suggestion
            s, p, o = (element.strip() for element in item)
```
WDYT?

##########
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py:
##########
@@ -165,110 +65,62 @@ def split_string_to_fit_token_space(
     return combined_chunks
-def get_spo_from_result(result):
-    res = []
-    for row in result:
-        row = row.replace("\\n", "").replace("\\", "")
-        pattern = r'\("(.*?)", "(.*?)", "(.*?)"\)'
-        res += re.findall(pattern, row)
-    return res
+def extract_by_regex(text, triples):
+    text = text.replace("\\n", " ").replace("\\", " ").replace("\n", " ")
+    pattern = r"\((.*?), (.*?), (.*?)\)"
+    triples["triples"] += re.findall(pattern, text)
+
+def extract_by_regex_with_schema(schema, text, graph):

Review Comment:
same above?

##########
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py:
##########
@@ -0,0 +1,63 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+from typing import Any
+
+
+class CheckSchema:
+    def __init__(self, data):
+        self.result = None
+        self.data = data
+
+    def run(self, schema=None) -> Any:
+        data = self.data or schema
+        if not isinstance(data, dict):
+            raise ValueError("Input data is not a dictionary.")

Review Comment:
```suggestion
            raise ValueError("Input schema is not a dictionary.")
```
Emphasizing `schema` might be better? Same below.
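As a quick illustration of how the reworded message would surface to callers (a minimal sketch using the `CheckSchema` operator added in this PR; only the dict check shown above is exercised):

```python
from hugegraph_llm.operators.common_op.check_schema import CheckSchema

# Passing anything that is not a dict trips the first validation step.
try:
    CheckSchema("person, webpage").run()
except ValueError as err:
    print(err)  # "Input schema is not a dictionary." with the suggested wording
```
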
##########
hugegraph-llm/src/tests/operators/llm_op/test_info_extract.py:
##########
@@ -0,0 +1,142 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+from hugegraph_llm.operators.llm_op.info_extract import (
+    InfoExtract,
+    extract_by_regex_with_schema,
+    extract_by_regex,
+)
+
+
+class TestInfoExtract(unittest.TestCase):
+    def setUp(self):
+        self.schema = {
+            "vertices": [
+                {"vertex_label": "person", "properties": ["name", "age", "occupation"]},
+                {"vertex_label": "webpage", "properties": ["name", "url"]},
+            ],
+            "edges": [
+                {
+                    "edge_label": "roommate",
+                    "source_vertex_label": "person",
+                    "target_vertex_label": "person",
+                    "properties": [],
+                }
+            ],
+        }
+        # self.llm = LLMs().get_llm()

Review Comment:
remove?

##########
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py:
##########
@@ -0,0 +1,54 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+from hugegraph_llm.utils.config import Config
+from hugegraph_llm.utils.constants import Constants
+from pyhugegraph.client import PyHugeClient
+
+
+class SchemaManager:
+    def __init__(self, graph_name: str):
+        config = Config(section=Constants.HUGEGRAPH_CONFIG)
+        self.graph_name = graph_name
+        self.client = PyHugeClient(
+            config.get_graph_ip(),
+            config.get_graph_port(),
+            graph_name,
+            config.get_graph_user(),
+            config.get_graph_pwd(),
+        )
+        self.schema = self.client.schema()
+
+    def run(self, data: dict):
+        schema = self.schema.getSchema()
+        vertices = []
+        for vl in schema["vertexlabels"]:
+            vertex = {"vertex_label": vl["name"], "properties": vl["properties"]}
+            vertices.append(vertex)
+        edges = []
+        for el in schema["edgelabels"]:
+            edge = {
+                "edge_label": el["name"],
+                "source_vertex_label": el["source_label"],
+                "target_vertex_label": el["target_label"],
+                "properties": el["properties"],
+            }
+            edges.append(edge)
+        if not vertices and not edges:
+            raise Exception(f"Can not get {self.graph_name}'s schema from HugeGraph!")

Review Comment:
The schema will fail if there are no edges in the graph?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
