[ https://issues.apache.org/jira/browse/GSOC-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Imba Jin updated GSOC-283:
--------------------------
    Description:
Apache [HugeGraph|https://hugegraph.apache.org/] (incubating) is a fast, highly scalable [graph database|https://en.wikipedia.org/wiki/Graph_database]/computing/AI ecosystem. Billions of vertices and edges can be stored in and queried from HugeGraph thanks to its OLTP/OLAP capabilities.

Website: [https://hugegraph.apache.org/]
GitHub:
- [https://github.com/apache/incubator-hugegraph/]
- [https://github.com/apache/incubator-hugegraph-ai/]

h2. Description

Currently, we have implemented a basic GraphRAG that relies on fixed processing workflows (e.g., knowledge retrieval and graph structure updates share the same execution pipeline), which limits flexibility and adds overhead in complex scenarios. The proposed task introduces an agentic architecture based on the principles of "dynamic awareness, lightweight scheduling, concurrent execution," focused on solving the following issues:

# *Rigid Intent Recognition*: Existing systems cannot effectively distinguish between simple retrievals (e.g., entity queries) and complex operations (e.g., multi-hop reasoning), and often default to BFS-based template subgraph searches.
# *Coupled Execution Resources*: Memory and compute resources are not isolated by task characteristics, so long-tail tasks block high-priority requests.
# *Lack of Feedback Mechanisms*: There is no self-correction for erroneous operations (e.g., automatically switching to similar vertices/entities after a path retrieval failure).

The task will include three core parts:

*1. Dynamic Awareness Layer*
* Implement an LLM-based real-time (as of February 14, 2025) intent classifier that categorizes tasks (L1 simple retrieval / L2 path reasoning / L3 graph computation / L4+ etc.) based on semantic features (verb types, entity complexity, temporal modifiers).
* Build a lightweight operation cache that generates feature hashes for high-frequency requests, enabling millisecond-level intent matching.

*2. Task Orchestration Layer*
* Introduce a suitable workflow/taskflow framework emphasizing low coupling, high performance, and flexibility.
* Adopt a preemptive scheduling mechanism that lets high-priority tasks pause non-critical phases of low-priority tasks (e.g., suspending subgraph preloading without interrupting core computations).

*3. Concurrent Execution*
* Decouple traditional RAG pipelines into composable operations (entity recall → path validation → context enhancement → result refinement), with dynamic enable/disable support for each component.
* Implement automatic execution engine degradation that triggers fallback strategies when a sub-operation fails (e.g., switching to alternative methods if Gremlin queries time out).

h2. Recommended Skills

# Proficiency in Python and familiarity with at least one open- or closed-source LLM.
# Experience with one LLM RAG/Agent framework such as LangGraph, RAGFlow, LlamaIndex, or Dify.
# Knowledge of LLM optimization techniques and RAG construction (KG extraction/construction experience is a plus).
# Strong algorithmic engineering skills (problem abstraction, algorithm research, big data processing, model tuning).
# Familiarity with VectorDB/Graph/KG/HugeGraph read-write workflows and principles.
# Understanding of graph algorithms (e.g., community detection, centrality, PageRank); open-source community experience is preferred.

*Task List*
* Develop a hierarchical triggering mechanism for the intent classifier to categorize L1~LN tasks within milliseconds (accuracy >90%).
* Semi-automatically generate Graph Schema/extraction prompts.
* Support dynamic routing and query decomposition.
* Design an execution trace tracker to log micro-operation resource consumption and generate optimization reports.
* Enhance retrieval with graph algorithms: apply node importance evaluation, path search, etc., to optimize knowledge recall.
* Implement a dialogue memory management module for context-aware state tracking and information reuse.

h3. Size

* Difficulty: Hard
* Project size: ~350 hours (full-time/large)

h2. Potential Mentors

* Imba Jin: [j...@apache.org|mailto:j...@apache.org] (Apache HugeGraph PPMC)
* Simon: [m...@apache.org|mailto:m...@apache.org] (Apache HugeGraph PPMC)

> [GSoC][HugeGraph] Implement Agentic GraphRAG Architecture
> ---------------------------------------------------------
>
>                 Key: GSOC-283
>                 URL: https://issues.apache.org/jira/browse/GSOC-283
>             Project: Comdev GSOC
>          Issue Type: Task
>            Reporter: Imba Jin
>            Priority: Major
>              Labels: RAG, agent, graph, gsoc2025
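For readers who want a concrete picture of the Dynamic Awareness Layer described in this issue, the tiered lookup (feature-hash cache first, cheap rules second, LLM last) might be sketched as follows. This is only an illustration under stated assumptions: the names `feature_hash`, `rule_classify`, `IntentClassifier`, and `llm_classify` are hypothetical and not part of HugeGraph, the keyword rules are placeholders, and the LLM call is stubbed as a plain callable.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


def feature_hash(query: str) -> str:
    """Hash a normalized query so that requests with the same shape share a key.

    Assumption: quoted strings are entity mentions and digits are literals;
    both are masked so that 'Who is "Alice"?' and 'Who is "Bob"?' collide.
    """
    normalized = re.sub(r'"[^"]*"', "<ent>", query.lower())
    normalized = re.sub(r"\d+", "<num>", normalized)
    normalized = " ".join(normalized.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def rule_classify(query: str) -> Optional[str]:
    """Cheap keyword rules (placeholders); return None when no rule applies."""
    q = query.lower()
    if re.search(r"\b(pagerank|centrality|community|communities)\b", q):
        return "L3"  # graph computation
    if re.search(r"\b(path|connected|between|relationship|hops?)\b", q):
        return "L2"  # path reasoning
    if re.match(r"^(who|what) is\b", q):
        return "L1"  # simple retrieval
    return None


@dataclass
class IntentClassifier:
    """Hierarchical trigger: cache -> rules -> LLM (hypothetical design)."""

    llm_classify: Callable[[str], str]  # stand-in for a real LLM call
    cache: Dict[str, str] = field(default_factory=dict)

    def classify(self, query: str) -> Tuple[str, str]:
        """Return (level, source), where source is 'cache', 'rules', or 'llm'."""
        key = feature_hash(query)
        if key in self.cache:
            return self.cache[key], "cache"
        level = rule_classify(query)
        source = "rules"
        if level is None:
            level = self.llm_classify(query)
            source = "llm"
        self.cache[key] = level  # later requests of the same shape skip the LLM
        return level, source


if __name__ == "__main__":
    clf = IntentClassifier(llm_classify=lambda q: "L2")  # dummy LLM
    for q in ['Who is "Alice"?', 'Who is "Bob"?', "Run PageRank on the graph"]:
        print(q, "->", clf.classify(q))
```

Only requests that miss both the cache and the rules pay for an LLM call, which is the property the millisecond-level target in the task list depends on. A real implementation would also need cache eviction and a way to measure the >90% accuracy goal, neither of which is shown here.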