I have invited pengzhiwei2018, who is an expert in the field of graph computing and a Calcite committer. I hope he can offer better suggestions in a later reply.
柳尘 <[email protected]> 于2023年12月24日周日 11:19写道: > Motivation > > Hi, community. Currently, more and more users are using some graph > databases, such as JanusGraph, HugeGraph, etc. > To do some relationship representation of personnel social networks, > It is used to count the activity of each user in the social network. Most > graph databases are in the graph building and graph traversal stage. > All will be implemented using Gremlin syntax. > However, this is very unfriendly to users who are not familiar with > gremlin syntax. While calcite exists as a query framework, > It also provides an adapter interface to adapt to different database > dialects, such as parsing, relational algebra conversion, and query plan > binding. > Our company has solved the problem of adapting various graph databases. > This is my warehouse: https://github.com/kaori-seasons/calcite-gremlin-sql > > > Background > > Calcite itself supports the database language expansion of the adapter, > which enables users to understand the meaning of the grammar. > Complete the simplification of the dialect. For example, expand SqlNode to > implement syntax analysis, and expand RelNode to implement logical plan > mapping. > > thinkerpop is an adaptation framework for various graph databases. In this > framework, gremlin syntax is mentioned for the first time. > It has now become a universal query layer for graph databases. While > expanding query statements through calcite’s adapter interface, > We will also use thinkerpop's universal graph database API to provide > dialect compatibility for different graph databases. > > Give a simple example: > From > > SELECT "key" FROM inttype > > maps to > > g.V().hasLabel("inttype").group().unfold().select(Column.values).order().by(_.unfold().id()).project("inttype"). > > by(.project("key").by(.unfold().choose(.has("key"),.values("key"),_.constant("\$%#NULL# > %\$")))) > > > > > > The design architecture is divided into three layers. > > Analytical syntax layer, relational algebra transformation layer, logical > plan binding layer. > > Parsing syntax layer: In the parsing query statement phase, fields and > equivalent conditions are split and converted into points and edges. > > Relational algebra layer: Split it into a collection of points and edges, > and convert it into a TableScan during the aggregation/sorting/query stage > where calcite abstracts it. > It is convenient to generate query plans based on conditions and field > information. > Connection scanning/single table filtering and other methods can be used > to traverse from any edge/any starting point in the graph > > Logical plan binding layer: Bind behaviors such as connection > scanning/single table filtering/projection to calcite’s planner to generate > query plans. > > The process of generating Gremlin logical plan using select statement: > > 1. First of all, all graph databases start from a source point to build > the graph. We will use the GraphTraversalSource provided by thinkerpop. > As the origin, extract the syntax of the incoming point and side > information. This step will be implemented in SqlSchemaGrabber > 2. Secondly, for select/where/having/order by/group by our plan in the > parsing phase is as follows: > > - group by: for a point. There are out-degree and in-degree attributes in > the graph. From the perspective of the data table, it is equivalent to > converting the table data into IN or OUT. > These two dimensions are aggregated. This behavior also corresponds to the > table traversal graph operation. 
The design is divided into three layers: a syntax parsing layer, a relational algebra transformation layer, and a logical plan binding layer.

Syntax parsing layer: while parsing the query statement, fields and equality conditions are split out and converted into vertices and edges.

Relational algebra layer: the query is split into a collection of vertices and edges and converted into a TableScan at the aggregation/sorting/query stages that Calcite abstracts. This makes it convenient to generate query plans from the conditions and field information; join scans, single-table filters, and similar operations can then traverse the graph from any edge or any starting vertex.

Logical plan binding layer: behaviors such as join scanning, single-table filtering, and projection are bound to Calcite's planner to generate query plans.

The process of generating a Gremlin logical plan from a SELECT statement:

1. First, every graph database builds its graph starting from a source. We use the GraphTraversalSource provided by TinkerPop as that origin and extract the vertex and edge metadata of the incoming schema. This step will be implemented in SqlSchemaGrabber (a rough sketch of this discovery step appears after this list).
2. Second, for select/where/having/order by/group by, our plan in the parsing phase is as follows:

- group by: a vertex has out-degree and in-degree attributes in the graph. From the point of view of the data table, this is equivalent to converting the table data into two dimensions, IN and OUT, and aggregating over them; this behavior also corresponds to the table-to-graph traversal operation. During the graph traversal, fold/unfold tags are generated to represent the direction of the traversal (see the degree sketch after this list).
- select: a full table scan can be regarded as projecting all columns onto vertex properties, and the value changes of each column map to the Gremlin operations that add vertices.
- where: in graph computing semantics, the filter operates on the edges connected to the filtered vertex's out-degree and in-degree, so it does not go through relational algebra conversion; instead it is pushed directly into the logical plan.
- order by: as mentioned above, fold/unfold tags are generated along the path to represent the forward/backward direction. If a field is encountered that has no sortable value, we fall back to the GraphTraversalSource origin and end the sorting operation. If there are sortable values, they are unified in SqlTraversalEngine: the in-degree and out-degree are counted separately for aggregation and then combined with group by according to the label (IN/OUT).
- having: same meaning as group by, but the label is different (in addition to the IN/OUT columns, specific vertex fields need to be specified).
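As a rough sketch of the discovery described in step 1: walk the graph from the GraphTraversalSource and collect the labels and property keys that will back the generated tables. The class name, sample data, and exact traversals below are illustrative assumptions, not the actual SqlSchemaGrabber code.

import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class SchemaGrabSketch {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        g.addV("person").property("name", "alice").iterate();
        g.addV("software").property("lang", "java").iterate();

        // Distinct vertex labels become candidate tables.
        for (String label : g.V().label().dedup().toList()) {
            // Property keys seen under that label become candidate columns.
            List<String> columns = g.V().hasLabel(label).properties().key().dedup().toList();
            System.out.println(label + " -> " + columns);
        }
        // Distinct edge labels become candidate relationship tables.
        System.out.println("edge labels: " + g.E().label().dedup().toList());
    }
}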
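And a small sketch of the IN/OUT degree idea mentioned in the group by and order by bullets: each vertex's out-degree and in-degree are exposed as two extra columns that the aggregation can group and count on. The sample graph, labels, and property names are made-up assumptions for illustration only.

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class GroupByDirectionSketch {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        Vertex alice = g.addV("person").property("name", "alice").next();
        Vertex bob = g.addV("person").property("name", "bob").next();
        g.addE("follows").from(alice).to(bob).iterate();

        // One row per person: its name plus the OUT/IN degree dimensions
        // that the group by / having translation aggregates over.
        g.V().hasLabel("person")
                .project("name", "OUT", "IN")
                .by("name")
                .by(__.outE().count())
                .by(__.inE().count())
                .forEachRemaining(System.out::println);
    }
}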
Currently, I have only completed unit tests that translate SQL into a Gremlin execution plan; test cases for group by and where still need to be added. In addition, I will write solid integration tests against mainstream graph databases such as Neo4j and JanusGraph to ensure that the graph database API is called successfully after a SQL request is converted into a Gremlin execution plan.

Finally, community members are welcome to give suggestions.