Motivation
Hi, community. More and more users are adopting graph databases such as
JanusGraph and HugeGraph to represent relationships in personal social
networks, for example to count the activity of each user in the network.
Most graph databases implement both the graph-building and graph-traversal
stages with Gremlin syntax. However, this is very unfriendly to users who
are not familiar with Gremlin. Calcite, as a query framework, provides an
adapter interface for adapting different database dialects, covering
parsing, relational algebra conversion, and query plan binding.
Our company has used this approach to solve the problem of adapting various
graph databases. This is my repository:
https://github.com/kaori-seasons/calcite-gremlin-sql
Background
Calcite itself supports extending a database language through its adapter
interface, which lets developers who understand the meaning of the grammar
simplify dialect handling: for example, extending SqlNode to implement
syntax analysis and extending RelNode to implement logical plan mapping.

TinkerPop is an adaptation framework for various graph databases; it is
where Gremlin syntax was first introduced, and it has since become a
universal query layer for graph databases. While extending query statements
through Calcite's adapter interface, we will also use TinkerPop's universal
graph database API to provide dialect compatibility for different graph
databases.
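
As a rough illustration of the RelNode side, a Gremlin-backed scan could be
declared like the sketch below. The class name is illustrative and not
necessarily the code in the repository; the constructor follows recent
Calcite versions, which thread a list of hints through TableScan:

import java.util.Collections;
import org.apache.calcite.plan.RelOptCluster;
import org.apache.calcite.plan.RelOptTable;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.core.TableScan;

/** Hypothetical relational node representing a scan of a Gremlin-backed table. */
public class GremlinTableScan extends TableScan {
  public GremlinTableScan(RelOptCluster cluster, RelTraitSet traitSet,
      RelOptTable table) {
    // Recent Calcite versions take a list of RelHints here; we pass none.
    super(cluster, traitSet, Collections.emptyList(), table);
  }
}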
A simple example of the translation itself:
From
SELECT "key" FROM inttype
maps to
g.V().hasLabel("inttype").group().unfold().select(Column.values)
  .order().by(__.unfold().id())
  .project("inttype")
  .by(__.project("key")
    .by(__.unfold().choose(__.has("key"), __.values("key"),
        __.constant("$%#NULL#%$"))))
The design is divided into three layers: a parsing syntax layer, a
relational algebra transformation layer, and a logical plan binding layer.

Parsing syntax layer: in the query-parsing phase, fields and equality
conditions are split out and converted into vertices and edges.

Relational algebra layer: the vertex and edge collections are converted
into the TableScan abstraction that Calcite uses during the
aggregation/sorting/query stages, which makes it convenient to generate
query plans from the condition and field information. Methods such as join
scanning and single-table filtering can then traverse the graph from any
edge or any starting vertex.

Logical plan binding layer: behaviors such as join scanning, single-table
filtering, and projection are bound to Calcite's planner to generate query
plans.
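
For the first layer, stock Calcite already provides the entry point. The
minimal sketch below (assuming only the default SqlParser configuration,
whose quoting accepts "key") shows SQL text becoming a SqlNode AST, which
the adapter then walks to split out vertex and edge information; the
validator and planner wiring needed by the later two layers is omitted:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseLayerSketch {
  public static void main(String[] args) throws SqlParseException {
    // Parsing syntax layer: SQL text -> SqlNode AST via plain Calcite.
    SqlParser parser = SqlParser.create("SELECT \"key\" FROM inttype");
    SqlNode ast = parser.parseQuery();
    System.out.println(ast);
  }
}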
The process of generating a Gremlin logical plan from a SELECT statement (a
runnable sketch of these mappings follows the list):
1. First, every graph database builds its graph from a source. We use the
   GraphTraversalSource provided by TinkerPop as that origin and extract
   the vertex and edge information from the incoming syntax. This step is
   implemented in SqlSchemaGrabber.
2. Second, for select/where/having/order by/group by, our plan in the
   parsing phase is as follows:
   - group by: a vertex has out-degree and in-degree attributes in the
     graph. From the data-table perspective, this is equivalent to
     aggregating the table data along the two dimensions IN and OUT. This
     behavior also corresponds to the table-traversal operation on the
     graph; during traversal, fold/unfold tags are generated to represent
     the direction of the traversal.
   - select: a select that scans the whole table can be regarded as
     projecting all columns onto vertex attributes; the value changes of
     each column map to the Gremlin operations that add vertices.
   - where: in graph-computing semantics, the filter operation can be
     regarded as filtering the edges connected to a vertex's out-degree and
     in-degree, so it involves no relational algebra conversion and is
     instead pushed directly down to the logical plan.
   - order by: as mentioned above, fold/unfold tags are generated along the
     traversal path to represent the forward/backward direction. If a field
     is encountered with no sortable value, we fall back to the
     GraphTraversalSource origin and end the sorting operation. If there
     are sortable values, they are unified in SqlTraversalEngine: in-degree
     and out-degree are counted separately for aggregation and then
     combined with group by according to the IN/OUT label.
   - having: same meaning as group by, but with a different label (in
     addition to the IN/OUT columns, specific vertex fields must be
     specified).
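
To make these mappings concrete, here is a minimal hand-written sketch
against TinkerPop's in-memory TinkerGraph. The traversals are illustrative
equivalents of the clauses above, not the converter's actual output, and
the labels and property names are made up:

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*;

public class ClauseMappingSketch {
  public static void main(String[] args) {
    TinkerGraph graph = TinkerGraph.open();
    GraphTraversalSource g = graph.traversal(); // the traversal origin (step 1)

    // Tiny social graph: two users and one "follows" edge.
    Object alice = g.addV("person").property("name", "alice").next().id();
    Object bob = g.addV("person").property("name", "bob").next().id();
    g.V(alice).addE("follows").to(g.V(bob)).iterate();

    // select: full scan of the "person" table, projecting columns onto
    // vertex attributes.
    g.V().hasLabel("person")
        .project("name")
        .by(coalesce(values("name"), constant("$%#NULL#%$")))
        .forEachRemaining(System.out::println);

    // where: pushed down as a has()-style filter on the scanned vertices.
    g.V().hasLabel("person").has("name", "alice")
        .values("name")
        .forEachRemaining(System.out::println);

    // group by over the IN/OUT dimensions: count each user's out-degree
    // and in-degree, the "activity" aggregation from the motivation.
    g.V().hasLabel("person")
        .project("name", "out", "in")
        .by(values("name"))
        .by(outE().count())
        .by(inE().count())
        .forEachRemaining(System.out::println);

    // order by: sort by a property, falling back to the vertex id when
    // the sort key is absent.
    g.V().hasLabel("person")
        .order().by(coalesce(values("name"), id()), Order.asc)
        .values("name")
        .forEachRemaining(System.out::println);
  }
}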
Currently I have only completed unit tests that translate SQL into a
Gremlin execution plan; test cases for group by and where still need to be
added. In addition, I will write solid integration tests against mainstream
graph databases such as Neo4j and JanusGraph, ensuring that the graph
database's API is called successfully after the SQL request is converted
into the Gremlin execution plan.
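
Roughly, such a unit test could look like the sketch below.
SqlToGremlin.translate is a hypothetical entry point standing in for the
converter in the repository, and the asserted prefix mirrors the example
earlier in this mail:

import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

public class SqlToGremlinPlanTest {
  @Test
  public void selectSingleColumnScansLabel() {
    // Hypothetical converter API; the real entry point lives in the repo.
    String plan = SqlToGremlin.translate("SELECT \"key\" FROM inttype");
    assertTrue(plan.startsWith("g.V().hasLabel(\"inttype\")"));
  }
}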
Finally, community members are welcome to give suggestions.