I cannot really figure out from this description what you plan to contribute.
Can you write a one-paragraph description, perhaps as a list of bullet points?
Something like: "parser for Gremlin dialect, converter from Gremlin IR to 
Calcite IR, etc."?

Mihai
________________________________
From: 柳尘 <[email protected]>
Sent: Monday, December 25, 2023 3:57 PM
To: [email protected] <[email protected]>
Subject: Re: Feature: Support gremlin adapter

Thank you very much for your reply. Before I open a new PR for this
requirement, I obviously need to write some test cases as illustration.
There are currently some test cases in the repository, but no related
integration tests. This work is in progress.

As an example:
As shown below, there is a representation of a vertex set and an edge set.
You can see that an edge is represented by a source (out) vertex, a target
(in) vertex, and a label.

{
  "tables": [
    { "name": "company", "columns": [ {"name": "name", "type": "string"} ] },
    { "name": "country", "columns": [ {"name": "name", "type": "string"} ] },
    { "name": "planet", "columns": [ {"name": "name", "type": "string"} ] },
    { "name": "person", "columns": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "integer"}
      ]
    },
    { "name": "spaceship", "columns": [
        {"name": "name", "type": "string"},
        {"name": "model", "type": "string"}
      ]
    },
    { "name": "satellite", "columns": [ {"name": "name", "type": "string"} ] },
    { "name": "sensor", "columns": [
        {"name": "name", "type": "string"},
        {"name": "type", "type": "string"}
      ]
    },
    { "name": "sensorReading", "columns": [
        {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
        {"name": "dt", "type": "long_date", "propertyName": "date"},
        {"name": "value", "type": "double"}
      ]
    },
    { "name": "fliesTo", "columns": [ {"name": "trips", "type": "integer"} ] },
    { "name": "orbits", "columns": [ {"name": "launched", "type": "integer"} ] }
  ],
  "relationships": [
    {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
    {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
    {"outTable": "person", "inTable": "planet", "edgeLabel": "travelledTo"},
    {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
    {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
    {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
    {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
    {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
    {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
  ]
}

Please allow me to introduce some basic grammar concepts of Gremlin first:

- V(): queries vertices. For example, g.V() returns all vertices and
g.V('v_id') returns a specific vertex.
- E(): queries edges; many kinds of traversal steps can be chained after it.
- id(): gets the id of vertices or edges. Example: g.V().id() returns the
ids of all vertices.
- label(): gets the labels of vertices or edges. Example: g.V().label()
returns the labels of all vertices.
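
Putting these together against the sample schema above (a sketch, assuming
the example schema has been loaded into a TinkerPop-compatible graph; the
'person' label comes from the sample, not from the adapter):

```groovy
g.V()                     // all vertices
g.V().hasLabel('person')  // only the vertices labeled "person"
g.V().id()                // the ids of all vertices
g.V().label()             // the labels of all vertices
```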

Traversal operations over a graph database mainly fall into the following
categories:

1. Based on the "vertex"

    1) out(label): visits the OUT-direction adjacent vertices of the
vertex, following the specified Edge Labels (zero Edge Labels means all
types of edges; one or more Edge Labels means edges with any of the given
labels, likewise below);

    2) in(label): visits the IN-direction adjacent vertices of the vertex,
following the specified Edge Labels;

    3) both(label): visits the adjacent vertices of the vertex in both
directions, following the specified Edge Labels;

    4) outE(label): visits the OUT-direction incident edges of the vertex,
following the specified Edge Labels;

    5) inE(label): visits the IN-direction incident edges of the vertex,
following the specified Edge Labels;

    6) bothE(label): visits the incident edges of the vertex in both
directions, following the specified Edge Labels;


2. Based on the "edge"

    1) outV(): visits the outgoing vertex of the edge, i.e. the starting
vertex of the edge;

    2) inV(): visits the incoming vertex of the edge, i.e. the target
vertex of the edge, the one the arrow points to;

    3) bothV(): visits both vertices of the edge;

    4) otherV(): visits the partner vertex of the edge, that is, the
vertex at the other end relative to the base vertex;
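
A few of these traversal steps applied to the sample schema (a sketch; the
'person', 'worksFor', and 'orbits' labels come from the example above):

```groovy
// OUT-direction neighbors: the companies each person works for
g.V().hasLabel('person').out('worksFor')

// the incident edges themselves, then back to their source vertices
g.V().hasLabel('person').outE('worksFor').outV()

// both endpoints of every "orbits" edge
g.E().hasLabel('orbits').bothV()
```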


Let us discuss vertices and edges separately.

If the user chooses to traverse based on vertices, the query and any
aggregation are performed per Edge Label. In the above example, if the user
needs to count a company's in-degree and out-degree together, the upper
layer can issue select count(1) from company to get the desired result
(this is abstracted by GremlinSqlSelectSingle, which rewrites graph
traversal operations for SQL operators such as GROUP BY/SELECT, ORDER BY,
and HAVING/WHERE to ensure an execution plan can be generated). Counting
only out-degrees or only in-degrees, however, requires a join. In other
words, counting the out-degree from the company table to the country table
is expressed in SQL as select count(1) from company left join country on
company.id = country.id. This corresponds to GremlinSqlSelectMulti in the
code. It abstracts an assembly class for a vertex table
(GremlinVertexTable) holding two sets, inEdges and outEdges, pre-traverses
the out-edges and in-edges that match the label, and caches them in
sqlMetadata. When the user queries, the edge set that meets the conditions
is taken from sqlMetadata and the execution plan is pushed down.
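
Concretely, the two SQL shapes discussed above are (a sketch; the id join
keys follow the prose and are an assumption, not a tested schema):

```sql
-- in-degree and out-degree counted together (GremlinSqlSelectSingle)
SELECT COUNT(1) FROM company;

-- only the out-degree from company to country (GremlinSqlSelectMulti)
SELECT COUNT(1)
FROM company
LEFT JOIN country ON company.id = country.id;
```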

If the user chooses to traverse based on edges, they need to aggregate on
an attribute of the company vertex table as the dimension, with the query
still constrained by the Edge Label; that is, on the name field in the
columns collection. You can write select count(1) from company group by
name to count out-degree and in-degree together.
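
In Gremlin terms, that per-name aggregation could be sketched as follows
(my own illustration of the idea, not the adapter's actual output):

```groovy
// group company vertices by their "name" property and count the
// edges incident to each group, in both directions
g.V().hasLabel('company')
 .group().by('name')
 .by(__.bothE().count())
```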


This made me realize that I need to write a more complete document. Friends
in the community are welcome to ask questions about, and add to, any of the
above.

Best wishes to you

Benchao Li <[email protected]> wrote on Mon, Dec 25, 2023 at 22:48:

> This sounds very interesting to me, although I'm not familiar with
> graph databases.
>
> I'm more interested in how to represent graph data structure and graph
> query in SQL, is there any relational database/SQL query engine that
> has done this before? Do we need to add some special data
> types/operators for graph processing?
>
> Forward Xu <[email protected]> wrote on Mon, Dec 25, 2023 at 09:51:
> >
> > hi
> >
> > This is a great feature to extend calcite from regular data queries to
> > graph queries (calculations),
> > +1 for looking forward to it.
> >
> > forwardxu
> >
> > 柳尘 <[email protected]> wrote on Sun, Dec 24, 2023 at 11:20:
> >
> > > Motivation
> > >
> > > Hi, community. Currently, more and more users are using graph
> > > databases such as JanusGraph, HugeGraph, etc.
> > > to represent relationships in social networks, for example to count
> > > the activity of each user in the network. Most graph databases
> > > implement both the graph-building and the graph-traversal stages
> > > using Gremlin syntax.
> > > However, this is very unfriendly to users who are not familiar with
> > > Gremlin syntax. Calcite, meanwhile, exists as a query framework
> > > and provides an adapter interface to adapt to different database
> > > dialects, covering parsing, relational algebra conversion, and query
> > > plan binding.
> > > Our company has solved the problem of adapting various graph
> > > databases. This is my repository:
> > > https://github.com/kaori-seasons/calcite-gremlin-sql
> > >
> > >
> > > Background
> > >
> > > Calcite supports extending database languages through adapters,
> > > which lets users capture the meaning of a grammar and simplify the
> > > dialect. For example, SqlNode can be extended to implement syntax
> > > analysis, and RelNode can be extended to implement logical plan
> > > mapping.
> > >
> > > TinkerPop is an adaptation framework for various graph databases; it
> > > is where Gremlin syntax was first introduced, and it has since
> > > become a universal query layer for graph databases. While extending
> > > query statements through Calcite's adapter interface,
> > > we will also use TinkerPop's universal graph database API to provide
> > > dialect compatibility for different graph databases.
> > >
> > > Give a simple example:
> > > From
> > >
> > > SELECT "key" FROM inttype
> > >
> > > it maps to
> > >
> > > g.V().hasLabel("inttype").group().unfold().select(Column.values)
> > >   .order().by(__.unfold().id()).project("inttype")
> > >   .by(__.project("key").by(__.unfold().choose(__.has("key"),
> > >     __.values("key"), __.constant("$%#NULL#%$"))))
> > >
> > >
> > >
> > >
> > >
> > > The design is divided into three layers:
> > >
> > > the parsing syntax layer, the relational algebra transformation
> > > layer, and the logical plan binding layer.
> > >
> > > Parsing syntax layer: in the query-parsing phase, fields and
> > > equality conditions are split out and converted into vertices and
> > > edges.
> > >
> > > Relational algebra layer: splits the query into collections of
> > > vertices and edges and converts them into a TableScan during the
> > > aggregation/sorting/query stages that Calcite abstracts.
> > > This makes it convenient to generate query plans from conditions and
> > > field information.
> > > Join scanning, single-table filtering, and other methods can be used
> > > to traverse from any edge or any starting vertex in the graph.
> > >
> > > Logical plan binding layer: binds behaviors such as join
> > > scanning/single-table filtering/projection to Calcite's planner to
> > > generate query plans.
> > >
> > > The process of generating a Gremlin logical plan from a select
> > > statement:
> > >
> > > 1. First of all, every graph database builds the graph from a source
> > > point. We will use the GraphTraversalSource provided by TinkerPop
> > > as the origin and extract the vertex and edge information from the
> > > incoming syntax. This step is implemented in SqlSchemaGrabber.
> > > 2. Secondly, for select/where/having/order by/group by, our plan for
> > > the parsing phase is as follows:
> > >
> > >   - group by: a vertex has out-degree and in-degree attributes in
> > > the graph. From the perspective of the data table, this is
> > > equivalent to aggregating the table data along the two dimensions
> > > IN and OUT. This behavior also corresponds to the table-traversal
> > > graph operation; during graph traversal, fold/unfold tags are
> > > generated, representing the direction of the traversal.
> > >   - select: a full table scan can be regarded as projecting all
> > > columns into vertex attributes. The value changes of each column
> > > are mapped to the Gremlin operation of adding vertices.
> > >   - where: in graph-computing semantics, the filter operation can be
> > > regarded as filtering the edges connected to a vertex's out-degree
> > > and in-degree, so it does not involve relational algebra
> > > conversion;
> > >   instead, it is pushed directly into the logical plan.
> > >   - order by: during graph traversal, as mentioned, fold/unfold tags
> > > are generated on the path to represent the forward/backward
> > > direction.
> > >   If a field is encountered with no sortable value, we fall back to
> > > the GraphTraversalSource origin and end the sorting operation.
> > >   If there are sortable values, they are unified in
> > > SqlTraversalEngine: in-degree and out-degree are counted separately
> > > for aggregation and then combined with group by according to the
> > > label (IN/OUT).
> > >   - having: same meaning as group by, but with a different label (in
> > > addition to the IN/OUT columns, specific vertex fields need to be
> > > specified).
> > >
> > >  Currently, I have only completed unit tests that translate from
> > > SQL to a Gremlin execution plan; test cases for group by and where
> > > are still to be added. In addition, I will use mainstream graph
> > > databases such as Neo4j and JanusGraph as examples to write sound
> > > integration tests,
> > > ensuring that the graph database's API is successfully called after
> > > converting the SQL request into the Gremlin execution plan.
> > >
> > > Finally, community members are welcome to give suggestions.
> > >
>
>
>
> --
>
> Best,
> Benchao Li
>
