Hello Julian Hyde, I just saw your reply. I understand that the second approach you describe (a Gremlin front-end) would be related to dialect conversion. This proposal aims to provide operations over vertex tables and edge tables for users who are not comfortable writing Gremlin, for example creation, traversal, association, and aggregation. With the help of Calcite's syntax analysis and query-plan push-down, the user's SQL is converted into Gremlin API calls and handed to the specified graph database for execution. To sum up, I think this is a Gremlin adapter.
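To make the intended direction concrete, here is a rough, self-contained sketch (Python, purely illustrative; the real implementation is Java on top of Calcite's parser and planner, and the mappings below are simplified assumptions, not the repository's actual API):

```python
# Illustrative only: a toy mapping from two simple SQL shapes to Gremlin
# strings, to show the *kind* of translation the adapter performs. The real
# adapter goes through Calcite's parser and relational algebra, not string
# matching.
import re

def toy_sql_to_gremlin(sql):
    """Translate 'SELECT COUNT(1) FROM <table>' or 'SELECT <col> FROM <table>'."""
    count = re.fullmatch(r"SELECT COUNT\(1\) FROM (\w+)", sql, re.IGNORECASE)
    if count:
        # Counting rows of a vertex table = counting vertices with that label.
        return f"g.V().hasLabel('{count.group(1)}').count()"
    proj = re.fullmatch(r"SELECT (\w+) FROM (\w+)", sql, re.IGNORECASE)
    if proj:
        # Projecting a column = reading a vertex property.
        return f"g.V().hasLabel('{proj.group(2)}').values('{proj.group(1)}')"
    raise ValueError("unsupported in this sketch: " + sql)
```

For example, toy_sql_to_gremlin("SELECT name FROM person") produces g.V().hasLabel('person').values('name').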
Julian Hyde <[email protected]> 于2024年1月4日周四 05:16写道: > I had a similar question to Mihai. > > After your change, would Calcite be able to translate a user's SQL > query and execute it against an existing implementation of the Gremlin > API (say neo4j), or would users be able to talk to Calcite in the > Gremlin API and have Calcite convert that, via relational algebra, to > SQL (or whatever language the back-end speaks). > > The first of these would be called a 'Gremlin adapter', and the second > would be called a 'Gremlin front-end'. > > Julian > > On Tue, Dec 26, 2023 at 5:17 PM 柳尘 <[email protected]> wrote: > > > > Hello Mihai Budiu,I'm sorry for the late reply because the time zone > > relationship is inconsistent. I am recently trying to integrate the code > in > > the warehouse into the calcite source code project, and there are some > > dependency conflicts that need to be resolved. > > > > Regarding your question about the unclear description of my > contribution, I > > am working hard to improve it. Simply put, I plan to implement the > relevant > > adapter for adapting Gremlin syntax ( > > https://calcite.apache.org/docs/adapter.html) Module. Because Gremlin > is a > > common syntax for many graph databases, in the description of the above > > content, I spent a lot of space introducing the background of this > syntax. > > After that, I will open a pull request and reply to the mailing list for > > specific changes. > > > > Mihai Budiu <[email protected]> 于2023年12月27日周三 03:02写道: > > > > > I cannot really figure out from this description what you plan to > > > contribute. > > > Can you write a one paragraph description, perhaps a list of bullet > points? > > > Something like: "parser for Gremlin dialect, converter from Gremlin IR > to > > > Calcite IR, etc."? 
> > >
> > > Mihai
> > > ________________________________
> > > From: 柳尘 <[email protected]>
> > > Sent: Monday, December 25, 2023 3:57 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: Feature: Support gremlin adapter
> > >
> > > Thank you very much for your reply. Before I open a new PR for this requirement, I obviously need to write some test cases to illustrate it. Currently there are some test cases in the repository, but no related integration tests. This work is in progress.
> > >
> > > As an example, below is a representation of a vertex set and an edge set. You can see that an edge is represented by an out-degree node, an in-degree node, and a label.
> > >
> > > {
> > >   "tables": [
> > >     { "name": "company", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "country", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "planet", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "person", "columns": [ {"name": "name", "type": "string"}, {"name": "age", "type": "integer"} ] },
> > >     { "name": "spaceship", "columns": [ {"name": "name", "type": "string"}, {"name": "model", "type": "string"} ] },
> > >     { "name": "satellite", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "sensor", "columns": [ {"name": "name", "type": "string"}, {"name": "type", "type": "string"} ] },
> > >     { "name": "sensorReading", "columns": [ {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"}, {"name": "dt", "type": "long_date", "propertyName": "date"}, {"name": "value", "type": "double"} ] },
> > >     { "name": "fliesTo", "columns": [ {"name": "trips", "type": "integer"} ] },
> > >     { "name": "orbits", "columns": [ {"name": "launched", "type": "integer"} ] }
> > >   ],
> > >   "relationships": [
> > >     {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
> > >     {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
> > >     {"outTable": "person", "inTable": "planet", "edgeLabel": "travelledTo"},
> > >     {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
> > >     {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
> > >     {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
> > >     {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
> > >     {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
> > >     {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
> > >   ]
> > > }
> > >
> > > Please allow me to first introduce some basic Gremlin syntax concepts:
> > >
> > > - V(): query vertices. Example: g.V() queries all vertices, and g.V('v_id') queries a specific vertex.
> > > - E(): query edges; many kinds of steps can be chained after it.
> > > - id(): get the ids of vertices and edges. Example: g.V().id() queries the ids of all vertices.
> > > - label(): get the labels of vertices and edges. Example: g.V().label() queries the labels of all vertices.
> > >
> > > Graph traversal operations are mainly divided into the following:
> > > 1. Based on the "vertex":
> > >
> > > 1) out(label): access the vertex's OUT-direction adjacent vertices along the specified Edge Labels (zero Edge Labels means all edge types; one or more Edge Labels means edges with any of the given labels, same below);
> > > 2) in(label): access the vertex's IN-direction adjacent vertices along the specified Edge Labels;
> > > 3) both(label): access the vertex's adjacent vertices in both directions;
> > > 4) outE(label): access the vertex's OUT-direction adjacent edges;
> > > 5) inE(label): access the vertex's IN-direction adjacent edges;
> > > 6) bothE(label): access the vertex's adjacent edges in both directions.
> > >
> > > 2. Based on the "edge":
> > >
> > > 1) outV(): access the edge's outgoing vertex, i.e. the starting vertex of the edge;
> > > 2) inV(): access the edge's incoming vertex, i.e. the target vertex of the edge, the one the arrow points to;
> > > 3) bothV(): access both vertices of the edge;
> > > 4) otherV(): access the edge's partner vertex, i.e. the vertex at the other end relative to the base vertex.
> > >
> > > Let us discuss vertices and edges separately.
> > >
> > > If the user chooses to traverse based on vertices, they need to query and aggregate based on the Edge Label.
> > > In the above example, if the user needs to count the in-degree and out-degree (both directions) of company, the upper layer can use select count(1) from company to get the desired result (GremlinSqlSelectSingle abstracts this by rewriting graph-traversal operations for SQL operators such as GROUP BY/SELECT, ORDER BY and HAVING/WHERE, ensuring an execution plan can be generated). For those who want to count only out-degrees or only in-degrees, a join is needed. In other words, to count the out-degree from the company table to the country table, the SQL is select count(1) from company left join country on company.id = country.id. This corresponds to GremlinSqlSelectMulti in the code. It abstracts an assembly class for the vertex table (GremlinVertexTable) holding two sets, inEdges and outEdges, pre-traverses the out-edges and in-edges matching the label, and caches them in sqlMetadata. When the user queries, the matching edge set is taken from sqlMetadata and the execution plan is pushed down.
> > >
> > > If the user chooses to traverse based on edges, they need to aggregate using an attribute of the company vertex table as the dimension, under the condition that the query is based on the Edge Label; that is, the name field in the columns collection. You can write select count(1) from company group by name to count the in-degree and out-degree (both directions).
> > >
> > > This made me realize that I need to write a more complete document. Friends in the community are welcome to ask questions about the above and add to it.
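(Interjecting inline here: to illustrate the degree counting discussed above, this is a rough sketch in Python of the Gremlin such queries could map to. The choice of step per direction is the point; the helper itself is hypothetical and is not the repository's actual API.)

```python
# Hypothetical helper, for illustration only: build a Gremlin string that
# counts the adjacent edges (degree) of all vertices with a given label.
# `direction` selects the traversal step: outE/inE for one direction,
# bothE for both, mirroring the out-degree / in-degree discussion above.
def degree_count(table, direction="both", edge_label=None):
    step = {"out": "outE", "in": "inE", "both": "bothE"}[direction]
    args = f"'{edge_label}'" if edge_label else ""
    return f"g.V().hasLabel('{table}').{step}({args}).count()"
```

For example, degree_count("company") builds g.V().hasLabel('company').bothE().count(), while degree_count("company", "out", "baseIn") restricts the count to OUT-direction baseIn edges.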
> > >
> > > Best wishes to you
> > >
> > > Benchao Li <[email protected]> wrote on Mon, Dec 25, 2023, 22:48:
> > >
> > > > This sounds very interesting to me, although I'm not familiar with graph databases.
> > > >
> > > > I'm more interested in how to represent graph data structures and graph queries in SQL. Is there any relational database/SQL query engine that has done this before? Do we need to add some special data types/operators for graph processing?
> > > >
> > > > Forward Xu <[email protected]> wrote on Mon, Dec 25, 2023, 09:51:
> > > >
> > > > > hi
> > > > >
> > > > > This is a great feature to extend Calcite from regular data queries to graph queries (calculations),
> > > > > +1 for looking forward to it.
> > > > >
> > > > > forwardxu
> > > > >
> > > > > 柳尘 <[email protected]> wrote on Sun, Dec 24, 2023, 11:20:
> > > > >
> > > > > > Motivation
> > > > > >
> > > > > > Hi, community. Currently, more and more users are using graph databases such as JanusGraph and HugeGraph to represent relationships in personal social networks, for example to count the activity of each user in the social network. Most graph databases implement both the graph-building and graph-traversal stages using Gremlin syntax. However, this is very unfriendly to users who are not familiar with Gremlin. Calcite exists as a query framework and also provides an adapter interface to adapt to different database dialects, covering parsing, relational algebra conversion, and query-plan binding. Our company has solved the problem of adapting various graph databases.
> > > > > > This is my repository:
> > > > > > https://github.com/kaori-seasons/calcite-gremlin-sql
> > > > > >
> > > > > > Background
> > > > > >
> > > > > > Calcite supports database-language extension through adapters, which lets users keep grammar they understand while the dialect is simplified for them. For example, SqlNode can be extended to implement syntax analysis, and RelNode can be extended to implement logical-plan mapping.
> > > > > >
> > > > > > TinkerPop is an adaptation framework for various graph databases; Gremlin syntax was first introduced in this framework, and it has now become a universal query layer for graph databases. While extending query statements through Calcite's adapter interface, we will also use TinkerPop's universal graph database API to provide dialect compatibility for different graph databases.
> > > > > >
> > > > > > A simple example. From
> > > > > >
> > > > > > SELECT "key" FROM inttype
> > > > > >
> > > > > > maps to
> > > > > >
> > > > > > g.V().hasLabel("inttype").group().unfold().select(Column.values).order().by(__.unfold().id()).project("inttype").by(__.project("key").by(__.unfold().choose(__.has("key"), __.values("key"), __.constant("$%#NULL#%$"))))
> > > > > >
> > > > > > The design architecture is divided into three layers: the parsing syntax layer, the relational algebra transformation layer, and the logical plan binding layer.
> > > > > >
> > > > > > Parsing syntax layer: in the query-parsing phase, fields and equality conditions are split and converted into vertices and edges.
> > > > > >
> > > > > > Relational algebra layer: splits the query into collections of vertices and edges, and converts them into a TableScan during the aggregation/sorting/query stages that Calcite abstracts, making it convenient to generate query plans from conditions and field information. Join scans, single-table filters and other methods can be used to traverse from any edge or any starting vertex in the graph.
> > > > > >
> > > > > > Logical plan binding layer: binds behaviors such as join scanning, single-table filtering and projection to Calcite's planner to generate query plans.
> > > > > >
> > > > > > The process of generating a Gremlin logical plan from a SELECT statement:
> > > > > >
> > > > > > 1. First, all graph databases build the graph from a source. We use the GraphTraversalSource provided by TinkerPop as the origin and extract the vertex and edge information from the incoming syntax. This step is implemented in SqlSchemaGrabber.
> > > > > > 2. Second, for select/where/having/order by/group by, our plan in the parsing phase is as follows:
> > > > > >
> > > > > > - group by: a vertex has out-degree and in-degree attributes in the graph. From the data-table perspective, this is equivalent to aggregating the table data along the IN and OUT dimensions. This behavior also corresponds to the table-traversal graph operation; during graph traversal, fold/unfold tags are generated, representing the direction of graph traversal.
> > > > > > - select: a full table scan can be regarded as projecting all columns into vertex attributes. The value changes of each column are mapped to the Gremlin operation of adding vertices.
> > > > > > - where: in graph-computing semantics, the filter operation can be regarded as filtering the edges connected to a vertex's out-degree and in-degree, so it does not involve relational algebra conversion; instead, it is pushed directly into the logical plan.
> > > > > > - order by: during graph traversal, as mentioned above, fold/unfold tags are generated on the path to represent the forward/backward direction. If a field has no sortable value, we fall back to the GraphTraversalSource origin and end the sorting operation. If there are sortable values, they are unified in the SqlTraversalEngine: in-degree and out-degree are counted separately for aggregation and then used with group by according to the label (IN/OUT).
> > > > > > - having: same meaning as group by, but the label differs (in addition to the IN/OUT columns, specific vertex fields must be specified).
> > > > > >
> > > > > > Currently, I have only completed unit tests that translate SQL into a Gremlin execution plan; test cases for group by and where are still to be added. In addition, I will use mainstream graph databases such as neo4j and JanusGraph as examples to write sound integration tests.
> > > > > > These tests will ensure that the graph database's API is successfully called after the SQL request is converted into the Gremlin execution plan.
> > > > > >
> > > > > > Finally, community members are welcome to give suggestions.
> > > >
> > > > --
> > > > Best,
> > > > Benchao Li
