Hello Mihai Budiu, I'm sorry for the late reply; the time zone difference
makes it hard to respond promptly. I am currently working on integrating
the code from my repository into the Calcite source tree, and there are
some dependency conflicts that need to be resolved.

Regarding your question about the unclear description of my contribution,
I am working hard to improve it. In short, I plan to implement an adapter
module (https://calcite.apache.org/docs/adapter.html) for Gremlin syntax.
Because Gremlin is a common query language for many graph databases, I
spent a lot of space in the description above introducing its background.
After that, I will open a pull request and reply to the mailing list with
the specific changes.

Mihai Budiu <mbu...@gmail.com> wrote on Wednesday, December 27, 2023 at 03:02:

> I cannot really figure out from this description what you plan to
> contribute.
> Can you write a one paragraph description, perhaps a list of bullet points?
> Something like: "parser for Gremlin dialect, converter from Gremlin IR to
> Calcite IR, etc."?
>
> Mihai
> ________________________________
> From: 柳尘 <yuluoxinsh...@gmail.com>
> Sent: Monday, December 25, 2023 3:57 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: Re: Feature: Support gremlin adapter
>
> Thank you very much for your reply. Before I open a new PR for this
> feature, I clearly need to write some test cases to illustrate it.
> Currently there are some test cases in the repository but no related
> integration tests. This work is in progress.
>
> As an example:
> As shown below, there is a representation of a vertex set and an edge set.
> You can see that an edge is represented by an out-vertex, an in-vertex,
> and a label.
>
> {
>   "tables": [
>     { "name": "company", "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "country", "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "planet", "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "person", "columns": [
>         {"name": "name", "type": "string"},
>         {"name": "age", "type": "integer"}
>     ] },
>     { "name": "spaceship", "columns": [
>         {"name": "name", "type": "string"},
>         {"name": "model", "type": "string"}
>     ] },
>     { "name": "satellite", "columns": [ {"name": "name", "type": "string"} ] },
>     { "name": "sensor", "columns": [
>         {"name": "name", "type": "string"},
>         {"name": "type", "type": "string"}
>     ] },
>     { "name": "sensorReading", "columns": [
>         {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
>         {"name": "dt", "type": "long_date", "propertyName": "date"},
>         {"name": "value", "type": "double"}
>     ] },
>     { "name": "fliesTo", "columns": [ {"name": "trips", "type": "integer"} ] },
>     { "name": "orbits", "columns": [ {"name": "launched", "type": "integer"} ] }
>   ],
>   "relationships": [
>     {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
>     {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
>     {"outTable": "person", "inTable": "planet", "edgeLabel": "travelledTo"},
>     {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
>     {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
>     {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
>     {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
>     {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
>     {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
>   ]
> }
>
> Please allow me to first introduce some basic Gremlin syntax:
>
> - V(): query vertices. Example: g.V() queries all vertices and
> g.V('v_id') queries a specific vertex.
> - E(): query edges; many kinds of steps can be chained after it.
> - id(): get the ids of vertices and edges. Example: g.V().id() queries
> the ids of all vertices.
> - label(): get the labels of vertices and edges. Example: g.V().label()
> queries the labels of all vertices.
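The basic steps above can be sketched against a tiny in-memory graph. This is an illustrative Python model of the semantics only, not the TinkerPop API; the vertex data is invented for the example:

```python
# Illustrative in-memory model of g.V(), g.V('v_id'), and label().
# Plain dicts stand in for the graph; this is NOT the TinkerPop API,
# and the vertex data is invented for the example.
vertices = {
    "v1": {"label": "person", "name": "alice"},
    "v2": {"label": "person", "name": "bob"},
    "v3": {"label": "company", "name": "acme"},
}

def V(vid=None):
    """g.V() -> ids of all vertices; g.V('v_id') -> just that vertex's id."""
    return list(vertices) if vid is None else [vid]

def labels(traversal):
    """g.V().label() -> the label of every vertex in the traversal."""
    return [vertices[v]["label"] for v in traversal]

print(V())          # ids of all vertices: ['v1', 'v2', 'v3']
print(V("v2"))      # a specific vertex: ['v2']
print(labels(V()))  # ['person', 'person', 'company']
```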
>
> Graph traversal operations are mainly divided into the following
> categories:
>
> 1. Based on the "vertex":
>
>     1) out(label): visit the vertex's OUT-direction adjacent vertices
> along the specified edge labels (zero edge labels means all edge types;
> one or more edge labels means edges with any of the given labels, the
> same below);
>     2) in(label): visit the vertex's IN-direction adjacent vertices
> along the specified edge labels;
>     3) both(label): visit the vertex's adjacent vertices in both
> directions along the specified edge labels;
>     4) outE(label): visit the vertex's OUT-direction incident edges with
> the specified edge labels;
>     5) inE(label): visit the vertex's IN-direction incident edges with
> the specified edge labels;
>     6) bothE(label): visit the vertex's incident edges in both
> directions with the specified edge labels;
>
>
> 2. Based on the "edge":
>
>     1) outV(): visit the edge's out-vertex, i.e. the starting vertex of
> the edge;
>     2) inV(): visit the edge's in-vertex, i.e. the target vertex of the
> edge, the vertex the arrow points to;
>     3) bothV(): visit both vertices of the edge;
>     4) otherV(): visit the edge's partner vertex, i.e. the vertex at the
> other end relative to the base vertex;
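The vertex-based and edge-based steps above can be sketched as follows, reusing edge labels from the schema example. This is an illustrative Python model of the semantics, not TinkerPop, and the vertex ids are invented:

```python
# Each edge is (out_vertex, label, in_vertex), matching the schema's
# relationships: an edge has an out-vertex, an in-vertex, and a label.
# Invented sample data; plain tuples stand in for TinkerPop edges.
edges = [
    ("person1", "worksFor", "company1"),
    ("person1", "pilots", "spaceship1"),
    ("company1", "owns", "spaceship1"),
]

def out(v, *lbls):
    """out(label): OUT-direction adjacent vertices (all labels if none)."""
    return [i for o, l, i in edges if o == v and (not lbls or l in lbls)]

def in_(v, *lbls):
    """in(label): IN-direction adjacent vertices."""
    return [o for o, l, i in edges if i == v and (not lbls or l in lbls)]

def both(v, *lbls):
    """both(label): adjacent vertices in both directions."""
    return out(v, *lbls) + in_(v, *lbls)

def outE(v, *lbls):
    """outE(label): the OUT-direction incident edges themselves."""
    return [e for e in edges if e[0] == v and (not lbls or e[1] in lbls)]

def outV(e):
    """outV(): the starting vertex of the edge."""
    return e[0]

def inV(e):
    """inV(): the target vertex of the edge (the one the arrow points to)."""
    return e[2]

def otherV(e, base):
    """otherV(): the vertex at the other end relative to the base vertex."""
    return e[2] if e[0] == base else e[0]

print(out("person1"))               # ['company1', 'spaceship1']
print(in_("spaceship1", "owns"))    # ['company1']
print(otherV(edges[1], "person1"))  # spaceship1
```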
>
>
> Let us discuss vertices and edges separately.
>
> If the user chooses to traverse based on vertices, they need to query
> and aggregate by edge label. In the above example, if the user wants to
> count both the in-degree and the out-degree of company, then at the SQL
> layer select count(1) from company yields the desired result
> (GremlinSqlSelectSingle abstracts this by rewriting graph traversal
> operations for SQL operators such as group by/select, order by, and
> having/where to ensure an execution plan is generated). If the user
> wants to count only the out-degree or only the in-degree, a join is
> required: for example, counting the out-degree from the company table to
> the country table is expressed in SQL as select count(1) from company
> left join country on company.id = country.id. This corresponds to
> GremlinSqlSelectMulti in the code. It abstracts an assembly class for
> the vertex table (GremlinVertexTable) containing two sets, inEdges and
> outEdges, pre-traverses the out-degree and in-degree edges matching the
> label, and caches them in sqlMetadata. When the user queries, the edge
> set meeting the conditions is taken from sqlMetadata and the execution
> plan is pushed down.
>
> If the user chooses to traverse based on edges, they need to aggregate
> using an attribute of the company vertex table as the dimension, under
> the condition that the query is keyed by edge label; that is, the name
> field in the columns collection. For example, select count(1) from
> company group by name counts both in-degree and out-degree.
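The two counting styles above can be sketched in a few lines. This is an illustrative Python model with invented sample rows, not the generated execution plan:

```python
from collections import Counter

# Invented (out_vertex, edge_label, in_vertex) rows touching two
# hypothetical company vertices, "acme" and "globex".
edges = [
    ("acme", "baseIn", "france"),    # acme -> country: out-degree for acme
    ("alice", "worksFor", "acme"),   # person -> acme: in-degree for acme
    ("globex", "owns", "ship1"),     # globex -> spaceship: out-degree
]
companies = {"acme", "globex"}

# SELECT count(1) FROM company GROUP BY name, counting both in-degree
# and out-degree per company name:
degree = Counter()
for out_v, _label, in_v in edges:
    if out_v in companies:
        degree[out_v] += 1
    if in_v in companies:
        degree[in_v] += 1

print(dict(degree))  # {'acme': 2, 'globex': 1}
```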
>
>
> This made me realize that I need to write more complete documentation.
> Friends in the community are welcome to ask questions and add to
> anything above.
>
> Best wishes to you
>
> Benchao Li <libenc...@apache.org> wrote on Monday, December 25, 2023 at 22:48:
>
> > This sounds very interesting to me, although I'm not familiar with
> > graph databases.
> >
> > I'm more interested in how to represent graph data structure and graph
> > query in SQL, is there any relational database/SQL query engine that
> > has done this before? Do we need to add some special data
> > types/operators for graph processing?
> >
> > Forward Xu <forwardxu...@gmail.com> wrote on Monday, December 25, 2023 at 09:51:
> > >
> > > hi
> > >
> > > This is a great feature to extend Calcite from regular data queries
> > > to graph queries (computations).
> > > +1, looking forward to it.
> > >
> > > forwardxu
> > >
> > > 柳尘 <yuluoxinsh...@gmail.com> wrote on Sunday, December 24, 2023 at 11:20:
> > >
> > > > Motivation
> > > >
> > > > Hi, community. Currently, more and more users are using graph
> > > > databases such as JanusGraph, HugeGraph, etc. to represent
> > > > relationships in social networks, for example to count the activity
> > > > of each user in the network. Most graph databases implement both
> > > > the graph-building and graph-traversal stages using Gremlin syntax.
> > > > However, this is very unfriendly to users who are not familiar with
> > > > Gremlin. Calcite, as a query framework, provides an adapter
> > > > interface to adapt to different database dialects, covering
> > > > parsing, relational algebra conversion, and query plan binding.
> > > > Our company has solved the problem of adapting various graph
> > > > databases. This is my repository:
> > > > https://github.com/kaori-seasons/calcite-gremlin-sql
> > > >
> > > >
> > > > Background
> > > >
> > > > Calcite itself supports extending to new database languages through
> > > > adapters, which lets users express the meaning of a dialect and
> > > > simplify it: for example, extend SqlNode to implement syntax
> > > > analysis, and extend RelNode to implement logical plan mapping.
> > > >
> > > > TinkerPop is an adaptation framework for various graph databases,
> > > > and Gremlin syntax was first introduced in this framework; it has
> > > > now become a universal query layer for graph databases. While
> > > > extending query statements through Calcite's adapter interface, we
> > > > will also use TinkerPop's universal graph database API to provide
> > > > dialect compatibility for different graph databases.
> > > >
> > > > To give a simple example:
> > > >
> > > > SELECT "key" FROM inttype
> > > >
> > > > maps to
> > > >
> > > > g.V().hasLabel("inttype").group().unfold()
> > > >   .select(Column.values).order().by(__.unfold().id())
> > > >   .project("inttype")
> > > >   .by(__.project("key")
> > > >     .by(__.unfold().choose(__.has("key"), __.values("key"),
> > > >       __.constant("$%#NULL#%$"))))
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > The design is divided into three layers: a parsing layer, a
> > > > relational algebra conversion layer, and a logical plan binding
> > > > layer.
> > > >
> > > > Parsing layer: in the query-parsing phase, fields and equality
> > > > conditions are split and converted into vertices and edges.
> > > >
> > > > Relational algebra layer: splits the query into collections of
> > > > vertices and edges and converts them into a TableScan in the
> > > > aggregation/sorting/query stages that Calcite abstracts, which
> > > > makes it convenient to generate query plans from the conditions and
> > > > field information. Join scans, single-table filters, and other
> > > > methods can be used to traverse the graph from any edge or any
> > > > starting vertex.
> > > >
> > > > Logical plan binding layer: binds behaviors such as join scans,
> > > > single-table filters, and projections to Calcite's planner to
> > > > generate query plans.
> > > >
> > > > The process of generating a Gremlin logical plan from a SELECT
> > > > statement:
> > > >
> > > > 1. First, all graph databases build the graph starting from a
> > > > source. We use the GraphTraversalSource provided by TinkerPop as
> > > > the origin and extract the incoming vertex and edge information
> > > > from the syntax. This step is implemented in SqlSchemaGrabber.
> > > > 2. Second, for select/where/having/order by/group by, our plan in
> > > > the parsing phase is as follows:
> > > >
> > > >   - group by: a vertex has out-degree and in-degree attributes in
> > > > the graph. From the data table's perspective, this is equivalent to
> > > > aggregating the table data along the two dimensions IN and OUT.
> > > > This behavior also corresponds to the table-traversal operation on
> > > > the graph: during traversal, fold/unfold tags are generated,
> > > > representing the direction of graph traversal.
> > > >   - select: scanning the entire table can be regarded as projecting
> > > > all columns into vertex attributes. The value changes of each
> > > > column are mapped to the Gremlin operation of adding vertices.
> > > >   - where: in graph-computing semantics, the filter operation can
> > > > be regarded as filtering the edges connected by the vertex's
> > > > out-degree and in-degree, so it does not involve relational algebra
> > > > conversion; instead, it is pushed directly down to the logical
> > > > plan.
> > > >   - order by: during graph traversal, fold/unfold marks are
> > > > generated on the path to represent the forward/backward direction.
> > > > If a field has no sortable value, we fall back to the
> > > > GraphTraversalSource origin and end the sorting operation. If there
> > > > are sortable values, they are unified in the SqlTraversalEngine:
> > > > in-degree and out-degree are counted separately for aggregation and
> > > > then combined with group by according to the label (IN/OUT).
> > > >   - having: same meaning as group by, but with a different label
> > > > (in addition to the IN/OUT columns, specific vertex fields need to
> > > > be specified).
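The group by mapping described in the bullets (aggregating along the IN/OUT dimensions) can be sketched like this, with invented sample data standing in for the traversal:

```python
from collections import Counter

# Each incident edge carries a direction tag relative to the base vertex,
# analogous to the fold/unfold marks generated during traversal.
# Invented sample: three edges incident to one vertex.
incident = [("worksFor", "IN"), ("owns", "OUT"), ("baseIn", "OUT")]

# Aggregate along the IN/OUT dimension, as the group by mapping describes:
by_direction = Counter(direction for _label, direction in incident)
print(dict(by_direction))  # {'IN': 1, 'OUT': 2}
```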
> > > >
> > > > Currently, I have only completed unit tests that translate from
> > > > SQL to a Gremlin execution plan; test cases for group by and where
> > > > still need to be added. In addition, I will write sound integration
> > > > tests using mainstream graph databases such as Neo4j and JanusGraph
> > > > as examples, ensuring that the graph database's API is successfully
> > > > called after the SQL request is converted into the Gremlin
> > > > execution plan.
> > > >
> > > > Finally, community members are welcome to give suggestions.
> > > >
> >
> >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>
