This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 9e65010  flesh out Neo4j description
9e65010 is described below

commit 9e65010e3cf9a067d7e967846703197e5a4c009d
Author: Paul King <[email protected]>
AuthorDate: Thu Aug 29 19:46:58 2024 +1000

    flesh out Neo4j description
---
 site/src/site/blog/groovy-graph-databases.adoc | 153 +++++++++++++++++++------
 1 file changed, 115 insertions(+), 38 deletions(-)

diff --git a/site/src/site/blog/groovy-graph-databases.adoc b/site/src/site/blog/groovy-graph-databases.adoc
index 61bbd7a..7288a62 100644
--- a/site/src/site/blog/groovy-graph-databases.adoc
+++ b/site/src/site/blog/groovy-graph-databases.adoc
@@ -1,7 +1,7 @@
 = Using Graph Databases with Groovy
 Paul King
 :revdate: 2024-08-20T10:18:00+00:00
-:keywords: tugraph, tinkerpop, gremlin, neo4j, apache age, graph databases, orientdb, groovy
+:keywords: tugraph, tinkerpop, gremlin, neo4j, apache age, graph databases, apache hugegraph, orientdb, arcadedb, orientdb, groovy
 :draft: true
 :description: This post illustrates using graph databases with Groovy.
 
@@ -26,8 +26,9 @@ This allows you to work with numerous graph database implementations in a consis
 TinkerPop also provides its own graph engine implementation, called TinkerGraph, which is what
 we'll use initially.
 
-We'll look at the swims in the 2021 and 2024 Olympic finals as well as any preliminary swims
-where the Olympic record was broken.
+We'll look at the swims for the medalists and record breakers at the Tokyo 2021 and Paris 2024 Olympics
+in the women's 100m backstroke. For reference purposes, we'll also include the previous swim that
+set an olympic record.
 
 We'll start by creating a new in-memory graph database and
 create a helper object for traversing the graph:
@@ -53,8 +54,8 @@ by querying the properties of two nodes respectively:
 
 [source,groovy]
 ----
-var (name, country) = ['name', 'country'].collect { g.V(es).values(it)[0] }
-var (at, event, time) = ['at', 'event', 'time'].collect { g.V(swim1).values(it)[0] }
+var (name, country) = ['name', 'country'].collect { es.property(it).value() }
+var (at, event, time) = ['at', 'event', 'time'].collect { swim1.property(it).value() }
 println "$name from $country swam a time of $time in $event at the $at Olympics"
 ----
 
@@ -66,14 +67,14 @@ Emily Seebohm from 🇦🇺 swam a time of 58.23 in Heat 4 at the London 2012 Ol
 
 So far, we've just been using the Java API from TinkerPop.
 It also provides some additional syntactic sugar for Groovy.
-We can enable that with:
+We can enable the syntactic sugar with:
 
 [source,groovy]
 ----
 SugarLoader.load()
 ----
 
-Which lets us write the slightly shorter:
+Which then lets us write the slightly shorter:
 
 [source,groovy]
 ----
@@ -154,7 +155,9 @@ assert successInParis == ['🇺🇸', '🇦🇺'] as Set
 By way of explanation, we find all nodes with an outgoing `swam` edge
 pointing to a swim that was at the Paris 2024 olympics, i.e.
 all the swimmers from Paris 2024. We then find the set of countries
-represented.
+represented. We are using sets here to remove duplicates, and also
+we aren't imposing an ordering on the returned results so we compare
+sets on both sides.
 
 Similarly, we can find the olympic records set during heat swims:
 
@@ -170,7 +173,7 @@ Or, we can find the times of the records set during finals:
 
 [source,groovy]
 ----
-var recordTimesInFinals = g.V().has('event', 'Final').as('ev').out('supercedes')
+var recordTimesInFinals = g.V().has('event', 'Final').as('ev').out('supersedes')
     .select('ev').values('time').toSet()
 assert recordTimesInFinals == [57.47, 57.33] as Set
 ----
@@ -182,10 +185,10 @@ Making use of the Groovy syntactic sugar gives simpler versions:
 var successInParis = g.V.out('swam').has('at', 'Paris 2024').in.country.toSet
 assert successInParis == ['🇺🇸', '🇦🇺'] as Set
 
-var recordSetInHeat = g.V.hasLabel('swim').filter { it.event.startsWith('Heat') }.at.toSet
+var recordSetInHeat = g.V.hasLabel('Swim').filter { it.event.startsWith('Heat') }.at.toSet
 assert recordSetInHeat == ['London 2012', 'Tokyo 2021'] as Set
 
-var recordTimesInFinals = g.V.has('event', 'Final').as('ev').out('supercedes').select('ev').time.toSet
+var recordTimesInFinals = g.V.has('event', 'Final').as('ev').out('supersedes').select('ev').time.toSet
 assert recordTimesInFinals == [57.47, 57.33] as Set
 ----
 
@@ -196,7 +199,7 @@ at all the olympic records set in 2021 and 2024:
 [source,groovy]
 ----
 println "Olympic records after ${g.V(swim1).values('at', 'event').toList().join(' ')}: "
-println g.V(swim1).repeat(in('supercedes')).as('sw').emit()
+println g.V(swim1).repeat(in('supersedes')).as('sw').emit()
     .values('at').concat(' ')
     .concat(select('sw').values('event')).toList().join('\n')
 ----
@@ -205,7 +208,7 @@ Or after using the Groovy syntactic sugar, the query becomes:
 
 [source,groovy]
 ----
-println g.V(swim1).repeat(in('supercedes')).as('sw').emit
+println g.V(swim1).repeat(in('supersedes')).as('sw').emit
     .at.concat(' ').concat(select('sw').event).toList.join('\n')
 ----
 
@@ -222,6 +225,9 @@ Paris 2024 Final
 Paris 2024 Relay leg1
 ----
 
+As a side note, TinkerPop has a `GraphMLWriter` class which can write out our
+graph in _GraphML_, which is how the above image was created.
+
 == Neo4j
 
 Our next technology to examine is
@@ -230,13 +236,24 @@ database storing nodes and edges. Nodes and edges may have a label and propertie
 
 
image:https://dist.neo4j.com/wp-content/uploads/20230926084108/Logo_FullColor_RGB_TransBG.svg[neo4j logo,50%]
 
+Neo4j models edge relationships using enums. Let's create an enum for our example:
+
+[source,groovy]
+----
+enum SwimmingRelationships implements RelationshipType {
+    swam, supersedes, runnerup
+}
+----
+
+Let's create our nodes and edges using Neo4j. First the existing Olympic record:
+
 [source,groovy]
 ----
-es = tx.createNode(label('swimmer'))
+es = tx.createNode(label('Swimmer'))
 es.setProperty('name', 'Emily Seebohm')
 es.setProperty('country', '🇦🇺')
 
-swim1 = tx.createNode(label('swim'))
+swim1 = tx.createNode(label('Swim'))
 swim1.setProperty('event', 'Heat 4')
 swim1.setProperty('at', 'London 2012')
 swim1.setProperty('result', 'First')
@@ -251,6 +268,9 @@ var time = swim1.getProperty('time')
 println "$name from $country swam a time of $time in $event at the $at Olympics"
 ----
 
+While there is nothing wrong with this code, Groovy has many features for making code more succinct.
+Let's use some dynamic metaprogramming to achieve just that.
+
 [source,groovy]
 ----
 Node.metaClass {
@@ -262,6 +282,9 @@ Node.metaClass {
 }
 ----
 
+Now we use normal Groovy property access for setting the node properties. It looks much cleaner.
+We define an edge relationship simply by calling a method having the relationship name.
+
 [source,groovy]
 ----
 km = tx.createNode(label('swimmer'))
@@ -284,25 +307,24 @@ swim3.at = 'Tokyo 2021'
 km.swam(swim3)
 ----
 
-[source,groovy]
-----
-static insertSwimmer(Transaction tx, name, country) {
-    var sr = tx.createNode(label('swimmer'))
-    sr.setProperty('name', name)
-    sr.setProperty('country', country)
-    sr
-}
+The code is certainly a lot cleaner, and it was quite a minimal amount of work to define the necessary
+metaprogramming. With a little bit more work, we could use static metaprogramming techniques.
+This would give us better IDE completion.
 
-static insertSwim(Transaction tx, at, event, time, result, swimmer) {
-    var sm = tx.createNode(label('swim'))
-    sm.setProperty('result', result)
-    sm.setProperty('event', event)
-    sm.setProperty('at', at)
-    sm.setProperty('time', time)
-    swimmer.createRelationshipTo(sm, swam)
-    sm
-}
-----
+Another interesting topic which we won't elaborate here is stronger type checking for graphs.
+For graph libraries which support schemas, the types for node and edge properties can be defined,
+as can the allowable nodes applicable to any edge relationship. For such systems, if you try to
+define a poorly-typed property, or incorrectly use a relationship, you will receive a runtime error.
+Groovy lets us take things further, if we want, and if we are willing to do a little more work.
+For example, if the schema is available at compile time, we could write a type checking extension
+which would fail compilation if any invalid edge or vertex definitions were detected.
+
+For now though, let's continue with defining the rest of our graph.
+We can redefine our `insertSwimmer` and `insertSwim` methods using Neo4j implementation
+calls, and then our earlier code could be used to create our graph. Now let's
+investigate what the queries look like.
+
+First, the successful countries in Paris 2024:
 
 [source,groovy]
 ----
@@ -313,24 +335,39 @@ var successInParis = swimmers.findAll { swimmer ->
     }
 }
 assert successInParis*.country.unique() == ['🇺🇸', '🇦🇺']
+----
 
+Then, at which olympics were records broken in heats:
+
+[source,groovy]
+----
 var swims = [swim1, swim2, swim3, swim4, swim5, swim6, swim7, swim8, swim9, swim10, swim11, swim12]
 var recordSetInHeat = swims.findAll { swim ->
     swim.event.startsWith('Heat')
 }*.at
 assert recordSetInHeat.unique() == ['London 2012', 'Tokyo 2021']
+----
+
+Now, what were the times for records broken in finals:
 
+[source,groovy]
+----
 var recordTimesInFinals = swims.findAll { swim ->
     swim.event == 'Final' && swim.hasRelationship(supercedes)
 }*.time
 assert recordTimesInFinals == [57.47d, 57.33d]
+----
 
+To see traversal in action, Neo4j has a special API for doing such queries:
+
+[source,groovy]
+----
 var info = { s -> "$s.at $s.event" }
 println "Olympic records following ${info(swim1)}:"
 
 for (Path p in tx.traversalDescription()
     .breadthFirst()
-    .relationships(supercedes)
+    .relationships(supersedes)
     .evaluator(Evaluators.fromDepth(1))
     .uniqueness(Uniqueness.NONE)
     .traverse(swim1)) {
@@ -338,27 +375,48 @@ for (Path p in tx.traversalDescription()
 }
 ----
 
+Earlier versions of Neo4j also supported Gremlin, so we could have written our queries in
+the same was as we did for TinkerPop. That technology is deprecated for Neo4j, and instead
+they now offer a Cypher query language. We can use that language for all of our previous queries
+as shown here:
+
 [source,groovy]
 ----
 assert tx.execute('''
-MATCH (s:swim WHERE s.event STARTS WITH 'Heat')
+MATCH (s:Swim WHERE s.event STARTS WITH 'Heat')
 WITH s.at as at
 WITH DISTINCT at
 RETURN at
 ''')*.at == ['London 2012', 'Tokyo 2021']
 
 assert tx.execute('''
-MATCH (s1:swim {event: 'Final'})-[:supercedes]->(s2:swim)
+MATCH (s1:Swim {event: 'Final'})-[:supersedes]->(s2:Swim)
 RETURN s1.time AS time
 ''')*.time == [57.47d, 57.33d]
 
 tx.execute('''
-MATCH (s1:swim)-[:supercedes]->{1,}(s2:swim { at: $at })
+MATCH (s1:Swim)-[:supersedes]->{1,}(s2:Swim { at: $at })
 RETURN s1
 ''', [at: swim1.at])*.s1.each { s ->
     println "$s.at $s.event"
 }
+----
+
+=== An aside on graph design
 
+Deciding which information should be stored as node properties and which as relationships
+still requires developer judgement. For example, we could have added a Boolean `olympicRecord`
+property to our `Swim` nodes. Certain queries might now become simpler, or at least more familiar
+to traditional RDBMS SQL developers, but other queries might become much harder to write
+and potentially much less efficient.
+This is the kind of thing which needs to be thought through and sometimes experimented with.
+
+Suppose, in the case where a record is broken, we wanted to see which other swimmers
+(in our case medallists in the final) also broke the previous record.
+We could write a query to find this as follows:
+
+[source,groovy]
+----
 assert tx.execute('''
 MATCH (sr1:swimmer)-[:swam]->(sm1:swim {event: 'Final'}), (sm2:swim {event: 'Final'})-[:supercedes]->(sm3:swim)
 WHERE sm1.at = sm2.at AND sm1 <> sm2 AND sm1.time < sm3.time
@@ -366,6 +424,9 @@ RETURN sr1.name as name
 ''')*.name == ['Kylie Masse']
 ----
 
+It's not too bad, but if we had a much larger graph of data, it could be quite slow.
+We could instead opt to use an additional relationship, called `runnerup` in our graph.
+
 [source,groovy]
 ----
 swim6.runnerup(swim3)
@@ -374,8 +435,14 @@ swim12.runnerup(swim7)
 swim7.runnerup(swim11)
 ----
 
+The visualization is something like this:
+
 image:img/BackstrokeRecordsRunnerup.png[Additional runnerup relationship,60%]
 
+It essentially makes it easier to find the other medalists if we know any one of them.
+
+The resulting query becomes this:
+
 [source,groovy]
 ----
 assert tx.execute('''
@@ -385,6 +452,9 @@ RETURN sr1.name as name
 ''')*.name == ['Kylie Masse']
 ----
 
+The _MATCH_ clause is similar in complexity, the _WHERE_ clause is much simpler.
+The query is probably faster too, but it is a tradeoff that should be weighed up.
+
 == Apache AGE
 
 The next technology is the https://age.apache.org/[Apache AGE™] graph database.
@@ -521,7 +591,8 @@ user interface for visualization of graph data stored in our database.
 Instructions for installation are available on the
 https://github.com/apache/age-viewer[GitHub site].
 The tool allows visualization of the results from any query.
-For our database, a query returning all nodes and edges looks like this:
+For our database, a query returning all nodes and edges creates
+a visualization like below (we chose to manually re-arrange the nodes):
 
 image:img/age-viewer.png[]
 
@@ -544,3 +615,9 @@ image:img/ArcadeStudio.png[ArcadeStudio]
 [source,groovy]
 ----
 ----
+
+== HugeGraph
+
+[source,groovy]
+----
+----
