Re: [PR] Scp5 [tinkerpop]

via GitHub Mon, 05 May 2025 09:49:12 -0700


Cole-Greer commented on code in PR #3115:
URL: https://github.com/apache/tinkerpop/pull/3115#discussion_r2073788046



##########
docs/src/dev/future/proposal-scoping-5.asciidoc:
##########
@@ -0,0 +1,189 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+image::apache-tinkerpop-logo.png[width=500,link="https://tinkerpop.apache.org";]
+
+*x.y.z - Proposal 5*
+
+== Lazy vs. Eager Evaluation in TP4 ==
+
+=== Introduction ===
+
+Gremlin comes with conventions and mechanisms to control the flow strategy for 
traversal processing: _lazy evaluation_ is conceptually a depth-first 
evaluation paradigm that follows as a natural result from the pull-based 
stacked iterator model (as implemented in the Apache TinkerPop OLTP engine), 
whereas _eager evaluation_ enforces a Gremlin step to process all its incoming 
traversers before passing any results to the subsequent step.
+
+In many cases, switching between a lazy vs. eager flow strategy merely affects 
the internal order in which the engine processes traversers, yet there is no 
observable difference for end users in the final query result. However, there 
exist quite a few common use cases where lazy vs. eager evaluation may cause 
observable differences in the query results. These scenarios include (1) 
queries with side effects — where side effect variables are written and read, 
and the order in which these variables are updated and accessed changes 
observed values in these variables, (2) cases where queries aim to visit and 
return results in a given order — particularly queries with `limit()` steps to 
achieve top-k behavior, and (3) certain classes of update queries where the 
order in which updates are being applied affects the final state of the 
database. 
+
+To illustrate the difference between lazy and eager evaluation, consider the 
following simple query over the modern graph:
+
+[code]
+----
+gremlin> g.V().hasLabel('person').groupCount('x').select('x')
+----
+
+If a lazy flow strategy is used, the observed `x` values are reported 
incrementally in the output:
+
+[code]
+----
+==>[v[1]:1]
+==>[v[1]:1,v[2]:1]
+==>[v[1]:1,v[2]:1,v[4]:1]
+==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1]
+----
+
+In contrast, an eager evaluation strategy would store the complete set of 
solutions in the side effect variable `x` before proceeding, in which case the 
output would change to the following:
+
+[code]
+----
+==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1]
+==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1]
+==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1]
+==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1]
+----
+
+While there are select Gremlin steps that provide explicit control over lazy 
vs. eager flow — for instance, switching the `Scope.local` default to 
`Scope.global` in the 
https://tinkerpop.apache.org/docs/current/reference/#aggregate-step[the side 
effect version of the aggregate step] allows users to enforce eager evaluation 
— the https://tinkerpop.apache.org/gremlin.html[Apache TinkerPop documentation] 
is rather vague when it comes to providing guarantees regarding the flow 
strategy that is used in the general case (the 
https://tinkerpop.apache.org/docs/current/dev/provider/#gremlin-semantics[Gremlin
 Semantics] section currently does not talk about this distinction). On the 
other hand, the Apache TinkerPop Gremlin OLTP processor as the de facto 
reference implementation, leverages a pull-based execution engine that 
typically (though not always) results in a lazy evaluation semantics -- yet it 
is not clear whether the Gremlin language as such aims to impose a _strong 
guarantee_ tha
 t queries have to be evaluated lazily or whether the observed lazy evaluation 
in the TinkerPop OLTP processor is just an implementation artifact. From our 
perspective, it is important for Gremlin users — who often seek to run queries 
and workloads across different engines and appreciate the freedom to switch 
implementations — to have a concise answer on the design intent and be explicit 
about guarantees that Gremlin implementations do vs. do not have to provide in 
order to be considered compliant with the language spec. 
+
+In fact, when looking at the specific question of lazy vs. eager flow 
guarantees, different Gremlin processors today come with different “degrees of 
compatibility” with the lazy execution behavior observed in the TinkerPop OLTP 
processor. The key reason for deviating from a rigorous lazy execution paradigm 
usually is performance: the problem with lazy evaluation is that it prescribes 
serial execution order, which in many cases complicates (or even prevents) 
common optimization techniques such as bulking, vectored execution, and 
parallelization. As a matter of fact, even the TinkerPop Gremlin OLAP graph 
processor breaks with the lazy evaluation paradigm that is implemented in the 
traditional OLTP processor in order to achieve efficient parallel execution. 
+
+=== A unified control mechanism for lazy vs. eager evaluation ===
+
+In this proposal we argue that guarantees for and control over lazy vs. eager 
evaluation order should be a well-defined aspect of the Gremlin language that 
has to strike the right balance between (a) imposing a minimal set of 
constraints by default, as to leave implementers the freedom to apply 
optimizations for the general cases (and account for the variety of approaches 
that Gremlin engines implement today) while (b) providing Gremlin users the 
freedom to specify and constrain flow control whenever they depend on it. With 
these goals in mind, our proposal is as follows.
+
+===== Proposal 1: By default, the Gremlin semantics shall NOT prescribe lazy 
vs. eager evaluation order =====

Review Comment:
   100% agree. Gremlin tends to be overly prescriptive, and would benefit from 
explicitly allowing more implementation defined behaviour in certain places. 
It's even worse in this example as it's currently ambiguous if lazy evaluation 
order is actually a defined gremlin semantic, or merely an implementation 
detail of the reference implementation (I believe it's the latter but that 
needs to be made explicit). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Scp5 [tinkerpop]

Reply via email to