Cole-Greer commented on code in PR #3115: URL: https://github.com/apache/tinkerpop/pull/3115#discussion_r2073788046
########## docs/src/dev/future/proposal-scoping-5.asciidoc: ########## @@ -0,0 +1,189 @@ +//// +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +//// +image::apache-tinkerpop-logo.png[width=500,link="https://tinkerpop.apache.org"] + +*x.y.z - Proposal 5* + +== Lazy vs. Eager Evaluation in TP4 == + +=== Introduction === + +Gremlin comes with conventions and mechanisms to control the flow strategy for traversal processing: _lazy evaluation_ is conceptually a depth-first evaluation paradigm that follows as a natural result from the pull-based stacked iterator model (as implemented in the Apache TinkerPop OLTP engine), whereas _eager evaluation_ enforces a Gremlin step to process all its incoming traversers before passing any results to the subsequent step. + +In many cases, switching between a lazy vs. eager flow strategy merely affects the internal order in which the engine processes traversers, yet there is no observable difference for end users in the final query result. However, there exist quite a few common use cases where lazy vs. eager evaluation may cause observable differences in the query results. These scenarios include (1) queries with side effects — where side effect variables are written and read, and the order in which these variables are updated and accessed changes observed values in these variables, (2) cases where queries aim to visit and return results in a given order — particularly queries with `limit()` steps to achieve top-k behavior, and (3) certain classes of update queries where the order in which updates are being applied affects the final state of the database. + +To illustrate the difference between lazy and eager evaluation, consider the following simple query over the modern graph: + +[code] +---- +gremlin> g.V().hasLabel('person').groupCount('x').select('x') +---- + +If a lazy flow strategy is used, the observed `x` values are reported incrementally in the output: + +[code] +---- +==>[v[1]:1] +==>[v[1]:1,v[2]:1] +==>[v[1]:1,v[2]:1,v[4]:1] +==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1] +---- + +In contrast, an eager evaluation strategy would store the complete set of solutions in the side effect variable `x` before proceeding, in which case the output would change to the following: + +[code] +---- +==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1] +==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1] +==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1] +==>[v[1]:1,v[2]:1,v[4]:1,v[6]:1] +---- + +While there are select Gremlin steps that provide explicit control over lazy vs. eager flow — for instance, switching the `Scope.local` default to `Scope.global` in the https://tinkerpop.apache.org/docs/current/reference/#aggregate-step[the side effect version of the aggregate step] allows users to enforce eager evaluation — the https://tinkerpop.apache.org/gremlin.html[Apache TinkerPop documentation] is rather vague when it comes to providing guarantees regarding the flow strategy that is used in the general case (the https://tinkerpop.apache.org/docs/current/dev/provider/#gremlin-semantics[Gremlin Semantics] section currently does not talk about this distinction). On the other hand, the Apache TinkerPop Gremlin OLTP processor as the de facto reference implementation, leverages a pull-based execution engine that typically (though not always) results in a lazy evaluation semantics -- yet it is not clear whether the Gremlin language as such aims to impose a _strong guarantee_ tha t queries have to be evaluated lazily or whether the observed lazy evaluation in the TinkerPop OLTP processor is just an implementation artifact. From our perspective, it is important for Gremlin users — who often seek to run queries and workloads across different engines and appreciate the freedom to switch implementations — to have a concise answer on the design intent and be explicit about guarantees that Gremlin implementations do vs. do not have to provide in order to be considered compliant with the language spec. + +In fact, when looking at the specific question of lazy vs. eager flow guarantees, different Gremlin processors today come with different “degrees of compatibility” with the lazy execution behavior observed in the TinkerPop OLTP processor. The key reason for deviating from a rigorous lazy execution paradigm usually is performance: the problem with lazy evaluation is that it prescribes serial execution order, which in many cases complicates (or even prevents) common optimization techniques such as bulking, vectored execution, and parallelization. As a matter of fact, even the TinkerPop Gremlin OLAP graph processor breaks with the lazy evaluation paradigm that is implemented in the traditional OLTP processor in order to achieve efficient parallel execution. + +=== A unified control mechanism for lazy vs. eager evaluation === + +In this proposal we argue that guarantees for and control over lazy vs. eager evaluation order should be a well-defined aspect of the Gremlin language that has to strike the right balance between (a) imposing a minimal set of constraints by default, as to leave implementers the freedom to apply optimizations for the general cases (and account for the variety of approaches that Gremlin engines implement today) while (b) providing Gremlin users the freedom to specify and constrain flow control whenever they depend on it. With these goals in mind, our proposal is as follows. + +===== Proposal 1: By default, the Gremlin semantics shall NOT prescribe lazy vs. eager evaluation order ===== Review Comment: 100% agree. Gremlin tends to be overly prescriptive, and would benefit from explicitly allowing more implementation defined behaviour in certain places. It's even worse in this example as it's currently ambiguous if lazy evaluation order is actually a defined gremlin semantic, or merely an implementation detail of the reference implementation (I believe it's the latter but that needs to be made explicit). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
