Github user jessehatfield commented on a diff in the pull request:
https://github.com/apache/incubator-rya/pull/255#discussion_r160554313
--- Diff:
dao/mongodb.rya/src/main/java/org/apache/rya/mongodb/aggregation/SparqlToPipelineTransformVisitor.java
---
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.mongodb.aggregation;
+
+import java.util.Arrays;
+
+import org.apache.rya.mongodb.StatefulMongoDBRdfConfiguration;
+import org.bson.Document;
+import org.openrdf.query.algebra.Distinct;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.Filter;
+import org.openrdf.query.algebra.Join;
+import org.openrdf.query.algebra.MultiProjection;
+import org.openrdf.query.algebra.Projection;
+import org.openrdf.query.algebra.Reduced;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.helpers.QueryModelVisitorBase;
+
+import com.google.common.base.Preconditions;
+import com.mongodb.MongoClient;
+import com.mongodb.client.MongoCollection;
+import com.mongodb.client.MongoDatabase;
+
+/**
+ * Visitor that transforms a SPARQL query tree by replacing as much of the tree
+ * as possible with one or more {@code AggregationPipelineQueryNode}s.
+ * <p>
+ * Each {@link AggregationPipelineQueryNode} contains a MongoDB aggregation
+ * pipeline which is equivalent to the replaced portion of the original query.
+ * Evaluating this node executes the pipeline and converts the results into
+ * query solutions. If only part of the query was transformed, the remaining
+ * query logic (higher up in the query tree) can be applied to those
+ * intermediate solutions as normal.
+ * <p>
+ * In general, processes the tree in bottom-up order: A leaf node
+ * ({@link StatementPattern}) is replaced with a pipeline that matches the
+ * corresponding statements. Then, if the parent node's semantics are supported
+ * by the visitor, stages are appended to the pipeline and the subtree at the
+ * parent node is replaced with the extended pipeline. This continues up the
+ * tree until reaching a node that cannot be transformed, in which case that
+ * node's child is now a single {@code AggregationPipelineQueryNode} (a leaf
+ * node) instead of the previous subtree, or until the entire tree has been
+ * subsumed into a single pipeline node.
+ * <p>
+ * Nodes which are transformed into pipeline stages:
+ * <p><ul>
+ * <li>A {@code StatementPattern} node forms the beginning of each pipeline.
+ * <li>Single-argument operations {@link Projection}, {@link MultiProjection},
+ * {@link Extension}, {@link Distinct}, and {@link Reduced} will be transformed
+ * into pipeline stages whenever the child {@link TupleExpr} represents a
+ * pipeline.
+ * <li>A {@link Filter} operation will be appended to the pipeline when its
+ * child {@code TupleExpr} represents a pipeline and the filter condition is a
+ * type of {@link ValueExpr} understood by {@code AggregationPipelineQueryNode}.
+ * <li>A {@link Join} operation will be appended to the pipeline when one child
+ * is a {@code StatementPattern} and the other is an
+ * {@code AggregationPipelineQueryNode}.
+ * </ul>
+ */
+public class SparqlToPipelineTransformVisitor extends QueryModelVisitorBase<Exception> {
--- End diff --
It doesn't strictly have to be executed first, though it may be better. For
example, if the tree is `Join(<somethingComplex>,
AggregationPipelineQueryNode)`, then the join iterator will get an iterator of
results from the complex thing on the left and then, for each result, execute
the pipeline -- likely not optimal.
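To make the cost concrete, here is a minimal sketch of that evaluation pattern. It does not use the Rya or OpenRDF classes: `NestedJoinCostSketch`, `runPipeline`, and `join` are hypothetical stand-ins, and the "pipeline" just echoes its input. The point is only that the pipeline runs once per left-hand result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Rya.
final class NestedJoinCostSketch {
    /** Number of times the stand-in pipeline has been executed. */
    static int pipelineExecutions = 0;

    /** Stand-in for running the aggregation pipeline with one left-hand binding. */
    static List<String> runPipeline(String leftBinding) {
        pipelineExecutions++;
        List<String> results = new ArrayList<>();
        results.add(leftBinding + "+match");
        return results;
    }

    /** Join evaluation as described above: one pipeline execution per left result. */
    static List<String> join(List<String> leftResults) {
        List<String> joined = new ArrayList<>();
        for (String left : leftResults) {
            joined.addAll(runPipeline(left));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> joined = join(Arrays.asList("a", "b", "c"));
        System.out.println(joined.size() + " results, "
                + pipelineExecutions + " pipeline executions");
    }
}
```

With N results on the left, the pipeline is executed N times, whereas running the pipeline first would execute it once.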
The only logic that tries to group pipeline-amenable nodes/subqueries together
is in this visitor; there's no restructuring done at a higher level to make it
work better (except to the extent that ordinary query planning steps may happen
to help). For example, this visitor can turn `Join(Join(Join(SP1, SP2), SP3),
SP4)` into a single `AggregationPipelineQueryNode`, but it can only turn
`Join(Join(SP1, SP2), Join(SP3, SP4))` into
`Join(AggregationPipelineQueryNode1, AggregationPipelineQueryNode2)`. Ideally
the query would take the former form, and in this case I believe it typically
does. But there's room for development and testing to determine which query
forms this optimization is good for, and how it should interact with the rest
of query planning. Uncertainty here is the main reason I left this optimization
disabled by default.
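
The difference between the two join shapes can be sketched with a toy tree rewrite. This is an illustration under stated assumptions, not the visitor's code: `SP`, `Join`, `Pipeline`, and `fold` are hypothetical stand-ins for the real query model, and the only rule modeled is the Javadoc's "a join is absorbed when one child is a statement pattern and the other is a pipeline node".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; these are NOT the Rya or OpenRDF classes.
final class PipelineFoldingSketch {
    interface Node {}

    /** Leaf: a single statement pattern, identified by a label. */
    static final class SP implements Node {
        final String label;
        SP(String label) { this.label = label; }
        @Override public String toString() { return label; }
    }

    /** Binary join of two subtrees. */
    static final class Join implements Node {
        final Node left;
        final Node right;
        Join(Node left, Node right) { this.left = left; this.right = right; }
        @Override public String toString() { return "Join(" + left + ", " + right + ")"; }
    }

    /** Stand-in for a pipeline node: the statement patterns absorbed so far. */
    static final class Pipeline implements Node {
        final List<String> patterns = new ArrayList<>();
        @Override public String toString() { return "Pipeline" + patterns; }
    }

    /** Bottom-up rewrite: absorb a join only when it pairs a pipeline with a raw SP. */
    static Node fold(Node node) {
        if (node instanceof SP) {
            Pipeline p = new Pipeline();
            p.patterns.add(((SP) node).label);
            return p;
        }
        if (node instanceof Join) {
            Join join = (Join) node;
            if (join.right instanceof SP) {
                Node left = fold(join.left);
                if (left instanceof Pipeline) {
                    ((Pipeline) left).patterns.add(((SP) join.right).label);
                    return left;
                }
                return new Join(left, fold(join.right));
            }
            if (join.left instanceof SP) {
                Node right = fold(join.right);
                if (right instanceof Pipeline) {
                    ((Pipeline) right).patterns.add(((SP) join.left).label);
                    return right;
                }
                return new Join(fold(join.left), right);
            }
            // Neither child is a raw statement pattern: fold each side separately.
            return new Join(fold(join.left), fold(join.right));
        }
        return node;
    }

    public static void main(String[] args) {
        Node leftDeep = new Join(new Join(new Join(new SP("SP1"), new SP("SP2")),
                new SP("SP3")), new SP("SP4"));
        Node bushy = new Join(new Join(new SP("SP1"), new SP("SP2")),
                new Join(new SP("SP3"), new SP("SP4")));
        System.out.println(fold(leftDeep)); // Pipeline[SP1, SP2, SP3, SP4]
        System.out.println(fold(bushy));    // Join(Pipeline[SP1, SP2], Pipeline[SP3, SP4])
    }
}
```

In the bushy case, neither child of the top-level join is a raw statement pattern, so the rule never fires at the root and two separate pipeline nodes remain.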
---