[ https://issues.apache.org/jira/browse/DRILL-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119211#comment-16119211 ]
ASF GitHub Bot commented on DRILL-5657:
---------------------------------------
Github user bitblender commented on a diff in the pull request:
https://github.com/apache/drill/pull/866#discussion_r131564509
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultVectorCache.java ---
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.rowSet.impl;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.vector.ValueVector;
+
+/**
+ * Manages an inventory of value vectors used across row batch readers.
+ * Drill semantics for batches is complex. Each operator logically returns
+ * a batch of records on each call of the Drill Volcano iterator protocol
+ * <tt>next()</tt> operation. However, the batches "returned" are not
+ * separate objects. Instead, Drill enforces the following semantics:
+ * <ul>
+ * <li>If a <tt>next()</tt> call returns <tt>OK</tt> then the set of vectors
+ * in the "returned" batch must be identical to those in the prior batch. Not
+ * just the same type; they must be the same <tt>ValueVector</tt> objects.
+ * (The buffers within the vectors will be different.)</li>
+ * <li>If the set of vectors changes in any way (add a vector, remove a
+ * vector, change the type of a vector), then the <tt>next()</tt> call
+ * <b>must</b> return <tt>OK_NEW_SCHEMA</tt>.</li>
+ * </ul>
+ * These rules create interesting constraints for the scan operator.
+ * Conceptually, each batch is distinct, but it must share vectors. The
+ * {@link ResultSetLoader} class handles this by managing the set of vectors
+ * used by a single reader.
+ * <p>
+ * Readers are independent: each may read a distinct schema (as in JSON).
+ * Yet, the Drill protocol requires minimizing spurious <tt>OK_NEW_SCHEMA</tt>
+ * events. As a result, two readers run by the same scan operator must
+ * share the same set of vectors, despite the fact that they may have
+ * different schemas and thus different <tt>ResultSetLoader</tt>s.
+ * <p>
+ * The purpose of this inventory is to persist vectors across readers, even
+ * when, say, reader B does not use a vector that reader A created.
+ * <p>
+ * The semantics supported by this class include:
+ * <ul>
+ * <li>Ability to "pre-declare" columns based on columns that appear in
+ * an explicit select list. This ensures that the columns are known (but
+ * not their types).</li>
+ * <li>Ability to reuse a vector across readers if the column retains the same
+ * name and type (minor type and mode).</li>
+ * <li>Ability to flush unused vectors for readers with changing schemas
+ * if a schema change occurs.</li>
+ * <li>Support schema "hysteresis"; that is, a "sticky" schema that
+ * minimizes spurious changes. Once a vector is declared, it can be included
+ * in all subsequent batches (provided the column is nullable or an array).</li>
+ * </ul>
+ */
+public class ResultVectorCache {
+
+ /**
+ * State of a projected vector. At first all we have is a name.
+ * Later, we'll discover the type.
+ */
+
+ private static class VectorState {
+ protected final String name;
+ protected ValueVector vector;
+ protected boolean touched;
+
+ public VectorState(String name) {
+ this.name = name;
+ }
+
+ public boolean satisfies(MaterializedField colSchema) {
+ if (vector == null) {
+ return false;
+ }
+ MaterializedField vectorSchema = vector.getField();
+ return vectorSchema.getType().equals(colSchema.getType());
+ }
+ }
+
+ private final BufferAllocator allocator;
+ private final Map<String, VectorState> vectors = new HashMap<>();
+
+ public ResultVectorCache(BufferAllocator allocator) {
+ this.allocator = allocator;
+ }
+
+ public void predefine(List<String> selected) {
+ for (String colName : selected) {
+ addVector(colName);
+ }
+ }
+
+ private VectorState addVector(String colName) {
+ VectorState vs = new VectorState(colName);
+ vectors.put(vs.name, vs);
+ return vs;
+ }
+
+ public void newBatch() {
+ for (VectorState vs : vectors.values()) {
+ vs.touched = false;
+ }
+ }
+
+ public void trimUnused() {
--- End diff --
What is this for?
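For context, here is a minimal sketch of what `trimUnused()` might do, assuming it pairs with `newBatch()` and the `touched` flag shown in the diff: `newBatch()` clears the flag, each column the reader uses sets it, and `trimUnused()` drops whatever was not touched. The body of the real method is not visible in this diff, so this is a guess at the intent rather than the actual implementation. The class `VectorCacheSketch`, its `use()` method, and the string type tags are hypothetical stand-ins, not Drill APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ResultVectorCache. Plain strings replace
// ValueVector and MajorType so the sketch runs without Drill on the classpath.
class VectorCacheSketch {

  private static class VectorState {
    final String name;
    String type;      // stand-in for the vector's type; null until discovered
    boolean touched;  // set when the current batch uses this column

    VectorState(String name) {
      this.name = name;
    }
  }

  private final Map<String, VectorState> vectors = new HashMap<>();

  // Pre-declare columns from a select list: names are known, types are not.
  void predefine(String... names) {
    for (String name : names) {
      vectors.put(name, new VectorState(name));
    }
  }

  // Start of a batch: nothing has been used yet.
  void newBatch() {
    for (VectorState vs : vectors.values()) {
      vs.touched = false;
    }
  }

  // Hypothetical lookup: marks the column as used and returns true when an
  // existing entry with the same type could be reused.
  boolean use(String name, String type) {
    VectorState vs = vectors.computeIfAbsent(name, VectorState::new);
    boolean reused = type.equals(vs.type);
    vs.type = type;
    vs.touched = true;
    return reused;
  }

  // Assumed behavior: drop every entry not used since the last newBatch().
  // A real implementation would presumably also release the vector's buffers.
  void trimUnused() {
    vectors.values().removeIf(vs -> !vs.touched);
  }

  int size() {
    return vectors.size();
  }

  boolean contains(String name) {
    return vectors.containsKey(name);
  }

  public static void main(String[] args) {
    VectorCacheSketch cache = new VectorCacheSketch();
    cache.predefine("a", "b");
    cache.newBatch();
    cache.use("a", "INT");
    cache.trimUnused();
    System.out.println("entries after trim: " + cache.size());
  }
}
```

If this reading is right, the method supports the "flush unused vectors" bullet in the class Javadoc, and it would be worth saying so in a method comment, along with who is expected to call it and when.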
> Implement size-aware result set loader
> --------------------------------------
>
> Key: DRILL-5657
> URL: https://issues.apache.org/jira/browse/DRILL-5657
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: Future
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: Future
>
>
> A recent extension to Drill's set of test tools created a "row set"
> abstraction to allow us to create, and verify, record batches with very few
> lines of code. Part of this work involved creating a set of "column
> accessors" in the vector subsystem. Column readers provide a uniform API to
> obtain data from columns (vectors), while column writers provide a uniform
> writing interface.
> DRILL-5211 discusses a set of changes to limit value vectors to 16 MB in size
> (to avoid memory fragmentation due to Drill's two memory allocators.) The
> column accessors have proven to be so useful that they will be the basis for
> the new, size-aware writers used by Drill's record readers.
> A step in that direction is to retrofit the column writers to use the
> size-aware {{setScalar()}} and {{setArray()}} methods introduced in
> DRILL-5517.
> Since the test framework row set classes are (at present) the only consumer
> of the accessors, those classes must also be updated with the changes.
> This then allows us to add a new "row mutator" class that handles size-aware
> vector writing, including the case in which a vector fills in the middle of a
> row.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)