[jira] [Commented] (DRILL-5657) Implement size-aware result set loader

ASF GitHub Bot (JIRA) Tue, 08 Aug 2017 17:03:35 -0700

    [ https://issues.apache.org/jira/browse/DRILL-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119211#comment-16119211 ]


ASF GitHub Bot commented on DRILL-5657:
---------------------------------------

Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131564509
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultVectorCache.java ---
    @@ -0,0 +1,181 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.drill.common.types.TypeProtos.MajorType;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Manages an inventory of value vectors used across row batch readers.
    + * Drill's batch semantics are complex. Each operator logically returns
    + * a batch of records on each call of the Drill Volcano iterator protocol
    + * <tt>next()</tt> operation. However, the batches "returned" are not
    + * separate objects. Instead, Drill enforces the following semantics:
    + * <ul>
    + * <li>If a <tt>next()</tt> call returns <tt>OK</tt> then the set of vectors
    + * in the "returned" batch must be identical to those in the prior batch. Not
    + * just the same type; they must be the same <tt>ValueVector</tt> objects.
    + * (The buffers within the vectors will be different.)</li>
    + * <li>If the set of vectors changes in any way (add a vector, remove a
    + * vector, change the type of a vector), then the <tt>next()</tt> call
    + * <b>must</b> return <tt>OK_NEW_SCHEMA</tt>.</li>
    + * </ul>
    + * These rules create interesting constraints for the scan operator.
    + * Conceptually, each batch is distinct. But, it must share vectors. The
    + * {@link ResultSetLoader} class handles this by managing the set of vectors
    + * used by a single reader.
    + * <p>
    + * Readers are independent: each may read a distinct schema (as in JSON.)
    + * Yet, the Drill protocol requires minimizing spurious <tt>OK_NEW_SCHEMA</tt>
    + * events. As a result, two readers run by the same scan operator must
    + * share the same set of vectors, despite the fact that they may have
    + * different schemas and thus different <tt>ResultSetLoader</tt>s.
    + * <p>
    + * The purpose of this inventory is to persist vectors across readers, even
    + * when, say, reader B does not use a vector that reader A created.
    + * <p>
    + * The semantics supported by this class include:
    + * <ul>
    + * <li>Ability to "pre-declare" columns based on columns that appear in
    + * an explicit select list. This ensures that the columns are known (but
    + * not their types).</li>
    + * <li>Ability to reuse a vector across readers if the column retains the same
    + * name and type (minor type and mode).</li>
    + * <li>Ability to flush unused vectors for readers with changing schemas
    + * if a schema change occurs.</li>
    + * <li>Support for schema "hysteresis"; that is, a "sticky" schema that
    + * minimizes spurious changes. Once a vector is declared, it can be included
    + * in all subsequent batches (provided the column is nullable or an array).</li>
    + * </ul>
    + */
    +public class ResultVectorCache {
    +
    +  /**
    +   * State of a projected vector. At first all we have is a name.
    +   * Later, we'll discover the type.
    +   */
    +
    +  private static class VectorState {
    +    protected final String name;
    +    protected ValueVector vector;
    +    protected boolean touched;
    +
    +    public VectorState(String name) {
    +      this.name = name;
    +    }
    +
    +    public boolean satisfies(MaterializedField colSchema) {
    +      if (vector == null) {
    +        return false;
    +      }
    +      MaterializedField vectorSchema = vector.getField();
    +      return vectorSchema.getType().equals(colSchema.getType());
    +    }
    +  }
    +
    +  private final BufferAllocator allocator;
    +  private final Map<String, VectorState> vectors = new HashMap<>();
    +
    +  public ResultVectorCache(BufferAllocator allocator) {
    +    this.allocator = allocator;
    +  }
    +
    +  public void predefine(List<String> selected) {
    +    for (String colName : selected) {
    +      addVector(colName);
    +    }
    +  }
    +
    +  private VectorState addVector(String colName) {
    +    VectorState vs = new VectorState(colName);
    +    vectors.put(vs.name, vs);
    +    return vs;
    +  }
    +
    +  public void newBatch() {
    +    for (VectorState vs : vectors.values()) {
    +      vs.touched = false;
    +    }
    +  }
    +
    +  public void trimUnused() {
    --- End diff --
    
    What is this for?
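
For context on the question: the diff is cut off at the declaration of `trimUnused()`, so its body is not visible here. One plausible reading, based only on the `touched` flag that `newBatch()` resets and on the Javadoc bullet about flushing unused vectors, is a mark-and-sweep step. The sketch below illustrates that guess with plain Java collections. `VectorCacheSketch`, `Entry`, `use()`, and `names()` are hypothetical names, a string stands in for the vector's type, and nothing here is taken from the actual method body.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the cache in the diff; not Drill code. */
class VectorCacheSketch {

  /** One cached column: its name, a type label, and a per-batch mark. */
  private static final class Entry {
    final String name;
    String type;       // null until some reader declares a type
    boolean touched;   // set when a reader uses the column in this batch

    Entry(String name) {
      this.name = name;
    }
  }

  private final Map<String, Entry> entries = new LinkedHashMap<>();

  /** Pre-declare columns by name only; their types are not yet known. */
  void predefine(List<String> selected) {
    for (String name : selected) {
      entries.putIfAbsent(name, new Entry(name));
    }
  }

  /** Start of a batch: clear every mark, as newBatch() does in the diff. */
  void newBatch() {
    for (Entry e : entries.values()) {
      e.touched = false;
    }
  }

  /**
   * A reader asks for a column. Returns true when an existing entry with
   * the same type is reused, false when the entry is new or is retyped.
   */
  boolean use(String name, String type) {
    Entry e = entries.computeIfAbsent(name, Entry::new);
    boolean reused = type.equals(e.type);
    e.type = type;
    e.touched = true;
    return reused;
  }

  /** ASSUMED meaning of trimUnused(): drop entries not used since newBatch(). */
  void trimUnused() {
    entries.values().removeIf(e -> !e.touched);
  }

  /** Names of the columns currently held, in insertion order. */
  List<String> names() {
    return new ArrayList<>(entries.keySet());
  }
}
```

Under this reading, a caller would invoke `newBatch()`, let the reader call `use()` for each column it writes, and call `trimUnused()` only when a schema change is being reported anyway, so that the trim itself never causes a spurious `OK_NEW_SCHEMA`.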


> Implement size-aware result set loader
> --------------------------------------
>
>                 Key: DRILL-5657
>                 URL: https://issues.apache.org/jira/browse/DRILL-5657
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: Future
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: Future
>
>
> A recent extension to Drill's set of test tools created a "row set" 
> abstraction to allow us to create, and verify, record batches with very few 
> lines of code. Part of this work involved creating a set of "column 
> accessors" in the vector subsystem. Column readers provide a uniform API to 
> obtain data from columns (vectors), while column writers provide a uniform 
> writing interface.
> DRILL-5211 discusses a set of changes to limit value vectors to 16 MB in size 
> (to avoid memory fragmentation due to Drill's two memory allocators.) The 
> column accessors have proven to be so useful that they will be the basis for 
> the new, size-aware writers used by Drill's record readers.
> A step in that direction is to retrofit the column writers to use the 
> size-aware {{setScalar()}} and {{setArray()}} methods introduced in 
> DRILL-5517.
> Since the test framework row set classes are (at present) the only consumer 
> of the accessors, those classes must also be updated with the changes.
> This then allows us to add a new "row mutator" class that handles size-aware 
> vector writing, including the case in which a vector fills in the middle of a 
> row.
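
The case the description ends on, a vector filling in the middle of a row, can be illustrated with a small sketch. It is not the Drill implementation: `SizeAwareWriterSketch` and every member below are hypothetical, string lists stand in for value vectors, character counts stand in for bytes, and the limit is a constructor argument rather than the 16 MB discussed in DRILL-5211. The idea shown is only that, on overflow, the completed rows are sealed as a batch and the values already written for the in-progress row move to a fresh batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of size-aware row writing; none of these names come from Drill. */
class SizeAwareWriterSketch {
  private final int columnCount;
  private final int limit;              // stand-in for the per-vector size limit
  private List<List<String>> columns;   // current batch: one value list per column
  private int[] used;                   // size consumed per column in the current batch
  private int completeRows;             // rows fully written in the current batch
  private int nextColumn;               // next column to write in the row in progress
  private final List<Integer> sealedRowCounts = new ArrayList<>();

  SizeAwareWriterSketch(int columnCount, int limit) {
    this.columnCount = columnCount;
    this.limit = limit;
    startBatch();
  }

  private void startBatch() {
    columns = new ArrayList<>();
    for (int c = 0; c < columnCount; c++) {
      columns.add(new ArrayList<>());
    }
    used = new int[columnCount];
    completeRows = 0;
  }

  /** Writes the next column value of the current row, rolling over on overflow. */
  void write(String value) {
    int size = value.length();
    // Roll over only if the batch already holds a complete row; a single
    // oversized row is written as-is so the sketch cannot loop forever.
    if (used[nextColumn] + size > limit && completeRows > 0) {
      rollOver();
    }
    columns.get(nextColumn).add(value);
    used[nextColumn] += size;
    nextColumn++;
    if (nextColumn == columnCount) {
      nextColumn = 0;
      completeRows++;
    }
  }

  /** Seals the complete rows and carries the partial row into a new batch. */
  private void rollOver() {
    List<String> partial = new ArrayList<>();
    for (int c = 0; c < nextColumn; c++) {
      // The in-progress row sits at index completeRows in columns already written.
      partial.add(columns.get(c).remove(completeRows));
    }
    sealedRowCounts.add(completeRows);
    startBatch();
    for (int c = 0; c < partial.size(); c++) {
      columns.get(c).add(partial.get(c));
      used[c] += partial.get(c).length();
    }
  }

  List<Integer> sealedRowCounts() {
    return sealedRowCounts;
  }

  int completeRows() {
    return completeRows;
  }

  List<String> column(int c) {
    return columns.get(c);
  }
}
```

With two columns and a limit of 10, writing the rows ("aaaa", "bbbbbb") and ("cccc", "dddddd") overflows the second column partway through the second row: the first row is sealed as a one-row batch, and "cccc" is carried over so the second row ends up whole in the new batch.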



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
