[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

ASF GitHub Bot (JIRA) Sun, 21 Jan 2018 14:58:51 -0800

    [ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333727#comment-16333727 ]


ASF GitHub Bot commented on DRILL-5846:
---------------------------------------

Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1060#discussion_r162828788
  
    --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java ---
    @@ -68,96 +85,441 @@
     
       private final UInt1Vector bits = new UInt1Vector(bitsField, allocator);
       private final ${valuesName} values = new ${minor.class}Vector(field, allocator);
    +  private final Mutator mutator      = new MutatorImpl();
    +  private final Accessor accessor    = new AccessorImpl();
    +
    +  <#if type.major == "VarLen" && minor.class == "VarChar">
    +  private final Mutator dupMutator   = new DupValsOnlyMutator();
    +  /** Accessor instance for duplicate values vector */
    +  private final Accessor dupAccessor = new DupValsOnlyAccessor();
    +  /** Optimization for cases where all values are identical */
    +  private boolean duplicateValuesOnly;
    +  /** logical number of values */
    +  private int logicalNumValues;
    +  /** logical value capacity */
    +  private int logicalValueCapacity;
    +  /** Mutator instance for duplicate values vector */
    +
    +  /** true if this vector holds the same value albeit repeated */
    +  public boolean isDuplicateValsOnly() {
    --- End diff --
    
    I tried that (it was my first attempt), but the main issue is that the
    Drill factory creates vectors based on the column metadata alone. The
    current design has the advantage that optimizations can be enabled or
    rolled back transparently to the consumer. I also made sure there is no
    performance penalty (or at most a minimal one).
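
To make the trade-off concrete, here is a minimal sketch of the idea, not the actual Drill code. It assumes a simplified vector that keeps a single stored entry when every value is identical and picks its accessor from an internal flag. The field and accessor names mirror the diff (`duplicateValuesOnly`, `logicalNumValues`, `dupAccessor`, `isDuplicateValsOnly()`); everything else (`DupValsSketch`, `SketchVector`, `create`, `setValues`, `setDuplicateValue`, `physicalSize`) is hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; every type here is hypothetical, not Drill's API. */
public class DupValsSketch {

  /** Read-side view handed to consumers. */
  interface Accessor {
    String get(int index);
    int getValueCount();
  }

  /** Simplified stand-in for a nullable VarChar vector. */
  static class SketchVector {
    private final List<String> values = new ArrayList<>();
    private boolean duplicateValuesOnly;
    private int logicalNumValues;

    /** Regular accessor: one stored entry per logical value. */
    private final Accessor accessor = new Accessor() {
      @Override public String get(int index) { return values.get(index); }
      @Override public int getValueCount() { return values.size(); }
    };

    /** Optimized accessor: a single stored entry serves every logical index. */
    private final Accessor dupAccessor = new Accessor() {
      @Override public String get(int index) {
        if (index < 0 || index >= logicalNumValues) {
          throw new IndexOutOfBoundsException("index " + index);
        }
        return values.get(0);
      }
      @Override public int getValueCount() { return logicalNumValues; }
    };

    /** Regular write path; also rolls back the optimization if it was on. */
    void setValues(List<String> newValues) {
      values.clear();
      values.addAll(newValues);
      duplicateValuesOnly = false;
      logicalNumValues = newValues.size();
    }

    /** Optimized write path: store the value once, remember the logical count. */
    void setDuplicateValue(String value, int count) {
      values.clear();
      values.add(value);
      duplicateValuesOnly = true;
      logicalNumValues = count;
    }

    boolean isDuplicateValsOnly() { return duplicateValuesOnly; }

    /** Consumers always call this; the flag decides which view they get. */
    Accessor getAccessor() { return duplicateValuesOnly ? dupAccessor : accessor; }

    /** Number of entries physically stored (hypothetical helper). */
    int physicalSize() { return values.size(); }
  }

  /** Hypothetical factory: it sees only column metadata, so it returns one vector type. */
  static SketchVector create(String columnName) {
    return new SketchVector();
  }

  public static void main(String[] args) {
    SketchVector v = create("col");
    v.setDuplicateValue("x", 1000);
    System.out.println(v.getAccessor().getValueCount() + " logical, "
        + v.physicalSize() + " stored");
    v.setValues(Arrays.asList("a", "b"));
    System.out.println(v.getAccessor().get(1));
  }
}
```

Because the factory in this sketch never needs to know whether the data is all duplicates, the same vector instance can switch between the two modes, and consumer code that goes through `getAccessor()` does not change.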


> Improve Parquet Reader Performance for Flat Data types 
> -------------------------------------------------------
>
>                 Key: DRILL-5846
>                 URL: https://issues.apache.org/jira/browse/DRILL-5846
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>              Labels: performance
>             Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance, as several users reported 
> that Parquet parsing represents the lion's share of the overall query 
> execution time. It tracks Flat Data types only, as Nested DTs might involve 
> functional and processing enhancements (e.g., a nested column can be seen as 
> a Document; a user might want to perform operations scoped at the document 
> level that do not need to span all rows). Another JIRA will be created to 
> handle the nested columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
