[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

ASF GitHub Bot (JIRA) Fri, 12 Jan 2018 06:40:27 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324035#comment-16324035
 ]


ASF GitHub Bot commented on DRILL-5846:
---------------------------------------

Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/1060
  
    @paul-rogers with regard to the design aspects that you brought up:
    ** Corrections about the proposed Design **
    Your analysis somehow assumes the Vector is the one driving the loading 
operation. This is not the case, a) it is the reader which invokes the bulk 
save API b) the Vector save method runs a loop requesting the rest of the data. 
The reader can any time stop the loading phase; currently the reader uses the 
batch size to make this decision. The plan is to enhance this logic by having a 
feedback from the Vector save method about memory usage (e.g., save current row 
set with the condition the memory usage doesn't get X bytes). I believe I have 
raised this design decision early on and the agreement was to implement the 
memory batch restrictions in a next Drill release.
    
    ** Agile Development **
    Through the years I came to appreciate the value of agile development. One 
needs to prioritize design activities; it is completely valid to start with a 
design (which is not guaranteed to be perfect) and then refine it overtime as 
long as the underlying implementation doesn't spill to other modules 
(encapsulation). This has the merit of showcasing incremental improvements and 
allowing the developer to get new insight as they have a better understanding.
       


> Improve Parquet Reader Performance for Flat Data types 
> -------------------------------------------------------
>
>                 Key: DRILL-5846
>                 URL: https://issues.apache.org/jira/browse/DRILL-5846
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>              Labels: performance
>             Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion share of the overall query execution. It 
> tracks Flat Data types only as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; 
> user might want to perform operations scoped at the document level that is no 
> need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

Reply via email to