[jira] [Created] (DRILL-5207) Improve Parquet scan pipelining

Parth Chandra (JIRA) Thu, 19 Jan 2017 17:53:54 -0800

Parth Chandra created DRILL-5207:
------------------------------------

             Summary: Improve Parquet scan pipelining
                 Key: DRILL-5207
                 URL: https://issues.apache.org/jira/browse/DRILL-5207
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.9.0
            Reporter: Parth Chandra
            Assignee: Parth Chandra
             Fix For: 1.10



The Parquet reader's async page reader is not efficiently pipelined. 
The default size of the disk read buffer is 4MB, while the page reader reads 
~1MB at a time and the Parquet decode also processes 1MB at a time. This means 
the disk is idle while the data is being processed. Reducing the buffer to 1MB 
will reduce the time the processing thread waits for the disk read thread.
In addition, since a page may hold more or less than 1MB of data, a queue of 
pages will help: the disk scan then blocks waiting for the processing thread 
only when the queue is full.
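
The queue of pages described above could be sketched roughly as follows. This 
is a minimal illustration using a bounded java.util.concurrent queue, not the 
actual Drill implementation; the class name PageQueueSketch, the run method, 
the page sizes and the END sentinel are all assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: none of these names come from Drill's code base.
public class PageQueueSketch {
    // Sentinel object marking the end of the pages; compared by identity.
    private static final byte[] END = new byte[0];

    /** Passes pageCount fake pages through a bounded queue; returns total bytes "decoded". */
    public static long run(int pageCount, int queueCapacity) throws InterruptedException {
        BlockingQueue<byte[]> pages = new ArrayBlockingQueue<>(queueCapacity);

        Thread diskReader = new Thread(() -> {
            try {
                for (int i = 0; i < pageCount; i++) {
                    // Stand-in for a disk read; page sizes vary around a nominal size.
                    byte[] page = new byte[1024 + (i % 3) * 512];
                    pages.put(page); // blocks only when the queue is full
                }
                pages.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        diskReader.start();

        long decoded = 0;
        while (true) {
            byte[] page = pages.take(); // blocks only when the queue is empty
            if (page == END) {
                break;
            }
            decoded += page.length; // stand-in for decoding the page
        }
        diskReader.join();
        return decoded;
    }
}
```

With a capacity greater than one, the reading thread can run ahead of the 
decoder by up to queueCapacity pages, which absorbs the variation in page size.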
The BufferedDirectBufInputStream class also reads from disk as soon as it is 
initialized. Since initialization happens at setup time, this increases the 
setup time for the query, and query execution does not begin until the read 
completes.
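
One way to avoid the read at initialization is to defer the first buffer fill 
until the first read call. The sketch below shows the idea on a plain 
java.io.InputStream wrapper; it is not the BufferedDirectBufInputStream code, 
and the class name LazyBufferedStreamSketch and the fillCount method are 
hypothetical. It assumes a positive buffer size.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of deferring the first disk read until data is requested.
public class LazyBufferedStreamSketch extends InputStream {
    private final InputStream source;
    private final byte[] buffer;
    private int pos = 0;   // next byte to hand out
    private int limit = 0; // number of valid bytes in the buffer
    private int fills = 0; // how many times the source has been read

    public LazyBufferedStreamSketch(InputStream source, int bufferSize) {
        this.source = source;
        this.buffer = new byte[bufferSize];
        // Deliberately no read here: construction stays cheap at setup time.
    }

    @Override
    public int read() throws IOException {
        if (pos >= limit) {
            // The buffer is empty or exhausted; only now touch the source.
            limit = source.read(buffer, 0, buffer.length);
            pos = 0;
            fills++;
            if (limit <= 0) {
                limit = 0;
                return -1; // end of stream
            }
        }
        return buffer[pos++] & 0xFF;
    }

    /** Number of reads issued against the underlying source so far. */
    public int fillCount() {
        return fills;
    }
}
```

Constructing the stream issues no read against the source; the first fill 
happens inside the first read call.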
There are a few other inefficiencies. For example, options are read every 
time a page reader is created, and reading options can be expensive.
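
A possible way to avoid the repeated option reads is to resolve the options 
once per scan and hand an immutable snapshot to every page reader. The sketch 
below is only an illustration: CountingOptions stands in for an expensive 
option store, and all class names and option keys are hypothetical, not 
Drill's option manager API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the class names and option keys are illustrative only.
public class ReaderOptionsSketch {

    /** Stand-in for an expensive option store; counts how often it is queried. */
    public static final class CountingOptions {
        private final Map<String, String> values = new HashMap<>();
        private int lookups = 0;

        public void set(String key, String value) {
            values.put(key, value);
        }

        public String get(String key) {
            lookups++;
            return values.get(key);
        }

        public int lookups() {
            return lookups;
        }
    }

    /** Immutable snapshot of the options, resolved once per scan. */
    public static final class ScanSettings {
        public final boolean asyncPageReader;
        public final int bufferSizeBytes;

        public ScanSettings(CountingOptions options) {
            this.asyncPageReader = Boolean.parseBoolean(options.get("example.pagereader.async"));
            this.bufferSizeBytes = Integer.parseInt(options.get("example.pagereader.buffersize"));
        }
    }

    /** A page reader that receives the snapshot instead of reading options itself. */
    public static final class PageReader {
        private final ScanSettings settings;

        public PageReader(ScanSettings settings) {
            this.settings = settings;
        }

        public int bufferSizeBytes() {
            return settings.bufferSizeBytes;
        }
    }
}
```

With this shape, the number of option lookups depends on the number of 
options, not on the number of page readers created.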




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
