[
https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843711#comment-15843711
]
ASF GitHub Bot commented on DRILL-5207:
---------------------------------------
Github user parthchandra commented on a diff in the pull request:
https://github.com/apache/drill/pull/723#discussion_r98289161
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
---
@@ -179,10 +189,10 @@ private int getNextBlock() throws IOException {
this.curPosInStream = getInputStream().getPos();
bytesRead = nBytes;
logger.trace(
--- End diff --
Sure.
> Improve Parquet scan pipelining
> -------------------------------
>
> Key: DRILL-5207
> URL: https://issues.apache.org/jira/browse/DRILL-5207
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.9.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Fix For: 1.10
>
>
> The Parquet reader's async page reader is not efficiently pipelined.
> The default size of the disk read buffer is 4MB, while the page reader reads
> ~1MB at a time and the Parquet decode also processes ~1MB at a time. This
> means the disk is idle while the data is being processed. Reducing the buffer
> to 1MB will reduce the time the processing thread waits for the disk read
> thread.
> Additionally, since the data needed to process a page may be more or less
> than 1MB, a queue of pages will help: the disk scan can then run ahead and
> does not block waiting for the processing thread until the queue is full.
> The BufferedDirectBufInputStream class also reads from disk as soon as it is
> initialized. Since initialization happens at setup time, this increases the
> setup time for the query, and query execution does not begin until the read
> has completed.
> There are a few other inefficiencies: options are read every time a page
> reader is created, and reading options can be expensive.
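
The page queue proposed in the description can be illustrated with a bounded producer/consumer queue. The following is only a minimal sketch under stated assumptions, not Drill's actual implementation: the class name PageQueueSketch, the run method, the END sentinel, and the byte[] "pages" are all hypothetical stand-ins. A reader thread puts pages onto an ArrayBlockingQueue and blocks only when the queue is full, while the processing thread takes pages as it becomes ready for them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: none of these names come from the Drill code base.
final class PageQueueSketch {
    // Sentinel object that tells the consumer there are no more pages.
    private static final byte[] END = new byte[0];

    // Produces pageCount fake pages on a reader thread, consumes them on the
    // calling thread, and returns the page sizes in the order consumed.
    static List<Integer> run(int pageCount, int queueCapacity) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < pageCount; i++) {
                    byte[] page = new byte[i + 1]; // stands in for a disk read
                    queue.put(page);               // blocks only when the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        List<Integer> sizes = new ArrayList<>();
        try {
            while (true) {
                byte[] page = queue.take();        // blocks only when the queue is empty
                if (page == END) {
                    break;
                }
                sizes.add(page.length);            // stands in for decoding the page
            }
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming pages", e);
        }
        return sizes;
    }
}
```

For example, PageQueueSketch.run(5, 2) lets the reader stay up to two pages ahead of the consumer. The queue capacity is the tuning knob here: a larger value lets the disk scan run further ahead at the cost of more buffered memory.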
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)