GitHub user parthchandra opened a pull request:

    DRILL-4800: Improve Parquet reader performance

    Added a buffering input stream.
    Updated the Parquet reader to optionally use the buffering input stream.
    Added optional asynchronous reading of page data.
    Added optional parallel decompression and decoding of columns.
        Decompression of data using Gzip/Snappy bypasses the Parquet APIs and
        calls the decompressors directly (there were concurrency issues with
        using the Parquet APIs).
    Added new operator metrics for asynchronous page reading.
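
As a rough illustration of the "calls the decompressors directly" item above, the following is a minimal sketch of decompressing one Gzip page with only the JDK's java.util.zip classes. The class and method names (DirectGzipSketch, decompress, compress) are hypothetical and are not taken from this pull request; Snappy is left out because the JDK ships no Snappy codec. Each call builds its own stream, so concurrent calls share no decompressor state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not a class from this pull request.
final class DirectGzipSketch {
    private DirectGzipSketch() {}

    // Decompress one gzip-compressed page whose uncompressed size is known
    // up front (assumed here to come from the page header).
    static byte[] decompress(byte[] compressed, int uncompressedSize) {
        byte[] page = new byte[uncompressedSize];
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int filled = 0;
            while (filled < uncompressedSize) {
                int n = in.read(page, filled, uncompressedSize - filled);
                if (n < 0) {
                    throw new IOException("page ended after " + filled + " bytes");
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }

    // Companion used only to produce input for the sketch above.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Because the method keeps all state on the stack, it can be called from several column-decoding threads at once without locking.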

You can merge this pull request into a Git repository by running:

    $ git pull DRILL-4800

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #611

commit 0457d69cae403bc8abcebb90ead55769ec58f5ac
Author: Parth Chandra <>
Date:   2016-06-10T21:56:41Z

    DRILL-4800: Use a buffering input stream in the Parquet reader

commit a33200107a5180f1b0dbad2b2e5b0905de4ed884
Author: Parth Chandra <>
Date:   2016-08-24T17:46:37Z

    DRILL-4800: Parallelize column reading.
      Read/Decode fixed width fields in parallel
      Decoding var length columns in parallel
      Use simplified decompress method for Gzip and Snappy decompression.
      Avoids concurrency issue with Parquet decompression. (It's also faster).
      Stress test Parquet read write
      Parallel column reader is disabled by default (may perform less well
      under higher concurrency)
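
The parallel column reading described in this commit can be pictured with a small sketch: one task per column is submitted to a thread pool and the results are collected in column order. This is an assumption-laden illustration, not the code from the commit; ParallelColumnSketch, decodeAll, and the generic decoder function are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration; not a class from this pull request.
final class ParallelColumnSketch {
    private ParallelColumnSketch() {}

    // Decode every column on a fixed-size pool; results keep column order.
    static <C, R> List<R> decodeAll(List<C> columns, Function<C, R> decoder, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (C column : columns) {
                Callable<R> task = () -> decoder.apply(column);
                pending.add(pool.submit(task));
            }
            List<R> decoded = new ArrayList<>();
            for (Future<R> future : pending) {
                decoded.add(future.get()); // blocks until that column is done
            }
            return decoded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while decoding columns", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("column decode failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A pool created per call keeps the sketch short; a long-lived shared pool would be the more likely choice in a real scan, and contention on that pool under many concurrent queries is one plausible reason for a feature like this to be off by default.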

commit 8d9c26071b4826bda917ac4e88c70b7351a16d83
Author: Parth Chandra <>
Date:   2016-09-27T21:03:35Z

    DRILL-4800: Add AsyncPageReader to pipeline PageRead
      Use non tracking input stream for Parquet scans.
      Make choice between async and sync reader configurable.
      Make various options user configurable - choose between sync and async
      page reader, enable/disable fadvise
      Add Parquet Scan metrics to track time spent in various operations
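
To show what pipelining page reads means in general, here is a minimal sketch in which the next page is fetched on a background thread while the caller is still working on the current one. It is not the AsyncPageReader from this commit; AsyncPageSketch and its Supplier-based page source (returning null at end of data) are assumptions made for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a class from this pull request.
final class AsyncPageSketch<P> implements AutoCloseable {
    private final Supplier<P> source; // assumed to return null at end of data
    private final ExecutorService io = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "page-prefetch");
        t.setDaemon(true); // do not keep the JVM alive for prefetching
        return t;
    });
    private CompletableFuture<P> inFlight;

    AsyncPageSketch(Supplier<P> source) {
        this.source = source;
        // Start reading the first page before anyone asks for it.
        this.inFlight = CompletableFuture.supplyAsync(source, io);
    }

    // Wait for the page already in flight, then queue the read of the
    // following page so it overlaps with the caller's decoding work.
    P next() {
        P page = inFlight.join();
        if (page != null) {
            inFlight = CompletableFuture.supplyAsync(source, io);
        }
        return page;
    }

    @Override
    public void close() {
        io.shutdownNow();
    }
}
```

The single I/O thread means the source is only ever touched by one thread, so the source itself does not need to be thread-safe.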

commit 91658f0cb3bb2ee3ff35a0ffde859052df91527e
Author: Parth Chandra <>
Date:   2016-09-14T04:47:49Z

    DRILL-4800: Various fixes.
     Fix buffer underflow exception in BufferedDirectBufInputStream.
     Fix writer index for int64 dictionary encoded types.
     Added logging to help debug.
     Fix memory leaks.
     Work around issues with InputStream.available() (Do not use
     hasRemainder; Remove check for EOF in ).
     Finalize defaults.
     Remove commented code.
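
For context on the InputStream.available() item: in Java, available() only estimates how many bytes can be read without blocking, and a return value of 0 does not mean end of stream. The sketch below shows the general pattern of detecting EOF from the return value of read() instead. It is not the actual change made in this commit, and ReadFullySketch and readFully are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not a class from this pull request.
final class ReadFullySketch {
    private ReadFullySketch() {}

    // Fill dst[0..len) from the stream. Returns the number of bytes read,
    // which is less than len only when the stream ended first.
    static int readFully(InputStream in, byte[] dst, int len) {
        int filled = 0;
        try {
            while (filled < len) {
                int n = in.read(dst, filled, len - filled);
                if (n < 0) {
                    break; // read() returning -1 is the reliable EOF signal
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return filled;
    }
}
```

The loop never consults available(), so it behaves the same on streams that report 0 while data is still pending.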

