GitHub user sachouche commented on the issue:
https://github.com/apache/drill/pull/1060
Before I reply to the comments, I first want to thank both Parth and Paul
for taking the time to review this Pull Request.
@parthchandra Regarding the High-Level Comments
- FS Access Manager
The attached document listed several possible improvements in terms of CPU,
partitioning (changes to the query plan), and the IO layer. I didn't want to
bundle all of these changes into a single pull request; instead, I started with
the changes that have the highest priority in terms of performance improvement,
namely the CPU-related enhancements.
- Drill Memory Package
I understand where you're coming from; I analyzed this task and came up with
the solution I felt was cleanest. Let me walk you through my thought process
and the alternative solutions I considered:
a) Why did I use these low-level APIs?
The bulk Parquet reader uses the Drill buffer APIs to copy data chunks to the
CPU cache, which means memory sanity checks are always performed. During code
profiling I noticed that HotSpot was emitting too many instructions (byte-based
assembly); researching ways to improve this, I found Hadoop and compression
Java code that uses 64-bit access instructions when manipulating byte arrays
(e.g., [Snappy](https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/snappy/SnappyRawCompressor.java)).
This makes a huge difference; a sketch of the technique follows below.
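For illustration, here is a minimal sketch of the word-at-a-time pattern (the
class and method names are hypothetical, not this PR's actual code): it copies
a byte array eight bytes per load/store via sun.misc.Unsafe, the way the linked
Snappy code does, instead of one byte at a time.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch of 64-bit word-at-a-time byte-array copying;
// not the code in this PR.
public class WordCopySketch {
  private static final Unsafe UNSAFE = loadUnsafe();
  private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);

  private static Unsafe loadUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new AssertionError(e);
    }
  }

  /** Copies {@code length} bytes from src to dst, 64 bits at a time. */
  static void copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
    int i = 0;
    // Bulk loop: one 8-byte load and store per iteration instead of
    // eight single-byte moves.
    for (; i + 8 <= length; i += 8) {
      long word = UNSAFE.getLong(src, BYTE_ARRAY_OFFSET + srcPos + i);
      UNSAFE.putLong(dst, BYTE_ARRAY_OFFSET + dstPos + i, word);
    }
    // Tail loop: the remaining 0-7 bytes.
    for (; i < length; i++) {
      dst[dstPos + i] = src[srcPos + i];
    }
  }
}
```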
b) Safety
As indicated previously, native data is always accessed through the Drill
buffer APIs (that is, accessed data is sanity-checked). During bulk processing,
the new utility performs additional checks that are triggered when assertions
are enabled. This means the test suite and QA tests (which run with assertions
on) will catch any faulty memory access; note also that the checks are simpler,
since they only involve byte-array boundary checks. A sketch of this pattern
follows below.
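To make the assertion-gated pattern concrete, here is a minimal sketch (with
hypothetical names, not the exact code in the PR): with -ea the boundary check
runs on every bulk call, and with assertions disabled it costs nothing.

```java
// Hypothetical sketch of assertion-gated boundary checks for bulk accesses.
public class BulkBoundsSketch {
  /** Returns true on success so it can sit inside an assert statement. */
  private static boolean boundsCheck(byte[] array, int offset, int length) {
    if (offset < 0 || length < 0 || offset + length > array.length) {
      throw new IndexOutOfBoundsException(
          "offset=" + offset + ", length=" + length + ", capacity=" + array.length);
    }
    return true;
  }

  static void bulkCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
    // Evaluated only when the JVM runs with assertions enabled (-ea),
    // i.e. in the test suite and QA runs; production runs pay no cost.
    assert boundsCheck(src, srcPos, length);
    assert boundsCheck(dst, dstPos, length);
    System.arraycopy(src, srcPos, dst, dstPos, length);  // stand-in for the fast path
  }
}
```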
c) Abstraction
I agree that the Drill code base should follow a common code path when
accessing native memory (to minimize the chance of introducing bugs); that is
precisely why I placed the abstraction in the Drill Memory package, since it is
centralized and thus easier to test.
d) Alternatives
I considered using io.netty.util.internal.PlatformDependent0 to access the
Java intrinsics APIs, but that class has package scope, and the public wrapper
class io.netty.util.internal.PlatformDependent doesn't expose this low-level
functionality. The workaround would have been to create a MemoryUtils class
under the io.netty.util.internal package, as sketched below. I didn't think
this was cleaner than using Java's intrinsic APIs directly (note also that
Java 9 will expose such intrinsics under a public JDK package instead of
sun.misc). Please let me know if you would prefer the io.netty.util.internal
route instead.
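For clarity, this is roughly what the rejected workaround would look like
(MemoryUtils is hypothetical and does not exist in Netty):

```java
// Hypothetical sketch of the rejected workaround: a class planted inside
// Netty's internal package just to reach the package-scoped
// PlatformDependent0. It compiles only because it shares Netty's package,
// which makes it fragile against any Netty-internal refactoring.
package io.netty.util.internal;

public final class MemoryUtils {
  private MemoryUtils() {}

  public static boolean unsafeAvailable() {
    // PlatformDependent0 is visible here solely due to the shared package.
    return PlatformDependent0.hasUnsafe();
  }

  // Further low-level accessors would be re-exported the same way ...
}
```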
- Changes to the Vector Code
I had a few discussions with Paul about the need to reconcile with DRILL-5657;
the agreement was to proceed with the current implementation and apply the
refactoring in future releases, since my changes are backward compatible (only
new public APIs were added).
- Arrow Integration
I will create a task to analyze the impact of the Arrow Integration on the
Parquet Reader.
- UserBitShared.proto
This file was updated to deal with the implicit-columns optimization. I added
two fields to the NullableVarChar class to capture the logical number of rows
and whether this feature is active; this allowed me to use different Accessor
and Mutator implementations (a hedged sketch follows after this list). You and
I discussed this, and the agreement was that:
a) the new optimization is configurable (the current default is off), though we
might change the default at some point;
b) we should detect C++ based clients and turn off this optimization (that is,
the implicit-columns enhancement). I haven't implemented this yet because I
forgot the name of the property you mentioned (it should be quick to implement,
though).
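As promised above, here is a hedged sketch (all names hypothetical, and the
constant-value semantics is an illustrative assumption) of how the two new
metadata fields could drive the choice of Accessor at load time:

```java
// Hypothetical sketch; the actual field and class names in the PR differ.
interface VarCharAccessor {
  byte[] get(int index);
}

final class AccessorSelector {
  /**
   * Picks the accessor based on the two new metadata fields:
   * the implicit-column flag and the logical row count.
   */
  static VarCharAccessor select(boolean implicitColumnActive,
                                int logicalRowCount,
                                byte[] singleValue,
                                VarCharAccessor defaultAccessor) {
    if (!implicitColumnActive) {
      return defaultAccessor;  // regular per-row storage
    }
    // One stored value stands in for logicalRowCount identical logical rows.
    return index -> {
      if (index < 0 || index >= logicalRowCount) {
        throw new IndexOutOfBoundsException("index=" + index);
      }
      return singleValue;
    };
  }
}
```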
- Testing
Yes, QA helped me run the test suite with the two new options enabled. The new
changes don't modify any of the encoding/compression logic, but I'll be happy
to add any tests you deem necessary.
- Code Style
Thank you for the pointers; I'll follow your feedback in future changes to
ease the review process.
---