GitHub user sachouche commented on the issue:
https://github.com/apache/drill/pull/1060
Before I reply to the comments, I first want to thank both Parth and Paul
for taking the time to review this Pull Request.
@parthchandra Regarding the High-Level Comments
- FS Access Manager
The attached document listed several possible improvements in terms of CPU,
partitioning (changes to the query plan), and the IO layer. I didn't want to
bundle all of these changes into a single pull request; instead, I started with
the changes that have the highest priority in terms of performance improvement,
namely the CPU-related enhancements.
- Drill Memory Package
I understand where you're coming from; I analyzed this task and came up with
the solution I felt was cleanest. Let me walk you through my thought process
and the alternative solutions I considered:
a) Why did I use these low-level APIs?
The bulk Parquet reader uses the Drill buffer APIs to copy data chunks to the
CPU cache, which means memory sanity checks are always performed. During code
profiling I noticed that HotSpot was emitting too many instructions (byte-based
assembly); researching ways to improve this, I found Hadoop and compression
Java code that uses 64-bit access instructions when manipulating byte arrays
(e.g., [Snappy](https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/snappy/SnappyRawCompressor.java)).
This makes a huge difference; a sketch of the technique follows below.
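For illustration, here is a minimal sketch of the word-at-a-time pattern (the
class and method names are hypothetical, not this PR's actual code): it copies
a byte array eight bytes per load/store via sun.misc.Unsafe, the way the linked
Snappy code does, instead of one byte at a time.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch of 64-bit word-at-a-time byte-array copying;
// not the code in this PR.
public class WordCopySketch {
  private static final Unsafe UNSAFE = loadUnsafe();
  private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);

  private static Unsafe loadUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new AssertionError(e);
    }
  }

  /** Copies {@code length} bytes from src to dst, 64 bits at a time. */
  static void copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
    int i = 0;
    // Bulk loop: one 8-byte load and store per iteration instead of
    // eight single-byte moves.
    for (; i + 8 <= length; i += 8) {
      long word = UNSAFE.getLong(src, BYTE_ARRAY_OFFSET + srcPos + i);
      UNSAFE.putLong(dst, BYTE_ARRAY_OFFSET + dstPos + i, word);
    }
    // Tail loop: the remaining 0-7 bytes.
    for (; i < length; i++) {
      dst[dstPos + i] = src[srcPos + i];
    }
  }
}
```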
b) Safety
As indicated previously, native data is always accessed through the Drill
buffer APIs (that is, accessed data is sanity-checked). During bulk processing,
the new utility performs additional checks that are triggered when assertions
are enabled. This means the test suite and QA tests (which run with assertions
on) will catch any faulty memory access; note also that the checks are simpler,
since they only involve byte-array boundary checks. A sketch of this pattern
follows below.
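To make the assertion-gated pattern concrete, here is a minimal sketch (with
hypothetical names, not the exact code in the PR): with -ea the boundary check
runs on every bulk call, and with assertions disabled it costs nothing.

```java
// Hypothetical sketch of assertion-gated boundary checks for bulk accesses.
public class BulkBoundsSketch {
  /** Returns true on success so it can sit inside an assert statement. */
  private static boolean boundsCheck(byte[] array, int offset, int length) {
    if (offset < 0 || length < 0 || offset + length > array.length) {
      throw new IndexOutOfBoundsException(
          "offset=" + offset + ", length=" + length + ", capacity=" + array.length);
    }
    return true;
  }

  static void bulkCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
    // Evaluated only when the JVM runs with assertions enabled (-ea),
    // i.e. in the test suite and QA runs; production runs pay no cost.
    assert boundsCheck(src, srcPos, length);
    assert boundsCheck(dst, dstPos, length);
    System.arraycopy(src, srcPos, dst, dstPos, length);  // stand-in for the fast path
  }
}
```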
c) Abstraction
I agree that the Drill code base should follow a common code path when
accessing native memory (to minimize the chance of introducing bugs); that is
precisely why I placed the abstraction in the Drill Memory package, since it is
centralized and thus easier to test.
d) Alternatives
I considered using io.netty.util.internal.PlatformDependent0 to access the
Java intrinsics APIs, but that class has package scope, and the public wrapper
class io.netty.util.internal.PlatformDependent doesn't expose this low-level
functionality. The workaround would have been to create a MemoryUtils class
under the io.netty.util.internal package, as sketched below. I didn't think
this was cleaner than using Java's intrinsic APIs directly (note also that
Java 9 will expose such intrinsics under a public JDK package instead of
sun.misc). Please let me know if you would prefer the io.netty.util.internal
route instead.
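For clarity, this is roughly what the rejected workaround would look like
(MemoryUtils is hypothetical and does not exist in Netty):

```java
// Hypothetical sketch of the rejected workaround: a class planted inside
// Netty's internal package just to reach the package-scoped
// PlatformDependent0. It compiles only because it shares Netty's package,
// which makes it fragile against any Netty-internal refactoring.
package io.netty.util.internal;

public final class MemoryUtils {
  private MemoryUtils() {}

  public static boolean unsafeAvailable() {
    // PlatformDependent0 is visible here solely due to the shared package.
    return PlatformDependent0.hasUnsafe();
  }

  // Further low-level accessors would be re-exported the same way ...
}
```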
- Changes to the Vector Code
I had a few discussions with Paul about the need to reconcile with DRILL-5657;
the agreement was to proceed with the current implementation and apply the
refactoring in future releases, since my changes are backward compatible (only
new public APIs were added).
- Arrow Integration
I will create a task to analyze the impact of the Arrow Integration on the
Parquet Reader.
- UserBitShared.proto
This file was updated to deal with the implicit-columns optimization. I added
two fields to the NullableVarChar class to capture the logical number of rows
and whether this feature is active; this allowed me to use different Accessor
and Mutator implementations (a hedged sketch follows after this list). You and
I discussed this, and the agreement was that:
a) the new optimization is configurable (the current default is off), though we
might change the default at some point;
b) we should detect C++ based clients and turn off this optimization (that is,
the implicit-columns enhancement). I haven't implemented this yet because I
forgot the name of the property you mentioned (it should be quick to implement,
though).
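As promised above, here is a hedged sketch (all names hypothetical, and the
constant-value semantics is an illustrative assumption) of how the two new
metadata fields could drive the choice of Accessor at load time:

```java
// Hypothetical sketch; the actual field and class names in the PR differ.
interface VarCharAccessor {
  byte[] get(int index);
}

final class AccessorSelector {
  /**
   * Picks the accessor based on the two new metadata fields:
   * the implicit-column flag and the logical row count.
   */
  static VarCharAccessor select(boolean implicitColumnActive,
                                int logicalRowCount,
                                byte[] singleValue,
                                VarCharAccessor defaultAccessor) {
    if (!implicitColumnActive) {
      return defaultAccessor;  // regular per-row storage
    }
    // One stored value stands in for logicalRowCount identical logical rows.
    return index -> {
      if (index < 0 || index >= logicalRowCount) {
        throw new IndexOutOfBoundsException("index=" + index);
      }
      return singleValue;
    };
  }
}
```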
- Testing
Yes, QA helped me run the test suite with the two new options enabled. The new
changes don't modify any of the encoding/compression logic, but I'll be happy
to add any tests you deem necessary.
- Code Style
Thank you for the pointers; I'll follow your feedback in future changes to
ease the review process.
---