On Sun, Apr 8, 2018 at 10:04 AM, Vlad Rozov <> wrote:

> Hi Paul,
> My comments in-line.
> Thank you,
> Vlad
> On 4/5/18 20:50, Paul Rogers wrote:
>> Hi Vlad,
>>>   I'd suggest to keep focus on DrillBuf design and implementation as the
>>> only gate for accessing raw (direct) memory.
>> I was doing that. By explaining where DrillBuf fits in the overall
>> design, we see that DrillBuf should be the only access point for direct
>> memory. The context explains why this is the right decision. Changes to
>> DrillBuf should support our design as DrillBuf only exists for that purpose.
> My concern is not why Drill adopted Netty as Netty does provide a good
> amount of functionality on top of Java networking and NIO. I also do not
> propose to replace Netty with something else. My primary focus for this
> thread is the design and implementation of the DrillBuf Java class itself.
> Namely, why was it necessary to introduce DrillBuf and/or
> UnsafeDirectLittleEndian? What functionality do those classes provide that
> existing Netty classes do not? Netty already provides memory pooling,
> reference counting, slicing, composite buffers, working with direct and
> heap memory. By looking at the git history, the DrillBuf was
> introduced in 2014 and prior to that Netty classes were used directly.
> Unfortunately, the commit that introduced DrillBuf does not provide any
> info why it was introduced and does not have a reference to a JIRA.
> One may argue that DrillBuf is a way for Drill to encapsulate Netty
> ByteBuf and guard other modules that use DrillBuf from Netty ByteBuf API,
> so if Netty decides to change ByteBuf API in the next major release, amount
> of changes will be limited to DrillBuf only. Problem is that DrillBuf
> inherits from Netty AbstractByteBuf, so the above goal is not achieved
> either.
I believe DrillBuf started out as a way to avoid using ByteBuf directly to
protect Drill code from any unwanted changes in ByteBuf. I also believe it
was meant to allow experimentation, letting us switch between direct and
heap memory without changing Drill code. A second and third iteration of
changes to Drill's memory accounting implementation resulted in DrillBuf
becoming the primary interface through which all memory accounting is done.
The prime purpose of UDLEs is to do the actual memory allocation/release
using the underlying memory pool. I have no explanation for the
inconsistent behaviour.
One other important point: IMO there should be no direct access to
PlatformDependent in ValueVectors or other code in Drill unless an argument
can be made for performance. There is one API in
that appears to use memory address directly, that we should probably

I agree that we should clean this up. Let me know if you want me to assist
in this cleanup or at least in formalising the APIs for DrillBuf and UDLE.

>>> 1. Boundary checking (on/off based on a flag or assertions being
>>> enabled/disabled, always on, always off, any other suggestions)
>> By understanding the design, we can see that we do, in fact, need both
>> checked and unchecked methods. The row set mechanisms takes it upon
>> themselves to have sufficient context to ensure that memory access is
>> always within bounds, and so can benefit from the unchecked methods. As we
>> said, we need debug-time bounds checks to catch errors during development.
>> On the other hand, value vectors should probably be protected by using
>> checked methods because they do not have intrinsic mechanisms that ensure
>> correct access. With vectors, the memory location to access is set by the
>> caller (each operator) and there is no guarantee that all this code is
>> correct all the time. (Though, it probably is right now because if it
>> wasn't we'd get errors.)
> I don't see a difference in bounds checking requirements between row set
> mechanism and value vectors as value vectors do have "safe" methods or
> intrinsic mechanism that ensures correct access. If not all operators use
> "safe" methods, than that operator should provide a guarantee. At the end,
> if an operator accesses memory out of allocated bounds, boundary checking
> will not fix it. If there is a bug in row set mechanism, value vectors or
> an operator, the end result (crash of the JVM) is the same for all.
>> The proposal just made represents a change; currently the two mechanisms
>> use the same set of methods, which puts us into the "should we turn on
>> bounds checks for everyone or turn them off for everyone" dilemma.
>> This is a technical design decision, not a community preference (other
>> than that we'd prefer that stuff works...)
> On the dev list, almost everything is a technical design decision and the
> community (compared to Drill users who prefer that the community makes a
> choice and stuff works) needs to agree on how to proceed and what
> development practices to adopt and follow.

Bounds checking, or too much of it, is a problem with the current
implementation. At one time I counted a single value written to a nullable
varchar going through *eight* bounds checks. Most of the bounds checks were
in the value vectors code, but DrillBuf introduced a couple as well. Since
the initial Drill code provided no mechanism for operators to write safely,
bounds checking was introduced first in DrillBuf and then in vectors (when
the former proved inadequate). Ideally, we want at most one bounds check
at the time of write; the only question is where we should do it.
It seems that doing the bounds check at the application level is the right
way to go, and I believe we have general agreement on that. I believe there
is also agreement that we should be able to turn on checking for debug.
As Paul has pointed out, the row set mechanism provides the safe-write
guarantee at the operator level, so we can proceed with an unchecked path
in both vectors and DrillBuf.
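To make the checked/unchecked split concrete, here is a minimal sketch.
The class, its method names, and the plain ByteBuffer standing in for a
DrillBuf are all invented for illustration; only the pattern (one checked
path for vector-style callers, one assertion-guarded unchecked path for
callers that enforce bounds themselves) is the point.

```java
import java.nio.ByteBuffer;

// Illustrative sketch, not Drill code: one bounds check at write time,
// with an unchecked fast path for callers that guarantee the range.
public class BoundedWriter {
    private final ByteBuffer buf;   // stand-in for a DrillBuf's direct memory

    public BoundedWriter(int capacity) {
        this.buf = ByteBuffer.allocateDirect(capacity);
    }

    /** Checked path: for callers with no intrinsic safety guarantee. */
    public void setIntSafe(int index, int value) {
        if (index < 0 || index + Integer.BYTES > buf.capacity()) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        buf.putInt(index, value);
    }

    /** Unchecked path: for row-set-style callers that enforce bounds
     *  themselves. The assert keeps the check alive in debug runs only. */
    public void setIntUnsafe(int index, int value) {
        assert index >= 0 && index + Integer.BYTES <= buf.capacity();
        buf.putInt(index, value);
    }

    public int getInt(int index) {
        return buf.getInt(index);
    }
}
```

With assertions disabled (the production default), the unsafe path compiles
down to a bare write, which is the "at most one bounds check" goal above.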

>>> 2. Ref count checking (delegate to netty or have a separate mechanism to
>>> enable/disable, always on or off)
>> Ref counts are absolutely necessary, in the current design, for the
>> reasons explained earlier: a single memory block can be shared by multiple
>> DrillBufs. We have no other way at present to know when the last reference
>> goes away if we don't have ref counts.
>> To deprecate reference counts, we'd have to rework the way that memory is
>> transferred between operators. We'd have to deprecate shared buffers. (Or,
>> we'd have to move to the fixed blocks mentioned earlier; but even then we'd
>> need ref counts if a single sender can feed data to multiple fragments
>> without copies.)
>> Again, this is not a preference issue, it is a fundamental design issue
>> (unless you know of a trick to remove the need for ref counts, in which
>> case please do propose it.) Or, if there is a better way to implement
>> bounds checks that is faster or simpler, please do propose that.
> The question is not about deprecating reference count mechanism. Question
> is whether or not to check that reference count is not zero every time
> DrillBuf is used (see ensureAccessible()).
I don't think ensureAccessible needs to be called for every read operation
except in debug. AFAIK, it should be turned off unless debug is enabled
(checkAccessible is true unless io.netty.buffer.bytebuf.checkAccessible is
set to false).
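To separate the two questions being discussed here, a minimal stand-in may
help (the class below is invented for illustration, not Drill's allocator):
the ref count itself is what lets the last of several sharing buffers free
the block, while an ensureAccessible-style zero-count check is only a guard
against use-after-release.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: invented names, not Drill's real classes. Models why ref
// counts are needed when multiple DrillBufs share one memory block, and
// what an ensureAccessible()-style check guards against.
public class SharedBlock {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private boolean freed;

    public void retain() {
        refCnt.incrementAndGet();
    }

    public void release() {
        if (refCnt.decrementAndGet() == 0) {
            freed = true;   // stand-in for returning memory to the pool
        }
    }

    public void ensureAccessible() {   // analogous to Netty's zero-count check
        if (refCnt.get() == 0) {
            throw new IllegalStateException("refCnt: 0");
        }
    }

    public boolean isFreed() {
        return freed;
    }

    public static void main(String[] args) {
        SharedBlock block = new SharedBlock();   // first buffer owns the block
        block.retain();                          // second buffer shares the same block
        block.release();                         // first owner done; memory must stay live
        System.out.println(block.isFreed());     // false
        block.release();                         // last reference gone; block is freed
        System.out.println(block.isFreed());     // true
    }
}
```

The counting in release() is essential to correctness; the check in
ensureAccessible() is the per-access cost that can be made debug-only.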

>>> 3. Usage of UDLE
>> If we meet the design goals stated earlier, and DrillBuf is the only
>> interface to memory, then we could change the internal representation if
>> there were a good reason to do so. Since you listed UDLE, you probably have
>> an idea in mind. To make the code simpler? Faster? Perhaps make a proposal.
> IMO, it will be good to make DrillBuf code simpler and consistent. As the
> first step, it will be good to understand existing design and
> implementation in the absence of comments and/or JIRA. For example, where
> UnsafeDirectLittleEndian.getLong(int index) is used? Or, why
> UnsafeDirectLittleEndian overrides "public int setBytes(int index,
> InputStream in, int length)" and does not override "public ByteBuf
> setBytes(int index, ByteBuf src, int length)"?
There is no (satisfactory) explanation for this. It should be one or the
other.

>>> 4. Changing DrillBuf to follow Netty convention
>> Why? Vectors are not byte buffers; they just happen to use them as
>> explained earlier. Vectors are actually, well, vectors of fixed types with
>> rigid structure, so DrillBuf should provide methods that implement that
>> design. Further, Netty never uses DrillBuf so who would benefit from an API
>> change? Just to make the code cleaner? To gain performance? Something else?
>> Again, proposal? Note that the comment about checked and unchecked methods
>> made earlier suggests we'd make DrillBuf even *less* like Netty...
> I am not sure I understand your reference to Vectors, can you please
> explain. I do not propose to change DrillBuf to follow Vectors API. The
> proposal is to make DrillBuf methods that override Netty methods to be
> consistent with the rest of Netty API. Note that it is not correct that
> Netty never uses DrillBuf API. It does when DrillBuf delegates to
> UnsafeDirectLittleEndian or does not override inherited methods. In many
> cases Netty implementation relies on abstract methods implemented in
> subclasses. Netty also has unchecked variants and checked methods use
> unchecked most of the time.
Sure. It would be a good policy for the DrillBuf methods to follow the same
convention as Netty. We might as well do that at the same time as the
cleanup.
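For readers not familiar with the convention being referenced: Netty's
checked accessors validate once, then delegate to an underscore-prefixed
unchecked variant (e.g. getLong calling _getLong). A rough sketch, using an
invented class over a plain array rather than real Netty internals:

```java
// Illustrative only: mirrors Netty's getLong(...) -> _getLong(...) pattern,
// where the public checked method does the bounds check exactly once and
// the protected unchecked method does the raw access.
public class ConventionBuf {
    private final long[] data = new long[8];

    public long getLong(int index) {
        checkIndex(index);
        return _getLong(index);       // delegate to the unchecked variant
    }

    public void setLong(int index, long value) {
        checkIndex(index);
        _setLong(index, value);
    }

    // Netty internals and subclasses may call these directly, relying on
    // the caller to have already validated the range.
    protected long _getLong(int index) {
        return data[index];
    }

    protected void _setLong(int index, long value) {
        data[index] = value;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= data.length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

Making DrillBuf's overrides follow this shape would keep them consistent
with the base-class methods Netty itself calls.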

>>> 5. Moving DrillBuf to a different package
>> This is purely a functional question. As the lore has it, DrillBuf is in
>> the Netty package because it needs (or needed) visibility to protected
>> members in Netty classes. If that is no longer true, then lets do move it
>> to the Drill package so we confuse ourselves less about what is and what is
>> not Drill code. (I'm STILL confused...) Have you checked to see if things
>> compile if we do the move?
> There is no need to place a class into the same package to use protected
> members of the base class. There may be classes or fields with package
> private visibility and IMO, it is better not to use them (see my previous
> e-mail). Again, I do not want to surprise the community with JIRA and/or PR
> with massive package name changes and I would like to get the community
> consensus before implementing those changes.
I believe the reason was to access some package private members. If we can
rename the package, then that would be excellent.

 Vlad, if you're going to work on this, let me know how I can help.

