Re: [DISCUSS] DrillBuf

Paul Rogers Sat, 07 Apr 2018 23:58:53 -0700

Hi Vlad,

Thanks for the clarifications. My general comment is that it is always good to 
refactor things if that is the fastest way to achieve some goal. It is not 
clear, however, what the goal is here other than code improvement. Since Drill 
still has plenty of opportunities for improvement that will help users, I guess 
it's a bit unclear why we'd clean up code just for its own sake if doing so 
will entail a large amount of work and provide no user benefit.


I say this because I've done quite a bit of work in surrounding code and 
learned how much work it takes to make these kinds of changes and then 
stabilize the result.

On the other hand, if you are working on a memory-related project that is 
making major changes, and these issues are getting in your way, then 
refactoring could well be the fastest way to achieve your project goals. Are 
you working on such a project?

> why was it necessary to introduce DrillBuf and/or UnsafeDirectLittleEndian? 
> What functionality do those classes provide that existing Netty classes do 
> not?
> IMO, it will be good to make DrillBuf code simpler and consistent.

Good questions!  So, it seems that DrillBuf turns out to be a bit of a muddle. 
But, unless it is preventing us from making some desired change, or changing 
things will provide a significant performance boost, I'd guess I'd just hold my 
nose and leave it as is for now.

The same argument applies, by the way, to the value vectors. The value vector 
classes become almost entirely redundant once the row set mechanisms are 
adopted. But, I suspect that value vectors will live on anyway until there is a 
reason to do something with them.

> Question is whether or not to check that reference count is not zero every 
> time DrillBuf is used (see ensureAccessible()).

IMHO, there is no reason to check on every access, except in a "paranoid" debug 
mode. Removing the check might provide a nice little performance bump. Avoiding 
bounds and ref checks was one of the goals of the "unchecked" methods that I 
had in DrillBuf but which we decided to remove... If it turns out that the 
methods used by the row set abstractions do, in fact, do bounds checks, then 
this is a strong case to put the "unsafe" (unchecked) methods back.

> 5. Moving DrillBuf to a different package


I agree with your explanation. However, I assume the original authors were 
forced to put DrillBuf in the Netty package for some reason or other. If that 
reason is no longer valid, then it is usually pretty simple to have your IDE 
move the class to a different package and adjust all its references. This 
improvement, if possible, seems low cost and so might be worth doing.

On the other hand, if the move causes things to break, which causes effort to 
go into changing things, I guess I'd wonder why we can't just leave well enough 
alone and focus on things which are actually broken or could use a performance 
boost.

In short, I'm all for refactoring when it helps us deliver fixes or new 
features to customers. But, I'm struggling to see the user benefit in this 
case. Can you help me to understand the user benefit of these changes?

Thanks,

- Paul

 

    On Saturday, April 7, 2018, 9:35:06 PM PDT, Vlad Rozov <vro...@apache.org> 
wrote:  
 
 Hi Paul,

My comments in-line.

Thank you,

Vlad

On 4/5/18 20:50, Paul Rogers wrote:
> Hi Vlad,
>>  I'd suggest to keep focus on DrillBuf design and implementation as the only 
>>gate for accessing raw (direct) memory.
> I was doing that. By explaining where DrillBuf fits in the overall design, we 
> see that DrillBuf should be the only access point for direct memory. The 
> context explains why this is the right decision. Changes to DrillBuf should 
> support our design as DrillBuf only exists for that purpose.
My concern is not why Drill adopted Netty as Netty does provide a good 
amount of functionality on top of Java networking and NIO. I also do not 
propose to replace Netty with something else. My primary focus for this 
thread is the design and implementation of the DrillBuf Java class 
itself. Namely, why was it necessary to introduce DrillBuf and/or 
UnsafeDirectLittleEndian? What functionality do those classes provide 
that existing Netty classes do not? Netty already provides memory 
pooling, reference counting, slicing, composite buffers, working with 
direct and heap memory. By looking at the DrillBuf.java git history, the 
DrillBuf was introduced in 2014 and prior to that Netty classes were 
used directly. Unfortunately, the commit that introduced DrillBuf does 
not provide any info why it was introduced and does not have a reference 
to a JIRA.

One may argue that DrillBuf is a way for Drill to encapsulate Netty 
ByteBuf and guard other modules that use DrillBuf from Netty ByteBuf 
API, so if Netty decides to change ByteBuf API in the next major 
release, amount of changes will be limited to DrillBuf only. Problem is 
that DrillBuf inherits from Netty AbstractByteBuf, so the above goal is 
not achieved either.
>
>
>> 1. Boundary checking (on/off based on a flag or assertions being 
>> enabled/disabled, always on, always off, any other suggestions)
> By understanding the design, we can see that we do, in fact, need both 
> checked and unchecked methods. The row set mechanisms takes it upon 
> themselves to have sufficient context to ensure that memory access is always 
> within bounds, and so can benefit from the unchecked methods. As we said, we 
> need debug-time bounds checks to catch errors during development.
>
> On the other hand, value vectors should probably be protected by using 
> checked methods because they do not have intrinsic mechanisms that ensure 
> correct access. With vectors, the memory location to access is set by the 
> caller (each operator) and there is no guarantee that all this code is 
> correct all the time. (Though, it probably is right now because if it wasn't 
> we'd get errors.)
I don't see a difference in bounds checking requirements between row set 
mechanism and value vectors as value vectors do have "safe" methods or 
intrinsic mechanism that ensures correct access. If not all operators 
use "safe" methods, than that operator should provide a guarantee. At 
the end, if an operator accesses memory out of allocated bounds, 
boundary checking will not fix it. If there is a bug in row set 
mechanism, value vectors or an operator, the end result (crash of the 
JVM) is the same for all.
>
> The proposal just made represents a change; currently the two mechanisms use 
> the same set of methods, which puts us into the "should we turn on bounds 
> checks for everyone or turn them off for everyone" dilemma.
>
> This is a technical design decision, not a community preference (other than 
> that we'd prefer that stuff works...)
On the dev list, almost everything is a technical design decision and 
the community (compared to Drill users who prefer that the community 
makes a choice and stuff works) needs to agree on how to proceed and 
what development practices to adopt and follow.
>
>> 2. Ref count checking (delegate to netty or have a separate mechanism to 
>> enable/disable, always on or off)
>
> Ref counts are absolutely necessary, in the current design, for the reasons 
> explained earlier: a single memory block can be shared by multiple DrillBufs. 
> We have no other way at present to know when the last reference goes away if 
> we don't have ref counts.
>
>
> To deprecate reference counts, we'd have to rework the way that memory is 
> transferred between operators. We'd have to deprecate shared buffers. (Or, 
> we'd have to move to the fixed blocks mentioned earlier; but even then we'd 
> need ref counts if a single sender can feed data to multiple fragments 
> without copies.)
>
> Again, this is not a preference issue, it is a fundamental design issue 
> (unless you know of a trick to remove the need for ref counts, in which case 
> please do propose it.) Or, if there is a better way to implement bounds 
> checks that is faster or simpler, please do propose that.
The question is not about deprecating reference count mechanism. 
Question is whether or not to check that reference count is not zero 
every time DrillBuf is used (see ensureAccessible()).
>
>
>> 3. Usage of UDLE
>
> If we meet the design goals stated earlier, and DrillBuf is the only 
> interface to memory, then we could change the internal representation if 
> there were a good reason to do so. Since you listed UDLE, you probably have 
> an idea in mind. To make the code simpler? Faster? Perhaps make a proposal.
IMO, it will be good to make DrillBuf code simpler and consistent. As 
the first step, it will be good to understand existing design and 
implementation in the absence of comments and/or JIRA. For example, 
where UnsafeDirectLittleEndian.getLong(int index) is used? Or, why 
UnsafeDirectLittleEndian overrides "public int setBytes(int index, 
InputStream in, int length)" and does not override "public ByteBuf 
setBytes(int index, ByteBuf src, int length)"?

>
>> 4. Changing DrillBuf to follow Netty convention
> Why? Vectors are not byte buffers; they just happen to use them as explained 
> earlier. Vectors are actually, well, vectors of fixed types with rigid 
> structure, so DrillBuf should provide methods that implement that design. 
> Further, Netty never uses DrillBuf so who would benefit from an API change? 
> Just to make the code cleaner? To gain performance? Something else? Again, 
> proposal? Note that the comment about checked and unchecked methods made 
> earlier suggests we'd make DrillBuf even *less* like Netty...
I am not sure I understand your reference to Vectors, can you please 
explain. I do not propose to change DrillBuf to follow Vectors API. The 
proposal is to make DrillBuf methods that override Netty methods to be 
consistent with the rest of Netty API. Note that it is not correct that 
Netty never uses DrillBuf API. It does when DrillBuf delegates to 
UnsafeDirectLittleEndian or does not override inherited methods. In many 
cases Netty implementation relies on abstract methods implemented in 
subclasses. Netty also has unchecked variants and checked methods use 
unchecked most of the time.
>
>
> 5. Moving DrillBuf to a different package
>
>
> This is purely a functional question. As the lore has it, DrillBuf is in the 
> Netty package because it needs (or needed) visibility to protected members in 
> Netty classes. If that is no longer true, then lets do move it to the Drill 
> package so we confuse ourselves less about what is and what is not Drill 
> code. (I'm STILL confused...) Have you checked to see if things compile if we 
> do the move?
There is no need to place a class into the same package to use protected 
members of the base class. There may be classes or fields with package 
private visibility and IMO, it is better not to use them (see my 
previous e-mail). Again, I do not want to surprise the community with 
JIRA and/or PR with massive package name changes and I would like to get 
the community consensus before implementing those changes.
>
>
> Thanks,
>
> - Paul
>

Re: [DISCUSS] DrillBuf

Reply via email to