Hi Vlad,
More responses.
> The same approach [as for internal operators] applies to senders and 
> receivers. Senders gets batches 
from the upstream operators taking ownership of those batches and send 
data to receivers.

Senders receive data from an "upstream" operator, then serialize over the wire. 
As a result, Senders take ownership from the upstream operator, but then must 
transfer ownership to Netty. Here I'll speculate. I believe that we create a 
Netty composite buffer that strings together the buffers that underlie the 
value vectors in the outgoing record batch. (Yes, there are many layers in 
play.)

Netty does not know about our allocator model. It does, however, have a 
reference count. So, my guess is that the Sender somehow gives up ownership of 
the outgoing buffer in the sense of the Drill allocator, but lets Netty drop 
the reference count once Netty has sent the buffer.

I believe you are quite familiar with Netty, so perhaps you can dig around here 
and explain how this actually works.

> Receivers get data from senders and reconstruct 
record batches.

You are right logically. But, physically there is a difference. Data arrives 
via Netty which allocates buffers for the data. Receivers take these raw 
buffers and turn them into batches. Here things get even more complex (if that 
is possible.) The Receiver creates multiple vectors on top of a single Netty 
buffer. That is, multiple vectors were serialized together and were read 
together. Much of the complexity of Drill's memory model comes from the ability 
to create multiple (logical) DrillBufs on top of a single (physical) Netty 
buffer. This is where we need reference counts (so we know when the last shared 
use goes away), and where we need the UDLE/DrillBuf separation.

So, again, Netty does not play the Drill "ownership" game, it only does 
reference counts. So the Receiver must convert from the Netty reference count 
of the big incoming buffer, to reference counts for each materialized vector, 
and create some kind of entry in Drill's allocator. I'm not sure how this is 
done; it would be great if you could figure this out.

Could this be done differently? Probably. Maybe serialize each buffer by itself 
so that Netty creates separate buffers for each. I'd guess the original authors 
started with this design and moved to the present one, perhaps for performance 
reasons. (Anyone know of the history here?)

> It is the business logic of senders and receivers and 
they may rely on other libraries (rpc and netty) or classes to handle 
serialization/de-serialization, buffering, acknowledgment, back-pressure 
or dealing with network. From other Drill operators point of view, 
senders and receivers are operators responsible for passing record 
batches from one drillbit to another.

True. Senders/Receivers should speak Drill operator protocol on one side, Netty 
protocol on the other. They are adapters. Is this not what you see?

> Following your approach it is necessary to modify MergingReceiver as 
well. It also pulls batches from a queue (see 
MergingRecordBatch.getNext()), but instead of almost immediately passing 
it to a next operator as UnorderReceiver does, MergingReceiver creates a 
new record batch from those batches that it pulls from the queue. To be 
consistent with proposed changes to UnorderReceiver, it is necessary to 
change the ownership of batches that MergingReceiver pulls as well 
especially that MergingReciver may keep reference to the original batch 
much longer compared to UnorderedReceiver (while it waits for batches 
from other drillbits).

I personally don't know the details. But, in general, if one operator passes 
data to another, it should play by the Drill ownership rules if it works with 
vectors. If, instead, it works with buffers, then it should probably play by 
the Netty rules.

> I don't see a reason to modify both UnorderedReceiver and 
MergingReceiver, instead, I think, we should modify allocator used when 
batches are created in the first place before they are added to a queue.

My own suggestion here is that we may want to make use of an old-school 
technique that is still often handy: write up the design. Document the rules 
I've been doing my best to explain above. Add a detailed explanation of how 
Drill interfaces with Netty. Then, think through how we wan to handle the 
Drill-opererator-to-Netty interface.

Another particularly nasty area is the "Mux" operators. Several folks struggled 
to understand them and didn't get very far. This is not a good state to be in. 
We should really understand how they work. Perhaps understanding the most 
complex case will help shed light on the case under discussion.
Thanks,

- Paul


  

Reply via email to