Hello Paul,

Thank you very much for your active participation in this discussion. I agree that we shouldn't blindly accept Arrow as the only option in the world. I would also like to learn more about the fixed-size blocks, so I'll read the paper and hope to have some related ideas to discuss later.
Thanks,
Igor

On Fri, Jan 10, 2020 at 9:17 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Glad to see the Arrow discussion heating up and that it is causing us to
> ask deeper questions.
>
> Here I want to get a bit techie on everyone and highlight two potential
> memory management problems with Arrow.
>
> First: memory fragmentation. Recall that this is how we started on the EVF
> path. Arrow allocates large, variable-size blocks of memory. To quote a
> 35-year-old DB paper [1]: "[V]ariable-sized pages would cause heavy
> fragmentation problems."
>
> Second: the idea of Arrow is that tool A creates a set of vectors that
> tool B will consume. This means that tools A and B have to agree on vector
> (buffer) size. Suppose tool A wants really big batches, but B can handle
> only small batches. In a columnar system, there is no good way to split a
> big batch into smaller ones. One can copy values, but this is exactly what
> Arrow is supposed to avoid.
>
> Hence, when using Arrow, a data producer dictates to Drill a crucial
> factor in memory management: batch size. And Drill, in turn, dictates
> batch size to its clients. It will require complex negotiation logic, all
> to avoid a copy when the tools will communicate via RPC anyway. This is,
> in the larger picture, not a very good design at all. Needless to say, I
> am personally very skeptical of the benefits.
>
> A possible better alternative, one that we prototyped some time back, is
> to base Drill memory on fixed-size "blocks", say 1 MB in size. Any given
> vector can use part of one block, all of one block, or multiple blocks to
> store its data. The blocks are at least as large as the CPU cache lines,
> so we retain that benefit. Memory management is now far easier, and we can
> exploit 40 years of experience in effective buffer management. (Plus, the
> blocks are easy to spill to disk using classic RDBMS algorithms.)
>
> Point is: let's not blindly accept the work that Arrow has done. Let's do
> our homework to figure out the best system for Drill: whether that be
> Arrow, fixed-size buffers, the current vectors, or something else entirely.
>
> Thanks,
> - Paul
>
> [1]
> http://users.informatik.uni-halle.de/~hinnebur/Lehre/2008_db_iib_web/uebung3_p560-effelsberg.pdf
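To make sure I understand the fixed-size block idea before I read the paper, here is a minimal sketch of how I picture it. All names here are hypothetical (this is not Drill's or the prototype's actual allocator); it just shows a vector spreading its data across uniform 1 MB blocks drawn from a free list:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a fixed-size block pool: every allocation is one whole
// block, so the free list holds only one size and cannot fragment.
class BlockAllocator {
    static final int BLOCK_SIZE = 1 << 20; // 1 MB, a multiple of any cache-line size

    private final List<byte[]> freeBlocks = new ArrayList<>();

    // Hand out a block, reusing a previously released one when available.
    byte[] allocate() {
        if (!freeBlocks.isEmpty()) {
            return freeBlocks.remove(freeBlocks.size() - 1);
        }
        return new byte[BLOCK_SIZE];
    }

    void release(byte[] block) {
        freeBlocks.add(block);
    }
}

// A "vector" that spans one or more fixed-size blocks instead of a
// single large, variable-sized buffer.
class BlockVector {
    private final BlockAllocator allocator;
    private final List<byte[]> blocks = new ArrayList<>();
    private int size; // bytes written so far

    BlockVector(BlockAllocator allocator) {
        this.allocator = allocator;
    }

    void writeByte(byte b) {
        int offset = size % BlockAllocator.BLOCK_SIZE;
        if (offset == 0) {
            blocks.add(allocator.allocate()); // grow by whole blocks only
        }
        blocks.get(blocks.size() - 1)[offset] = b;
        size++;
    }

    byte readByte(int index) {
        return blocks.get(index / BlockAllocator.BLOCK_SIZE)
                     [index % BlockAllocator.BLOCK_SIZE];
    }
}
```

If that picture is right, the appeal is clear: the pool never has to coalesce variable-sized holes, and spilling a vector to disk is just writing out whole, uniform blocks. The cost is an extra index computation on access, which I assume is what the paper quantifies.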