Paul Rogers created DRILL-5593:
----------------------------------
Summary: Modernize Drill's memory allocator to reflect current usage
Key: DRILL-5593
URL: https://issues.apache.org/jira/browse/DRILL-5593
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.10.0
Reporter: Paul Rogers
Drill's memory allocator is quite sophisticated. But, as Drill moves toward
improved resource management, the design of the current allocator no longer
aligns well with the overall resource management design.
The current allocator:
* Provides a separate allocator and accountant for each operator.
* Enforces a hard memory limit for each operator, causing an OOM error when the
operator exceeds the per-operator limit.
* Provides a complex transfer mechanism that moves memory ownership from one
operator to another as batches move downstream.
* Allows a buffer to be shared by multiple allocators, with one allocator being
the "owning" allocator.
* Allows a memory block to be shared by multiple buffers (as occurs when
deserializing a record batch from the wire).
* Provides a tree of allocators in which child allocators can ask parents for
more memory and parents provide that memory out of their own allocation.
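
The accounting behavior listed above (per-operator limits, a parent/child tree,
and ownership transfer between operators) can be sketched roughly as follows.
This is a minimal, single-threaded illustration, not Drill's actual allocator:
the class {{SketchAllocator}} and all of its methods are hypothetical names, and
it tracks byte counts only, with no real buffers behind them.

```java
/** Hypothetical accounting-only allocator node; not Drill's real API. */
final class SketchAllocator {
    private final String name;
    private final SketchAllocator parent; // null for the root allocator
    private final long limit;             // hard limit, in bytes
    private long allocated;               // bytes charged here, children included

    SketchAllocator(String name, SketchAllocator parent, long limit) {
        this.name = name;
        this.parent = parent;
        this.limit = limit;
    }

    /** Creates a child whose allocations are also charged to this node. */
    SketchAllocator newChild(String childName, long childLimit) {
        return new SketchAllocator(childName, this, childLimit);
    }

    /** Charges bytes to this node and every ancestor; false means "OOM". */
    boolean allocate(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException(name + ": negative size");
        }
        // Check every limit up the tree before changing any counter.
        for (SketchAllocator a = this; a != null; a = a.parent) {
            if (a.allocated + bytes > a.limit) {
                return false;
            }
        }
        for (SketchAllocator a = this; a != null; a = a.parent) {
            a.allocated += bytes;
        }
        return true;
    }

    /** Returns bytes to this node and every ancestor. */
    void release(long bytes) {
        if (bytes < 0 || bytes > allocated) {
            throw new IllegalArgumentException(name + ": bad release of " + bytes);
        }
        for (SketchAllocator a = this; a != null; a = a.parent) {
            a.allocated -= bytes;
        }
    }

    /** Moves the accounting for bytes to target, as when a batch moves downstream. */
    boolean transferTo(SketchAllocator target, long bytes) {
        release(bytes);
        if (target.allocate(bytes)) {
            return true;
        }
        allocate(bytes); // roll back: the space just released is still free
        return false;
    }

    long allocated() {
        return allocated;
    }
}
```

Even in this reduced form, an allocation can fail either at the operator's own
limit or at any ancestor's limit, which hints at why the real mechanism is hard
to reason about.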
The current design appears to have been an attempt to allow operators to
negotiate among themselves for memory usage. The idea seems to be that any
given operator uses its assigned memory. If it needs more, it asks the parent
allocator for more. If the parent can't provide more, the child operator sends
a {{OUT_OF_MEMORY}} signal downstream and some downstream operator must give up
some of its memory (perhaps by spilling) so that the upstream operator can
proceed.
The challenge is that only the framework was implemented, not the intended
negotiation mechanisms. As a result, the current allocator presents challenges:
* Drill is moving toward a planned memory allocation system: the planner
assigns memory limits to each fragment (for the in-flight batch overhead) and
to each buffering operator.
* Memory is then managed at the fragment level, and per operator, but only for
buffering operators.
* Memory for other operators (scan, select, project, etc.) is completely
determined by batch size; these operators have no way to deal with OOM
conditions.
* The {{OUT_OF_MEMORY}} iterator status never worked. (It is hard to imagine
how, say, a scan operator would run out of memory on column d within (a, b, c,
d, e, f), remember its state, hold onto the d value, send the signal
downstream, then resume where it left off. The code would become even more
complex than it already is.)
* Code now must rediscover the memory used by each batch just to ensure that it
never exceeds the per-operator memory limits. The sort, in particular, is
infamous for OOM on SV2 allocation because a batch is so large that it fills up
the allocator, causing the next allocation (the SV2) to fail -- but only for
accounting reasons.
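
The SV2 failure in the last item comes down to simple arithmetic, sketched
below. The class {{Sv2BudgetSketch}}, its method names, and the sample numbers
are hypothetical, and the sketch assumes an SV2 costs two bytes per row. The
point is only that an incoming batch can fit under the operator limit while
the batch plus its SV2 does not.

```java
/** Hypothetical helper showing why a large batch leaves no room for an SV2. */
final class Sv2BudgetSketch {
    // Assumption: a two-byte selection vector entry per row.
    static final int SV2_BYTES_PER_ROW = 2;

    /** Bytes needed for the SV2 of a batch with rowCount rows. */
    static long sv2Bytes(int rowCount) {
        return (long) rowCount * SV2_BYTES_PER_ROW;
    }

    /** True if the batch and its SV2 both fit under the operator limit. */
    static boolean fits(long operatorLimit, long batchBytes, int rowCount) {
        return batchBytes + sv2Bytes(rowCount) <= operatorLimit;
    }

    /** Largest batch, in bytes, that still leaves room for the SV2. */
    static long maxBatchBytes(long operatorLimit, int rowCount) {
        return Math.max(0L, operatorLimit - sv2Bytes(rowCount));
    }
}
```

For example, with a hypothetical 1,000,000-byte limit and 4,096 rows, a
999,000-byte batch is itself under the limit, yet the 8,192-byte SV2 pushes the
total over it, so the SV2 allocation fails purely on accounting grounds.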
Two very important parts of the current allocator must be retained: the "fresh"
(one buffer per vector) and deserialized (shared buffer for all vectors) modes,
and the ability for a single deserialized buffer to be shared by multiple
fragments.
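
The difference between the two modes can be illustrated with a small sketch.
It uses plain {{java.nio.ByteBuffer}} as a stand-in for Drill's own buffer
classes, and the class and method names are hypothetical. In the "fresh" case
each vector has its own backing memory; in the deserialized case every vector
is a view onto one shared block.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of per-vector buffers vs. one shared buffer. */
final class BufferModesSketch {

    /** "Fresh" mode: each vector gets its own backing buffer. */
    static ByteBuffer[] fresh(int[] vectorSizes) {
        ByteBuffer[] vectors = new ByteBuffer[vectorSizes.length];
        for (int i = 0; i < vectorSizes.length; i++) {
            vectors[i] = ByteBuffer.allocate(vectorSizes[i]);
        }
        return vectors;
    }

    /** "Deserialized" mode: every vector is a slice of one shared buffer. */
    static ByteBuffer[] deserialized(ByteBuffer wire, int[] vectorSizes) {
        ByteBuffer[] vectors = new ByteBuffer[vectorSizes.length];
        int offset = 0;
        for (int i = 0; i < vectorSizes.length; i++) {
            // duplicate() shares the bytes but has its own position and limit.
            ByteBuffer view = wire.duplicate();
            view.limit(offset + vectorSizes[i]);
            view.position(offset);
            vectors[i] = view.slice(); // window covering [offset, offset + size)
            offset += vectorSizes[i];
        }
        return vectors;
    }
}
```

Because the slices share memory, the shared block cannot be freed until every
vector that views it is done, which is part of what makes the accounting for
this mode harder than for the fresh mode.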
As a result, this is a complex design task, not a simple bug fix.