Paul Rogers created DRILL-5275:
----------------------------------
Summary: Sort spill serialization is very slow
Key: DRILL-5275
URL: https://issues.apache.org/jira/browse/DRILL-5275
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Fix For: 1.10.0
Drill provides a sort operator that spills to disk. The spill and read
operations use the serialization code in {{VectorAccessibleSerializable}}.
This code, in turn, uses the {{DrillBuf.getBytes()}} method to write to an
output stream. (Yes, the "get" method writes, and the "write" method reads...)
The {{DrillBuf}} method then delegates to the UDLE ({{UnsafeDirectLittleEndian}}) method, which does:
{code}
byte[] tmp = new byte[length];
PlatformDependent.copyMemory(addr(index), tmp, 0, length);
out.write(tmp);
{code}
That is, each write allocates a new heap buffer as large as the data being
written. Since Drill buffers can be quite large (4, 8, 16 MB or larger), these
allocations rapidly fill the heap and trigger frequent GC.
The result is slow performance. On a Mac, with an SSD that can do 700 MB/s of
I/O, we get only about 40 MB/s, very likely because of excessive CPU cost and
GC.
The solution is to allocate a single read or write buffer, then reuse that same
buffer for every read or write. This must be done in
{{VectorAccessibleSerializable}}, as it is a per-thread class that has
visibility into all the buffers to be written.
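A minimal sketch of the reuse idea, not the actual Drill change: it uses a direct {{java.nio.ByteBuffer}} as a stand-in for {{DrillBuf}}, and the class name {{ReusableBufferWriter}} and its scratch-size parameter are hypothetical. The heap buffer is allocated once and the off-heap data is copied through it in chunks, so heap use stays fixed regardless of vector size.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Drill): copies off-heap data to a stream
 * through one reusable heap buffer instead of a new byte[] per write.
 */
final class ReusableBufferWriter {
  // Allocated once; reused for every write this instance performs.
  private final byte[] scratch;

  ReusableBufferWriter(int scratchSize) {
    if (scratchSize <= 0) {
      throw new IllegalArgumentException("scratchSize must be positive");
    }
    this.scratch = new byte[scratchSize];
  }

  /** Writes all remaining bytes of src to out, one scratch-sized chunk at a time. */
  void write(ByteBuffer src, OutputStream out) throws IOException {
    // Work on a duplicate so the caller's position and limit are untouched.
    ByteBuffer view = src.duplicate();
    while (view.hasRemaining()) {
      int n = Math.min(view.remaining(), scratch.length);
      view.get(scratch, 0, n);   // off-heap -> shared heap buffer
      out.write(scratch, 0, n);  // shared heap buffer -> stream
    }
  }
}
```

Because the scratch buffer is shared mutable state, an instance like this is only safe when owned by a single thread, which matches the per-thread use described above.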