Paul Rogers created DRILL-5275:
----------------------------------

Summary: Sort spill serialization is very slow
Key: DRILL-5275
URL: https://issues.apache.org/jira/browse/DRILL-5275
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Fix For: 1.10.0

Drill provides a sort operator that spills to disk. The spill and read operations use the serialization code in {{VectorAccessibleSerializable}}. This code, in turn, uses the {{DrillBuf.getBytes()}} method to write to an output stream. (Yes, the "get" method writes, and the "write" method reads...)

The {{DrillBuf}} method turns around and calls the UDLE ({{UnsafeDirectLittleEndian}}) method, which does:

{code}
byte[] tmp = new byte[length];
PlatformDependent.copyMemory(addr(index), tmp, 0, length);
out.write(tmp);
{code}

That is, each write allocates a new heap buffer as large as the data being written. Since Drill buffers can be quite large (4, 8, 16 MB or larger), these allocations rapidly fill the heap and trigger GC.

The result is slow performance. On a Mac with an SSD that can do 700 MB/s of I/O, we get only about 40 MB/s, very likely because of the excessive CPU cost and GC.

The solution is to allocate a single read or write buffer, then reuse that same buffer for every read or write. This must be done in {{VectorAccessibleSerializable}}, as it is a per-thread class that has visibility to all the buffers to be written.
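
The buffer-reuse idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual fix: {{ReusableBufferWriter}} is a hypothetical class, a direct {{java.nio.ByteBuffer}} stands in for a {{DrillBuf}}, and the 32 KB scratch size is an arbitrary choice. The point is that one heap array is allocated once and the off-heap data is copied through it in chunks, so a 16 MB vector no longer causes a 16 MB heap allocation per write.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Drill): copies off-heap data to a stream
 * through a single heap buffer that is allocated once and reused.
 */
final class ReusableBufferWriter {
    // The one scratch buffer, reused for every chunk of every write.
    private final byte[] scratch;

    ReusableBufferWriter(int scratchSize) {
        if (scratchSize <= 0) {
            throw new IllegalArgumentException("scratchSize must be positive");
        }
        this.scratch = new byte[scratchSize];
    }

    /**
     * Writes all remaining bytes of src to out in scratch-sized chunks.
     * Returns the number of bytes written. The caller's buffer position
     * is left unchanged because the copy works on a duplicate view.
     */
    long write(ByteBuffer src, OutputStream out) throws IOException {
        ByteBuffer view = src.duplicate();
        long total = 0;
        while (view.hasRemaining()) {
            int n = Math.min(view.remaining(), scratch.length);
            view.get(scratch, 0, n);   // off-heap -> reusable heap buffer
            out.write(scratch, 0, n);  // heap buffer -> output stream
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a 1 MB value vector held in direct memory.
        ByteBuffer vector = ByteBuffer.allocateDirect(1 << 20);
        for (int i = 0; i < vector.capacity(); i++) {
            vector.put(i, (byte) i);
        }
        ReusableBufferWriter writer = new ReusableBufferWriter(32 * 1024);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long written = writer.write(vector, out);
        System.out.println(written + " bytes written");
    }
}
```

Because the scratch buffer is instance state, a writer like this is not safe to share across threads. That matches the note above that the buffer should live in a per-thread class.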