Paul Rogers created DRILL-5013:
----------------------------------

             Summary: Heap allocation, data copies in UDLE write path for 
ExternalSortBatch
                 Key: DRILL-5013
                 URL: https://issues.apache.org/jira/browse/DRILL-5013
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Priority: Minor


The ExternalSortBatch (ESB) uses spill-to-disk to sort a large collection of 
records within a limited memory footprint.

When spilling, ESB writes each vector's underlying byte buffer to disk. Since 
the vector is stored in direct memory (not visible to an output stream), the 
code path first makes a temporary on-heap copy.

In particular the code in `io.netty.buffer.PooledUnsafeDirectByteBuf` does the 
following:

{code}
    @Override
    public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException {
        checkIndex(index, length);
        if (length != 0) {
            byte[] tmp = new byte[length];
            PlatformDependent.copyMemory(addr(index), tmp, 0, length);
            out.write(tmp);
        }
        return this;
    }
{code}

The result is that we 1) create a large number of on-heap objects (a new byte 
array for every write), and 2) copy the data twice: once from direct memory to 
the tmp buffer, and once from the tmp buffer into the output stream's own 
buffer.

Two optimizations are possible:

1. Copy the data byte-by-byte from the direct memory buffer to the output 
stream, or
2. Reuse the same tmp buffer across vector writes.

Since the code is in Netty, either option would require writing our own 
"getBytes" method (a misnomer; it really writes bytes).
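
As a rough illustration of option 2, the sketch below copies a region of a 
direct buffer to an output stream in chunks through a single reusable on-heap 
array. It is not Drill or Netty code: the class name ReusableBufferWriter, the 
write() signature, and the chunk-size parameter are all hypothetical, and it 
uses java.nio.ByteBuffer in place of Netty's ByteBuf so that it runs standalone.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Drill or Netty): one tmp array, reused for every write. */
public final class ReusableBufferWriter {
    // Allocated once and reused, instead of "new byte[length]" on each call.
    private final byte[] tmp;

    public ReusableBufferWriter(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.tmp = new byte[chunkSize];
    }

    /** Writes length bytes of src, starting at index, to out. */
    public void write(ByteBuffer src, int index, int length, OutputStream out)
            throws IOException {
        if (index < 0 || length < 0 || index > src.capacity() - length) {
            throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
        }
        // duplicate() shares the memory but has its own position and limit,
        // so the caller's buffer state is left untouched. It does create a
        // small view object, but no per-write byte array.
        ByteBuffer view = src.duplicate();
        view.clear();                 // position = 0, limit = capacity
        view.limit(index + length);
        view.position(index);
        while (view.hasRemaining()) {
            int n = Math.min(tmp.length, view.remaining());
            view.get(tmp, 0, n);      // direct memory -> reusable heap array
            out.write(tmp, 0, n);     // heap array -> output stream
        }
    }
}
```

With a Netty buffer, the inner copy could presumably use the existing 
ByteBuf.getBytes(int index, byte[] dst, int dstIndex, int length) overload 
instead of the ByteBuffer view. This removes the per-write allocation but still 
copies the data twice; a single instance is also not safe to share between 
threads, since the tmp array is shared state.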



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
