Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/754#discussion_r102378905
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/cache/VectorAccessibleSerializable.java
---
@@ -57,6 +57,12 @@
private BatchSchema.SelectionVectorMode svMode =
BatchSchema.SelectionVectorMode.NONE;
private SelectionVector2 sv2;
+ /**
+ * Disk I/O buffer used for all reads and writes of DrillBufs.
+ */
+
+ private byte buffer[] = new byte[32*1024];
--- End diff --
I read an 18 GB file from disk with a simple test program that tries various
buffer sizes.
{code}
32K buffer: Rate: 799 MB/s
64K buffer: Rate: 766 MB/s
{code}
So there seems to be no advantage to a larger buffer. (Tests with smaller
buffers do slow things down, hence the 32K size.)
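For reference, a rough sketch of the kind of micro-benchmark behind those numbers (a hypothetical harness, not the actual test program; it generates a small temp file rather than the 18 GB file used above):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadRateTest {

  // Read the whole file through a heap buffer of the given size,
  // returning the observed throughput in MB/s.
  static double measure(Path file, int bufSize) throws IOException {
    byte[] buffer = new byte[bufSize];
    long bytes = 0;
    long start = System.nanoTime();
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(buffer)) != -1) {
        bytes += n;
      }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    return bytes / secs / (1024 * 1024);
  }

  public static void main(String[] args) throws IOException {
    // 16 MB sample file; the real test used an 18 GB file.
    Path file = Files.createTempFile("readtest", ".dat");
    Files.write(file, new byte[16 * 1024 * 1024]);
    for (int size : new int[] {32 * 1024, 64 * 1024}) {
      System.out.printf("%dK buffer: Rate: %.0f MB/s%n",
          size / 1024, measure(file, size));
    }
    Files.delete(file);
  }
}
```

(Rates are machine-dependent, so no expected output is shown.)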
On direct memory: we can't use direct memory here, because the fundamental
problem is that the data lives in a direct-memory DrillBuf and must be copied
to heap memory for writing. The original code did the copy by allocating a
heap buffer the same size as the vector (16 MB, 32 MB, or larger). This code
does the copy by reusing the same 32K buffer over and over.
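The chunked-copy idea can be sketched as follows, with a direct `ByteBuffer` standing in for the DrillBuf (a simplified illustration, not the actual `VectorAccessibleSerializable` code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class ChunkedWriter {

  // One heap buffer, reused for every write, instead of a heap
  // allocation as large as the whole vector.
  private final byte[] buffer = new byte[32 * 1024];

  public void write(ByteBuffer src, OutputStream out) throws IOException {
    while (src.hasRemaining()) {
      int n = Math.min(buffer.length, src.remaining());
      src.get(buffer, 0, n);   // direct memory -> heap
      out.write(buffer, 0, n); // heap -> disk
    }
  }

  public static void main(String[] args) throws IOException {
    // 1 MB direct buffer standing in for a value vector's DrillBuf.
    ByteBuffer vector = ByteBuffer.allocateDirect(1 << 20);
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    new ChunkedWriter().write(vector, sink);
    System.out.println(sink.size()); // prints 1048576
  }
}
```

The peak heap footprint is 32K regardless of vector size, at the cost of one extra loop iteration per 32K chunk.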
There is no need to hold the buffer on the operator: this class is used for
an entire spill/read session. What may be an issue, however, is the merge
phase of a sort, when many files are open and so many buffers are created.
Since the reads are synchronous, they could share a single buffer.
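The buffer-sharing idea during merge might look something like this (a hypothetical shape, not the actual Drill classes; `readInto` is an illustrative name):

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class SharedBufferReader {

  // One heap buffer shared across all open spill files; safe because
  // the merge reads them synchronously, one at a time.
  private final byte[] shared = new byte[32 * 1024];

  // Copy exactly len bytes from a spill stream into direct memory
  // via the shared heap buffer.
  public void readInto(InputStream in, ByteBuffer dest, int len)
      throws IOException {
    while (len > 0) {
      int n = in.read(shared, 0, Math.min(shared.length, len));
      if (n < 0) {
        throw new EOFException("Unexpected end of spill file");
      }
      dest.put(shared, 0, n); // heap -> direct memory
      len -= n;
    }
  }

  public static void main(String[] args) throws IOException {
    SharedBufferReader reader = new SharedBufferReader();
    ByteBuffer dest = ByteBuffer.allocateDirect(100_000);
    // Two "spill files" read back-to-back through the same shared buffer.
    reader.readInto(new ByteArrayInputStream(new byte[60_000]), dest, 60_000);
    reader.readInto(new ByteArrayInputStream(new byte[40_000]), dest, 40_000);
    System.out.println(dest.position()); // prints 100000
  }
}
```

With N-way merges this caps the read-side heap cost at one 32K buffer instead of N of them.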