Hello Drillers,

Currently each of the physical operators in Drill has its own way of
specifying how many records it will try to produce in a single batch. For
some operators, like project, the outgoing batch will be the same size as
the incoming one, as in the case of a projection with no evaluations. If
the size of the data changes in a projection, such as converting a numeric
type to varchar, we cannot guess how much memory will be needed in the
outgoing buffer, so we may have to cut off the first batch once we run out
of space and handle the overflowing data separately.
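
To make the cutoff/overflow idea concrete, here is a minimal sketch (not
actual Drill code; the names and the fixed byte budget are made up for
illustration) of copying converted values into an outgoing buffer until it
fills, then starting the next batch from the first row that did not fit:

    import java.util.ArrayList;
    import java.util.List;

    public class BatchCutoffSketch {

        // Hypothetical per-batch byte budget for the outgoing buffer.
        static final int OUTGOING_BUFFER_BYTES = 1024;

        // Copies as many converted rows as fit and returns the index of the
        // first row that did not fit, so the caller can start the next batch
        // there. Assumes a single row always fits within the budget.
        static int copyUntilFull(List<String> convertedRows, int start, List<String> outgoing) {
            int bytesUsed = 0;
            for (int i = start; i < convertedRows.size(); i++) {
                String row = convertedRows.get(i);
                if (bytesUsed + row.length() > OUTGOING_BUFFER_BYTES) {
                    return i; // overflow: this row belongs to the next batch
                }
                outgoing.add(row);
                bytesUsed += row.length();
            }
            return convertedRows.size();
        }

        public static void main(String[] args) {
            List<String> incoming = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                incoming.add(Long.toString(i)); // e.g. numeric values converted to varchar
            }
            int next = 0;
            int batchCount = 0;
            while (next < incoming.size()) {
                List<String> outgoing = new ArrayList<>();
                next = copyUntilFull(incoming, next, outgoing);
                batchCount++;
                System.out.println("batch " + batchCount + " holds " + outgoing.size() + " rows");
            }
        }
    }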

In other operators, where the incoming streams cause new outgoing records
to be spawned, we cannot make a guess about the outgoing batch size; we
just need to keep producing rows and cutting off batches as we run out of
space. Rather than hit exceptions in all cases, many of the operators have
a loop termination based on some expected number of rows in a batch,
generally around 4096. The record readers also define such limits.
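
The row-count-based termination looks roughly like the following sketch
(again, assumed names rather than Drill internals): keep producing rows
until either the input is exhausted or the batch reaches a target count.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class RowCountTermination {

        // The value most operators and readers hard-code today is around 4096.
        static final int TARGET_ROWS_PER_BATCH = 4096;

        // Pulls rows from the input until the batch is full or the input ends.
        static <T> List<T> nextBatch(Iterator<T> rows) {
            List<T> batch = new ArrayList<>(TARGET_ROWS_PER_BATCH);
            while (rows.hasNext() && batch.size() < TARGET_ROWS_PER_BATCH) {
                batch.add(rows.next());
            }
            return batch;
        }

        public static void main(String[] args) {
            List<Integer> input = new ArrayList<>();
            for (int i = 0; i < 10000; i++) { input.add(i); }
            Iterator<Integer> it = input.iterator();
            int batches = 0;
            while (it.hasNext()) {
                nextBatch(it);
                batches++;
            }
            System.out.println("produced " + batches + " batches"); // 3 for 10000 rows
        }
    }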

I believe standardizing this value and making it configurable may be useful
for both debugging and tuning Drill. We have often found bugs around batch
boundary conditions, which often necessitates generating larger test cases
to reproduce problems and create unit tests once the issues are fixed. I'm
thinking that if we could lower this value, we could write more concise
tests that demonstrate the boundary conditions with smaller input files and
test definitions.
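
As a rough sketch of what I have in mind (the option name here is purely
hypothetical), the target row count could be read from a configurable
option with the current hard-coded value as the default, so a unit test
could shrink it to a handful of rows and hit batch boundaries with tiny
inputs:

    import java.util.Properties;

    public class ConfigurableBatchSize {

        // Reads a hypothetical option, falling back to today's typical value.
        static int targetRowsPerBatch(Properties options) {
            return Integer.parseInt(options.getProperty("exec.batch.target_row_count", "4096"));
        }

        public static void main(String[] args) {
            Properties testOptions = new Properties();
            testOptions.setProperty("exec.batch.target_row_count", "4");
            System.out.println("test batch size: " + targetRowsPerBatch(testOptions));
        }
    }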

This could also be useful for tuning Drill. While we may not want to make
this option available in production, we could use it in the meantime to
drive efforts to identify the best values in different scenarios when we
stretch the limits of Drill. After a brief discussion with Steven, he said
that in some of his testing he saw performance gains from increasing the
value from 4000 to 32k. This isn't a strong argument in itself for pushing
up the default, as it will increase memory requirements and will likely
hurt us in multi-user environments and environments running many
concurrent queries. In these cases we may need to automatically throttle
back the batch size to reduce the overall memory usage of any particular
operation.

Making this change would touch a fairly large number of files, but I think
the possible benefits could justify it; I just wanted to collect thoughts
from the community.

- Jason
