For testing purposes, having a static, configurable batch size (in number of
rows) would definitely help.
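
To make the idea concrete, here is a rough sketch of what I have in mind (the
option key and class name are made up for illustration, not existing Drill
code): a single shared value, defaulting to today's ~4096 rows, that operators
and record readers consult instead of their own hardcoded limits.

    // Illustrative sketch only; "drill.exec.testing.target_batch_rows" is a
    // hypothetical option key, not an existing Drill configuration setting.
    import java.util.Properties;

    public final class BatchSizing {
      // Default mirrors the ~4096-row limit many operators use today.
      public static final int DEFAULT_TARGET_ROWS = 4096;

      private final int targetRows;

      public BatchSizing(Properties config) {
        this.targetRows = Integer.parseInt(config.getProperty(
            "drill.exec.testing.target_batch_rows",
            String.valueOf(DEFAULT_TARGET_ROWS)));
      }

      public int targetRows() {
        return targetRows;
      }
    }

In a unit test the value could be dropped to, say, 2 or 3 rows so that even a
tiny input file spans several batches and exercises the boundary handling.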

> On Jan 9, 2015, at 5:45 PM, Aman Sinha <[email protected]> wrote:
> 
> Yes, in the past we have talked about this primarily in the context of unit
> testing...certainly we want to be able to write tests with small input data
> and exercise the batch boundaries.  Performance tuning is an added benefit
> once we can do some performance characterizations.
> We might want to think about whether the batch size should be a static
> value or something that is determined once the 'fast schema' is known for
> the leaf operators.  The batch size could be a function of the row width...
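> 
> As a purely illustrative sketch (not an existing Drill API), the row count
> could be derived from an estimated row width and a target batch footprint
> once the schema is known, clamped to sane bounds:
> 
>     // Illustration only: pick a per-batch row count from the row width.
>     static int rowsForBatch(int estimatedRowWidthBytes, int targetBatchBytes,
>                             int minRows, int maxRows) {
>       int rows = targetBatchBytes / Math.max(estimatedRowWidthBytes, 1);
>       return Math.max(minRows, Math.min(rows, maxRows));
>     }
> 
>     // e.g. rowsForBatch(256, 1 << 20, 128, 65536) == 4096 for ~256-byte rows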
> 
> Aman
> 
> On Fri, Jan 9, 2015 at 5:09 PM, Jason Altekruse <[email protected]>
> wrote:
> 
>> Hello Drillers,
>> 
>> Currently each of the physical operators in Drill has its own way of
>> specifying how many records it will try to produce in a single batch. For
>> some operators like project, the outgoing batch will be the same size as
>> the incoming, in the case of a projection with no evaluations. If the size
>> of the data is changing in a projection, such as converting a numeric type
>> to varchar, we cannot guess how much memory will be needed in the outgoing
>> buffer, so we may have to cut off the first batch once we run out of space
>> and separately handle the overflowing data.
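>> 
>> Roughly, that copy loop looks something like the following (names made up,
>> using a plain List in place of Drill's value vectors):
>> 
>>     // Illustration only: convert int rows to varchar, cutting off the batch
>>     // once the outgoing buffer would overflow; leftover rows start the next
>>     // batch.
>>     static int copyUntilFull(int[] incoming, int start,
>>                              java.util.List<String> outgoing,
>>                              int outgoingCapacityBytes) {
>>       int usedBytes = 0;
>>       int row = start;
>>       while (row < incoming.length) {
>>         String widened = Integer.toString(incoming[row]);  // numeric -> varchar
>>         if (usedBytes + widened.length() > outgoingCapacityBytes) {
>>           break;                      // out of space: emit this batch now
>>         }
>>         outgoing.add(widened);
>>         usedBytes += widened.length();
>>         row++;
>>       }
>>       return row;   // caller resumes from here for the overflow rows
>>     }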
>> 
>> In other operators, where the incoming streams cause new outgoing records to
>> be spawned, we cannot make a guess about the outgoing batch size; we just
>> need to keep producing rows and cutting off batches as we run out of space.
>> Rather than hit exceptions in all of these cases, many of the operators
>> terminate their loop based on some expected number of rows in a batch, which
>> is generally around 4096. The record readers also define such limits.
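>> 
>> In other words, most of these loops are bounded by something like the
>> following (made-up names; today the 4096 is hardcoded separately in each
>> operator or reader):
>> 
>>     // Illustration only: produce rows until the input runs dry or the
>>     // per-batch row limit is reached, whichever comes first.
>>     static final int TARGET_ROWS_PER_BATCH = 4096;
>> 
>>     static int fillBatch(java.util.Iterator<String> input,
>>                          java.util.List<String> batch) {
>>       int produced = 0;
>>       while (input.hasNext() && produced < TARGET_ROWS_PER_BATCH) {
>>         batch.add(input.next());
>>         produced++;
>>       }
>>       return produced;  // caller emits the batch and calls again for the rest
>>     }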
>> 
>> I believe standardizing this value and making it configurable may be useful
>> for both debugging and tuning Drill. We have often found bugs around batch
>> boundary conditions, which necessitates generating larger test cases to
>> reproduce the problems and to create unit tests once the issues are fixed.
>> I'm thinking that if we could lower this value, we could write more concise
>> tests that demonstrate the boundary conditions with smaller input files and
>> test definitions.
>> 
>> This could also be useful for tuning Drill. While we may not want to make
>> this option available in production, we could use it in the meantime to
>> drive efforts to identify the best values in different scenarios when we
>> stretch the limits of Drill. In a brief discussion, Steven mentioned that in
>> some of his testing he saw performance gains when increasing the value from
>> 4000 to 32k. This isn't a strong argument in itself for pushing up the
>> default, as it will increase memory requirements and will likely hurt us in
>> multi-user environments and in environments running many concurrent queries.
>> In these cases we may need to automatically throttle back the batch size to
>> reduce the overall memory usage of any particular operation.
>> 
>> Making this change would touch a fairly large number of files, but I think
>> the possible benefits could justify it; I just wanted to collect thoughts
>> from the community.
>> 
>> - Jason
>> 
