Hello Drillers,

Currently each of the physical operators in Drill has its own way of specifying how many records it will try to produce in a single batch. For some operators, like project, the outgoing batch will be the same size as the incoming one in the case of a projection with no evaluations. If the size of the data changes in a projection, such as when converting a numeric type to varchar, we cannot guess how much memory will be needed in the outgoing buffer, so we may have to cut off the first batch once we run out of space and handle the overflowing data separately.
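To make that cutoff-and-overflow behavior concrete, here is a minimal sketch. The class and method names are made up for illustration and are not actual Drill code; it just shows filling a byte-budgeted outgoing batch and carrying the remaining rows into the next one when the encoded size can't be predicted up front:

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectOverflow {
    // Copies encoded rows into byte-budgeted output batches, starting a
    // new batch whenever the next row would exceed the budget.
    static List<List<String>> project(List<Integer> in, int batchBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int used = 0;
        for (int v : in) {
            String encoded = Integer.toString(v); // output size not knowable up front
            if (!current.isEmpty() && used + encoded.length() > batchBytes) {
                batches.add(current);        // cut off the full batch
                current = new ArrayList<>(); // overflow row starts the next one
                used = 0;
            }
            current.add(encoded);
            used += encoded.length();
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }

    public static void main(String[] args) {
        // Rows encoding to 1, 2, 3, and 4 bytes against a 4-byte budget
        // split into three batches: ["1", "22"], ["333"], ["4444"].
        System.out.println(project(List.of(1, 22, 333, 4444), 4));
    }
}
```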
In other operators, where the incoming streams cause new outgoing records to be spawned, we cannot make a guess about the outgoing batch size; we just need to keep producing rows and cutting off batches as we run out of space. Rather than hit exceptions in all cases, many of the operators have a loop termination based on some expected number of rows in a batch, generally around 4096. The record readers also define such limits. I believe standardizing this value and making it configurable would be useful for both debugging and tuning Drill.

We have often found bugs around batch boundary conditions, which often necessitates generating large test cases to reproduce the problems and create unit tests once the issues are fixed. If we could lower this value, we might be able to write more concise tests that demonstrate the boundary conditions with smaller input files and test definitions.

This could also be useful for tuning Drill. While we may not want to make this option available in production, we could use it in the meantime to drive efforts to identify the best values in different scenarios as we stretch the limits of Drill. After a brief discussion with Steven, he said that in some of his testing he saw performance gains from increasing the value from 4000 to 32k. This isn't a strong argument in itself for pushing up the default, as it will increase memory requirements and will likely hurt us in multi-user environments running many concurrent queries. In those cases we may need to automatically throttle back the batch size to reduce the overall memory usage of any particular operation.

Making this configurable would be a code change touching a fairly large number of files, but I think the possible benefits could justify it. I just wanted to collect thoughts from the community.

- Jason
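P.S. To illustrate the idea of a single configurable row target, here is a rough sketch. The names (DEFAULT_TARGET_BATCH_ROWS, toBatches) are hypothetical, not Drill APIs; the point is that one shared, tunable value replaces the ad-hoc 4096-ish constants, and lowering it lets a tiny input exercise the batch boundary:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSizing {
    // One shared default instead of a hardcoded constant in each operator.
    static final int DEFAULT_TARGET_BATCH_ROWS = 4096;

    // Splits an incoming stream of rows into batches of at most targetRows
    // rows, mirroring the loop-termination check operators would share.
    static List<List<Integer>> toBatches(List<Integer> rows, int targetRows) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += targetRows) {
            batches.add(new ArrayList<>(
                rows.subList(i, Math.min(i + targetRows, rows.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add(i);
        // Lowering the target to 4 rows splits 10 input rows into 3 batches,
        // so a boundary-condition test needs only a handful of rows.
        System.out.println(toBatches(rows, 4).size()); // prints 3
    }
}
```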
