For testing purposes, having a static and configurable batch size, in number of rows, will definitely help.
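To make the idea concrete, here is a minimal sketch (not actual Drill code) of what a shared, configurable row limit could look like in place of the per-operator hard-coded constants. The class, field, and fallback behavior are all illustrative assumptions; the only number taken from the thread is the ~4096-row default.

```java
/**
 * Minimal sketch, not actual Drill code: an operator/reader loop that cuts
 * off the outgoing batch at a configurable row limit instead of a
 * hard-coded 4096. Names and plumbing here are hypothetical.
 */
public final class BatchRowLimit {
  public static final int DEFAULT_ROWS_PER_BATCH = 4096;

  private final int rowsPerBatch;

  public BatchRowLimit(int configuredRows) {
    // Fall back to the historical default when the option is unset (<= 0).
    this.rowsPerBatch = configuredRows > 0 ? configuredRows : DEFAULT_ROWS_PER_BATCH;
  }

  /** Example loop: emit rows until the source is exhausted or the batch is full. */
  public int fillBatch(java.util.Iterator<int[]> source, java.util.List<int[]> outgoing) {
    int rows = 0;
    while (rows < rowsPerBatch && source.hasNext()) {
      outgoing.add(source.next());   // stand-in for writing into value vectors
      rows++;
    }
    return rows;                     // caller starts a new batch if source.hasNext()
  }
}
```

Lowering the configured value in tests would let a small input file span several batches, exercising the boundary conditions without large generated data sets.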
> On Jan 9, 2015, at 5:45 PM, Aman Sinha <[email protected]> wrote:
>
> Yes, in the past we have talked about this primarily in the context of unit
> testing... certainly we want to be able to write tests with small input data
> and exercise the batch boundaries. Performance tuning is an added benefit
> once we can do some performance characterizations.
> We might want to think about whether the batch size should be a static
> value or something that is determined once the 'fast schema' is known for
> the leaf operators. The batch size could be a function of the row width...
>
> Aman
>
> On Fri, Jan 9, 2015 at 5:09 PM, Jason Altekruse <[email protected]>
> wrote:
>
>> Hello Drillers,
>>
>> Currently each of the physical operators in Drill has its own way of
>> specifying how many records it will try to produce in a single batch. For
>> some operators, like Project, the outgoing batch will be the same size as
>> the incoming batch in the case of a projection with no evaluations. If the
>> size of the data changes in a projection, such as converting a numeric type
>> to varchar, we cannot guess how much memory will be needed in the outgoing
>> buffer, so we may have to cut off the first batch once we run out of space
>> and separately handle the overflowing data.
>>
>> In other operators, where the incoming streams cause the spawning of new
>> outgoing records, we cannot guess the outgoing batch size; we just need to
>> keep producing rows and cutting off batches as we run out of space. Rather
>> than hit exceptions in all cases, many of the operators terminate their
>> loops at some expected number of rows per batch, generally around 4096.
>> The record readers also define such limits.
>>
>> I believe standardizing this value and making it configurable may be useful
>> for both debugging and tuning Drill. We have often found bugs around batch
>> boundary conditions, which often necessitates generating larger test cases
>> to reproduce problems and create unit tests once the issues are fixed. I'm
>> thinking that if we could lower this value, we may be able to write more
>> concise tests that demonstrate the boundary conditions with smaller input
>> files and test definitions.
>>
>> This could also be useful for tuning Drill. While we may not want to make
>> this option available in production, we could use it in the meantime to
>> drive efforts to identify the best values in different scenarios when we
>> stretch the limits of Drill. After a brief discussion with Steven, he said
>> that in some of his testing he was able to see some performance gains by
>> increasing the value from 4000 to 32k. This isn't a strong argument in
>> itself for pushing up the default, as it will increase memory requirements
>> and will likely hurt us in multi-user environments running many concurrent
>> queries. In those cases we may need to automatically throttle back the
>> batch size to reduce the overall memory usage of any particular operation.
>>
>> Making this change would touch a fairly large number of files, but I think
>> the possible benefits could justify it; I just wanted to collect thoughts
>> from the community.
>>
>> - Jason
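
On Aman's point about deriving the batch size from the row width once the fast schema is known, a rough sketch of that calculation is below. The memory target, bounds, and class name are placeholder assumptions for discussion, not Drill defaults.

```java
// Sketch of the row-width idea: once the schema of a leaf operator is known,
// derive the row count per batch from an estimated row width and a per-batch
// memory target. All constants here are assumed values for illustration.
public final class RowWidthBatchSizer {
  private static final int MIN_ROWS = 1;
  private static final int MAX_ROWS = 64 * 1024;            // assumed upper bound
  private static final long BATCH_MEMORY_TARGET = 8 << 20;  // assumed 8 MB per batch

  /** estimatedRowWidthBytes would come from the known schema: fixed-width
   *  columns plus an average estimate for variable-width ones. */
  public static int rowsForBatch(long estimatedRowWidthBytes) {
    if (estimatedRowWidthBytes <= 0) {
      return 4096; // fall back to the current static default
    }
    long rows = BATCH_MEMORY_TARGET / estimatedRowWidthBytes;
    return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
  }
}
```

A static row count is the simpler first step for tests; a width-based target like this could come later once we want to bound per-batch memory rather than row count.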
