With the average row size method, since I know the number of rows and the average size of each column, I plan to use that information to allocate the required memory for each vector upfront. This should help avoid copying every time we double a vector, and should also improve memory utilization.
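As a rough sketch of the idea (names and the power-of-two rounding are my assumptions for illustration, not Drill's actual API), the upfront allocation for each vector would be the estimated row count times that column's average value width, rounded up to the allocator's granularity:

```java
// Hypothetical sketch: size a vector's initial allocation from an estimated
// row count and the observed average value width for that column, so the
// vector does not have to grow (and copy) through repeated doubling.
public class BatchSizeEstimate {

    // Round up to the next power of two, since buffer allocators typically
    // hand out power-of-two sized buffers anyway.
    static int nextPowerOfTwo(int v) {
        int n = Integer.highestOneBit(Math.max(v, 1));
        return n == v ? n : n << 1;
    }

    /** Bytes to pre-allocate for one column's value vector. */
    static int allocSize(int estimatedRowCount, int avgColumnWidth) {
        return nextPowerOfTwo(estimatedRowCount * avgColumnWidth);
    }

    public static void main(String[] args) {
        // e.g. 4096 rows at an average of 10 bytes per value:
        // 40960 bytes, rounded up to a 64 KiB buffer.
        System.out.println(allocSize(4096, 10)); // 65536
    }
}
```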
Thanks
Padma

> On Feb 11, 2018, at 3:44 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> One more thought:
>
>>> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf operators that can observe an incoming batch to estimate row width. Because Drill batches are large, the law of large numbers means that the average of a large input batch is likely to be a good estimator for the average size of a large output batch.
>
> Note that this works only because non-leaf operators have an input batch to sample. Leaf operators (readers) do not have this luxury. Hence the result set loader uses the actual accumulated size for the current batch.
>
> Also note that the average row method, while handy, is not optimal. It will, in general, result in greater internal fragmentation than the result set loader. Why? The result set loader packs vectors right up to the point where the largest would overflow. The average row method works at the aggregate level and will likely result in wasted space (internal fragmentation) in the largest vector. Said another way, with the average row size method, we can usually pack in a few more rows before the batch actually fills, and so we end up with batches with lower "density" than the optimal. This is important when the consuming operator is a buffering one such as sort.
>
> The key reason Padma is using the quick & dirty average row size method is not that it is ideal (it is not), but rather that it is, in fact, quick.
>
> We do want to move to the result set loader over time so we get improved memory utilization. And, it is the only way to control row size in readers such as CSV or JSON in which we have no size information until we read the data.
>
> - Paul
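To make the internal-fragmentation point above concrete, here is a toy calculation (made-up numbers, not Drill code): the average row method picks a target row count from the aggregate memory budget, but the widest column's vector is allocated in power-of-two sizes, so it ends up with slack that per-vector packing (the result set loader's approach) would fill with additional rows:

```java
// Toy illustration of why the average-row method yields lower batch
// "density" than packing until the largest vector would overflow.
public class DensityExample {

    // Round up to the next power of two, mimicking a typical buffer allocator.
    static int nextPowerOfTwo(int v) {
        int n = Integer.highestOneBit(Math.max(v, 1));
        return n == v ? n : n << 1;
    }

    /** Rows the widest column's vector can actually hold once allocated. */
    static int packableRows(int targetRows, int widestColWidth) {
        int allocated = nextPowerOfTwo(targetRows * widestColWidth);
        return allocated / widestColWidth;
    }

    public static void main(String[] args) {
        int targetRows = 4096;       // chosen from budget / average row width
        int widestColWidth = 20;     // average width of the widest column

        // 4096 * 20 = 81920 bytes, rounded up to a 128 KiB allocation,
        // which could actually hold 6553 rows of that column.
        int fits = packableRows(targetRows, widestColWidth);
        System.out.println(fits);                 // 6553
        System.out.println(fits - targetRows);    // 2457 rows of slack
    }
}
```

The unused 2457-row capacity in the widest vector is the wasted space Paul describes; a buffering consumer such as sort then holds more batches than strictly necessary for the same data.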