With the average row size method, since I know the number of rows and the average size of each column, I plan to use that information to allocate the required memory for each vector upfront. This should help avoid copying every time we double a vector, and should also improve memory utilization.
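As a rough sketch of the idea (names and the power-of-two rounding are my assumptions for illustration, not Drill's actual API), the upfront allocation for each vector would be the estimated row count times that column's average value width, rounded up to the allocator's granularity:

```java
// Hypothetical sketch: size a vector's initial allocation from an estimated
// row count and the observed average value width for that column, so the
// vector does not have to grow (and copy) through repeated doubling.
public class BatchSizeEstimate {

    // Round up to the next power of two, since buffer allocators typically
    // hand out power-of-two sized buffers anyway.
    static int nextPowerOfTwo(int v) {
        int n = Integer.highestOneBit(Math.max(v, 1));
        return n == v ? n : n << 1;
    }

    /** Bytes to pre-allocate for one column's value vector. */
    static int allocSize(int estimatedRowCount, int avgColumnWidth) {
        return nextPowerOfTwo(estimatedRowCount * avgColumnWidth);
    }

    public static void main(String[] args) {
        // e.g. 4096 rows at an average of 10 bytes per value:
        // 40960 bytes, rounded up to a 64 KiB buffer.
        System.out.println(allocSize(4096, 10)); // 65536
    }
}
```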
Thanks
Padma

> On Feb 11, 2018, at 3:44 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> One more thought:
>
>>> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf operators that can observe an incoming batch to estimate row width. Because Drill batches are large, the law of large numbers means that the average of a large input batch is likely to be a good estimator for the average size of a large output batch.
>
> Note that this works only because non-leaf operators have an input batch to sample. Leaf operators (readers) do not have this luxury. Hence the result set loader uses the actual accumulated size for the current batch.
>
> Also note that the average row method, while handy, is not optimal. It will, in general, result in greater internal fragmentation than the result set loader. Why? The result set loader packs vectors right up to the point where the largest would overflow. The average row method works at the aggregate level and will likely result in wasted space (internal fragmentation) in the largest vector. Said another way, with the average row size method, we can usually pack in a few more rows before the batch actually fills, and so we end up with batches with lower "density" than the optimal. This is important when the consuming operator is a buffering one such as sort.
>
> The key reason Padma is using the quick & dirty average row size method is not that it is ideal (it is not), but rather that it is, in fact, quick.
>
> We do want to move to the result set loader over time so we get improved memory utilization. And, it is the only way to control row size in readers such as CSV or JSON in which we have no size information until we read the data.
>
> - Paul
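To make the internal-fragmentation point above concrete, here is a toy calculation (made-up numbers, not Drill code): the average row method picks a target row count from the aggregate memory budget, but the widest column's vector is allocated in power-of-two sizes, so it ends up with slack that per-vector packing (the result set loader's approach) would fill with additional rows:

```java
// Toy illustration of why the average-row method yields lower batch
// "density" than packing until the largest vector would overflow.
public class DensityExample {

    // Round up to the next power of two, mimicking a typical buffer allocator.
    static int nextPowerOfTwo(int v) {
        int n = Integer.highestOneBit(Math.max(v, 1));
        return n == v ? n : n << 1;
    }

    /** Rows the widest column's vector can actually hold once allocated. */
    static int packableRows(int targetRows, int widestColWidth) {
        int allocated = nextPowerOfTwo(targetRows * widestColWidth);
        return allocated / widestColWidth;
    }

    public static void main(String[] args) {
        int targetRows = 4096;       // chosen from budget / average row width
        int widestColWidth = 20;     // average width of the widest column

        // 4096 * 20 = 81920 bytes, rounded up to a 128 KiB allocation,
        // which could actually hold 6553 rows of that column.
        int fits = packableRows(targetRows, widestColWidth);
        System.out.println(fits);                 // 6553
        System.out.println(fits - targetRows);    // 2457 rows of slack
    }
}
```

The unused 2457-row capacity in the widest vector is the wasted space Paul describes; a buffering consumer such as sort then holds more batches than strictly necessary for the same data.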