Hi Aman,
Please see my comment in DRILL-6147.
For the hangout to be productive, perhaps we should create test cases that will
show the benefit of DRILL-6147 relative to the result set loader.
The test case of interest has three parts:
* Multiple variable-width fields (say five) with a large variance in field
  widths in each field
* Large data set that will be split across multiple batches (say 10 or 50
  batches)
* Constraints on total batch size and size of the largest vector
Clearly, we can't try this out with Parquet: that's the topic we are discussing.
But, we can generate a data set in code, then do a unit test of the two methods
(just the vector loading bits) and time the result. Similar code already exists
in the result set loader branch that can be repurposed for this use. We'd want
to create a similar test for the DRILL-6147 mechanisms. We can work out the
details in a separate discussion.
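To make the test-case idea concrete, here is a minimal sketch (in Python rather than Drill's Java, purely for brevity) of the sort of data generator and timing harness described above. All names, field counts, and width bounds are invented for illustration; the real test would repurpose the generator in the result set loader branch.

```python
import random
import string
import time

def generate_batch(num_rows, num_fields=5, min_width=1, max_width=1000):
    """Generate one batch of rows with several variable-width string fields.

    The wide min/max spread models the "large variance in field widths"
    called for in the test case above.
    """
    rows = []
    for _ in range(num_rows):
        rows.append(tuple(
            ''.join(random.choices(string.ascii_letters,
                                   k=random.randint(min_width, max_width)))
            for _ in range(num_fields)))
    return rows

def time_loader(load_fn, batches):
    """Time a candidate vector-loading function over all batches."""
    start = time.perf_counter()
    for batch in batches:
        load_fn(batch)
    return time.perf_counter() - start
```

One would generate, say, 10 to 50 such batches, feed the same batches to both the result set loader path and the DRILL-6147 path (just the vector-loading bits), and compare the elapsed times.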
IMHO, if the results are the same, we should go with one solution. If
DRILL-6147 is significantly faster, the decision is clear: we have two
solutions.
We also should consider things such as selection, null columns, implicit
columns, and the other higher-level functionality provided by the result set
loader. Since Parquet already has ad-hoc solutions for these, with DRILL-6147
we'd simply keep those solutions for Parquet, while the other readers use the
new, unified mechanisms.
In terms of time, this week is busy:
* Wed. 3PM or later
* Fri. 3PM or later
The week of the 12th is much more open.
Thanks,
- Paul
On Sunday, March 4, 2018, 11:48:33 AM PST, Aman Sinha
<[email protected]> wrote:
Hi all, with reference to DRILL-6147
<https://issues.apache.org/jira/browse/DRILL-6147> given the overlapping
approaches, I feel like we should have a separate hangout session with
interested parties and discuss the details.
Let me know and I can setup one.
Aman
On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy <[email protected]>
wrote:
> If our goal is not to allocate more than 16MB for individual vectors to
> avoid external fragmentation, I guess
> we can take that also into consideration in our calculations to figure out
> the outgoing number of rows.
> The math might become more complex. But, the main point like you said is
> operators know what they are
> getting and can figure out how to deal with that to honor the constraints
> imposed.
>
> Thanks
> Padma
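[The row-count arithmetic Padma describes might be sketched as follows. This is a Python sketch of the math only, not Drill code; the 16 MB per-vector cap comes from this thread, and the function and parameter names are invented.]

```python
VECTOR_LIMIT = 16 * 1024 * 1024  # per-vector cap to avoid external fragmentation

def outgoing_row_count(batch_limit_bytes, avg_col_widths):
    """Pick an outgoing row count that honors both the total batch size
    and the per-vector size cap.

    avg_col_widths: average byte width observed for each column.
    """
    avg_row_width = sum(avg_col_widths)
    # Constraint 1: total batch size must stay under the limit.
    rows_by_batch = batch_limit_bytes // avg_row_width
    # Constraint 2: no single vector may exceed VECTOR_LIMIT.
    rows_by_vector = min(VECTOR_LIMIT // w for w in avg_col_widths if w > 0)
    return max(1, min(rows_by_batch, rows_by_vector))
```

[Whichever constraint binds first determines the row count, which is the sense in which the math "becomes more complex" but stays tractable.]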
>
>
> On Feb 12, 2018, at 8:25 AM, Paul Rogers <[email protected]> wrote:
>
> Agreed that allocating vectors up front is another good improvement.
> The average batch size approach gets us 80% of the way to the goal: it
> limits batch size and allows vector preallocation.
> What it cannot do is limit individual vector sizes. Nor can it ensure that
> the resulting batch is optimally loaded with data. Getting the remaining
> 20% requires the level of detail provided by the result set loader.
> We are driving to use the result set loader first in readers, since
> readers can't use the average batch size (they don't have an input batch to
> use to obtain sizes.)
> To use the result set loader in non-leaf operators, we'd need to modify
> code generation. AFAIK, that is not something anyone is working on, so
> another advantage of the average batch size method is that it works with
> the code generation we already have.
> Thanks,
> - Paul
>
>
>
> On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy
> <[email protected]> wrote:
>
> With average row size method, since I know number of rows and the average
> size for each column,
> I am planning to use that information to allocate required memory for each
> vector upfront.
> This should help avoid copying every time we double and also improve
> memory utilization.
>
> Thanks
> Padma
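[The upfront-allocation idea reads roughly as follows; this is a Python sketch of the arithmetic, not Drill's actual allocation API. The power-of-two rounding reflects the doubling growth Padma mentions, and `safety` is a hypothetical fudge factor.]

```python
def preallocate_size(row_count, avg_width, safety=1.0):
    """Bytes to preallocate for one variable-width vector, rounded up to a
    power of two.

    Because vectors grow by doubling, preallocating the final power-of-two
    size up front avoids every intermediate double-and-copy step.
    """
    needed = int(row_count * avg_width * safety)
    size = 1
    while size < needed:
        size <<= 1
    return size
```

[For example, 4096 rows at an average of 100 bytes needs 409,600 bytes, so the vector would be allocated at 512 KiB once, instead of doubling its way up from a small initial allocation.]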
>
>
> On Feb 11, 2018, at 3:44 PM, Paul Rogers <[email protected]> wrote:
>
> One more thought:
> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf
> operators that can observe an incoming batch to estimate row width. Because
> Drill batches are large, the law of large numbers means that the average of
> a large input batch is likely to be a good estimator for the average size
> of a large output batch.
> Note that this works only because non-leaf operators have an input batch
> to sample. Leaf operators (readers) do not have this luxury. Hence the
> result set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It
> will, in general, result in greater internal fragmentation than the result
> set loader. Why? The result set loader packs vectors right up to the point
> where the largest would overflow. The average row method works at the
> aggregate level and will likely result in wasted space (internal
> fragmentation) in the largest vector. Said another way, with the average
> row size method, we can usually pack in a few more rows before the batch
> actually fills, and so we end up with batches with lower "density" than the
> optimal. This is important when the consuming operator is a buffering one
> such as sort.
> The key reason Padma is using the quick & dirty average row size method is
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved
> memory utilization. And, it is the only way to control row size in readers
> such as CSV or JSON in which we have no size information until we read the
> data.
> - Paul
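[To put numbers on the density point above, a toy calculation; all figures are invented for illustration and `batch_density` is a hypothetical helper.]

```python
def batch_density(bytes_used, bytes_allocated):
    """Fraction of allocated vector memory that holds real data."""
    return bytes_used / bytes_allocated

# Suppose the largest vector is capped at 16 MiB. The average-row method
# stops at an estimated 40,000 rows averaging 300 bytes in that vector,
# while the result set loader keeps writing until the vector actually
# fills. The loader's batch therefore has strictly higher density.
cap = 16 * 1024 * 1024
avg_method_used = 40_000 * 300   # 12,000,000 bytes: ~72% of the cap
loader_used = cap                # packed right up to the limit
assert batch_density(avg_method_used, cap) < batch_density(loader_used, cap)
```

[That unused ~28% of the largest vector is the internal fragmentation Paul describes, and it is what a buffering consumer such as sort pays for.]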
>
>