[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360138#comment-16360138 ]

Paul Rogers commented on DRILL-6147:
------------------------------------

We invested a large effort in the batch size framework specifically designed for 
readers. In Parquet, we are focused on optimizing CPU behavior, which has 
driven us down the bulk-load route. Rather than figuring out how to combine the 
two, we seem to be discussing a fork.

The result set loader handles the batch size limit at the level of the record 
and batch. It is pretty easy to handle overflow for a single fixed-width 
vector. A bit more complex to handle VarChar columns with their offset vectors. Even 
harder for arrays, or arrays of VarChar. We start pulling our hair out as we 
add arrays of maps that contain other arrays of maps that contain unions with a 
map that has an array of VarChar. The gory details are 
[here|https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades].
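
To make the fixed-width vs. VarChar contrast concrete, here is a minimal sketch 
in plain Java arrays (the names and layout are illustrative, not Drill's 
ValueVector API): splitting a fixed-width vector at row n is a single block 
copy, but splitting a VarChar vector also means rebasing the copied offsets so 
they start at zero in the "look-ahead" buffer.

{code:java}
import java.util.Arrays;

public class VarCharOverflowSketch {

  /** Simplified VarChar vector: data bytes plus offsets[0..valueCount]. */
  static class VarCharBuf {
    byte[] data;
    int[] offsets;     // length == valueCount + 1
    int valueCount;
  }

  /** Keep rows [0, keep) in 'full'; return rows [keep, valueCount) as overflow. */
  static VarCharBuf splitOverflow(VarCharBuf full, int keep) {
    VarCharBuf overflow = new VarCharBuf();
    int overflowRows = full.valueCount - keep;
    int dataStart = full.offsets[keep];            // first overflow byte
    int dataEnd = full.offsets[full.valueCount];   // one past the last byte

    overflow.valueCount = overflowRows;
    overflow.data = Arrays.copyOfRange(full.data, dataStart, dataEnd);
    overflow.offsets = new int[overflowRows + 1];
    for (int i = 0; i <= overflowRows; i++) {
      // Rebase: offsets in the look-ahead copy are relative to its own buffer.
      overflow.offsets[i] = full.offsets[keep + i] - dataStart;
    }

    full.valueCount = keep;   // logically truncate the batch we ship downstream
    return overflow;
  }

  public static void main(String[] args) {
    VarCharBuf v = new VarCharBuf();
    v.data = "abXYZ".getBytes();
    v.offsets = new int[] {0, 2, 3, 5};    // values "ab", "X", "YZ"
    v.valueCount = 3;
    VarCharBuf over = splitOverflow(v, 2); // keep "ab", "X"; overflow holds "YZ"
    System.out.println(new String(over.data) + " " + Arrays.toString(over.offsets));
  }
}
{code}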

Now, we might argue that, if we care about performance, users should not use 
anything other than flat rows with simple types. Good argument. Yet, we've got 
Aman defining implicit join logic precisely because there are users who do, in 
fact, have complex nested Parquet structures. The lesson is that Parquet must 
not only be fast, it must handle complex structures. This is, in fact, one of 
the problems that the result set loader was designed to handle. Tests show that 
the code does, in fact, work for these cases (though it is not yet integrated 
with Parquet; that was supposed to be the next step).

The details of the result set loader are spelled out in the link above. 
Perhaps, as Parth suggests, we can do the same, at the same level of detail, 
for the bulk load alternative so we can compare the details of both approaches.

I would encourage us to think about how bulk load works not just for simple 
flat rows, but also for complex structures, for columns that have, say, 1 MB 
values, and so on. The solution must work for all data files, no matter how 
pathological, not just "well behaved" files. The result set loader handles 
these cases. How will the bulk-load alternative do so?

Here's an example of the level of detail we need.

Assume three columns. On the first batch, we read some amount of bulk data for 
each vector. We don't know the column width yet, so we read, say, x values into 
the first vector, y into the second, and z into the third. Once we have the 
data, we can do some math, guess an average row width, and compute a batch 
size n, where n <= x, n <= y, and n <= z. Then we have to create output vectors 
by either copying the first n values to new vectors, or copying the remaining 
(x - n), (y - n), and (z - n) values to "look-ahead" vectors.

The result set loader uses the second approach. But, the result set loader 
makes the simplifying assumption that all vectors contain the same number of 
rows. The proposal here requires that each vector have different row counts. 
(Why? Each vector was bulk-loaded independently and so did not coordinate on 
the number of values to load.) It was hard enough to get the common-row-count 
case to work; the bulk-load case will be far harder.

Why do we have to copy? Because we can't just point to the values we want in 
the original vectors. Why not? Consider the second batch.

On the second batch, we already have the overflow values in our vectors, ready to 
add more values. If the original vectors were just slices, then on the second 
batch, we'd create a larger vector that has first batch values, overflow values 
and new values. (That's what it means to have a single block of data with no 
copying.) We'd now end up growing that vector indefinitely. Not good. So, we 
copied the overflow values.
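
A toy illustration of that slice-versus-copy choice, with plain int arrays 
standing in for vectors (purely illustrative):

{code:java}
import java.util.Arrays;

public class SliceVsCopySketch {

  public static void main(String[] args) {
    int[] block = {1, 2, 3, 4, 5};   // batch 1: rows 1-3 shipped, rows 4-5 are overflow

    // Slice approach: batch 2 must extend the same backing block to hold
    // batch-1 values + overflow + new values, so the block grows every batch.
    int[] grown = Arrays.copyOf(block, block.length + 3);
    grown[5] = 6; grown[6] = 7; grown[7] = 8;
    System.out.println("shared block now holds " + grown.length + " values");

    // Copy approach: move only the overflow into a fresh look-ahead buffer;
    // 'block' (batch 1) can be sent downstream and released independently.
    int[] lookAhead = Arrays.copyOfRange(block, 3, 5);   // {4, 5}
    int[] batch2 = Arrays.copyOf(lookAhead, 5);          // normal-sized batch 2
    batch2[2] = 6; batch2[3] = 7; batch2[4] = 8;
    System.out.println("batch 2 holds " + batch2.length + " values independently");
  }
}
{code}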

But, how many new values do we add? The limit is set by the aggregate batch 
size across all columns of all rows. But we can only guess, based on the average 
column width seen so far and the number of "look-ahead" values. This guess could 
be quite wrong for data with a wide size distribution: too few or too many 
values. If too few, we won't fill the available vectors, we'll increase memory 
fragmentation, and we'll slow the sort operator (which must buffer many rows). 
So, perhaps we repeat the bulk-load cycle to "top up" our vectors. (The result 
set loader is designed to minimize internal fragmentation to reduce spilling 
and increase overall system performance.)
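
A minimal sketch of that "top up" estimate, again with invented names and a 
simple running average (assumptions for illustration, not a spec):

{code:java}
public class TopUpEstimateSketch {

  /** Guess how many more values to bulk-load to fill the remaining byte budget. */
  static int valuesToTopUp(long remainingBudgetBytes,
                           long bytesSeen, int valuesSeen,
                           int lookAheadValues) {
    double avgWidth = (double) bytesSeen / valuesSeen;    // running average width
    // With a wide size distribution this guess can land high or low, which is
    // why the cycle may repeat: top up again if short, trim off if over.
    int estimate = (int) (remainingBudgetBytes / avgWidth) - lookAheadValues;
    return Math.max(estimate, 0);
  }

  public static void main(String[] args) {
    // 1 MB left in the budget, 50-byte average values, 2,000 look-ahead values on hand.
    System.out.println(valuesToTopUp(1_000_000, 500_000, 10_000, 2_000));  // ~18,000
  }
}
{code}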

Then, the whole cycle repeats to "trim off" excess values.

As it turns out, the result set loader does all of this: it works and has many 
tests. Parquet must recreate all of this (including the nasty nested maps). 
Why? Simply because we want to do special tricks to make the CPU go faster.

Further, when we consider the operation over many batches, with many columns, 
and with data with a wide variance, we see that we'll make many copies, make 
several "top up" passes and so on. I wonder, after adding all this extra 
complexity needed for a production-quality solution, will the result be much 
faster than the existing row-by-row result set loader solution?

Perhaps, then, we should first integrate the result set loader mechanism into 
Parquet. Make Parquet work for all types, including nested maps (which, as I 
understand it, the "fast" Parquet reader does not yet handle). Then, optimize 
the result so that the resulting improvements help all readers. If bulk load 
can be added, so much the better.

In summary, I confess I'm a bit confused as to why we need to create an 
entirely new mechanism simply because we want to play CPU speed-up tricks. 
Handling batch size, in the general case, is non-trivial; I worry that we may 
be missing this fact in the rush to speed up simple cases and will end up 
reinventing work that has already been done.

Then there is the equally complex topic of projection when maps and arrays 
are included; something that also turned out to be fiendishly complex to get 
right, but that we seem to be reinventing for Parquet.

Can we address how this will work in the design document?

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) 
> when creating scan batches; there is no parameter nor any logic for 
> controlling the amount of memory used. This enhancement will allow Drill to 
> take an extra input parameter to control direct memory usage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
