[
https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360138#comment-16360138
]
Paul Rogers commented on DRILL-6147:
------------------------------------
We invested a large effort in the batch size framework specifically designed for
readers. In Parquet, we are focused on optimizing CPU behavior, which has
driven us down the bulk load route. Rather than figure out how to combine the
two, we seem to be discussing a fork.
The result set loader handles the batch size limit at the level of the record
and batch. It is pretty easy to handle overflow for a single fixed-width
vector. It is a bit more complex for VarChars, with their offset vectors. Even
harder for arrays, or arrays of VarChar. We start pulling our hair out as we
add arrays of maps that contain other arrays of maps that contain unions with a
map that has an array of VarChar. The gory details are
[here|https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades].
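To give a feel for why even the VarChar case is harder than the fixed-width one, here is a rough sketch, with plain Java arrays standing in for Drill's offset and data vectors (the class and field names are made up for illustration, not Drill APIs): moving the overflow rows means slicing the data buffer and rebasing the offsets so they start at zero in the look-ahead copy.
{code:java}
import java.util.Arrays;

// Hypothetical sketch of VarChar-style overflow. Values are variable width,
// addressed through an offsets array where value i occupies
// data[offsets[i] .. offsets[i+1]). Moving the overflow rows to a new
// vector means slicing the data bytes and rebasing the offsets.
class VarCharOverflow {
  final byte[] data;
  final int[] offsets; // length = valueCount + 1, offsets[0] == 0

  VarCharOverflow(byte[] srcData, int[] srcOffsets, int firstOverflowRow) {
    int valueCount = srcOffsets.length - 1;
    int start = srcOffsets[firstOverflowRow];
    int end = srcOffsets[valueCount];
    data = Arrays.copyOfRange(srcData, start, end);
    offsets = new int[valueCount - firstOverflowRow + 1];
    for (int i = 0; i < offsets.length; i++) {
      offsets[i] = srcOffsets[firstOverflowRow + i] - start; // rebase to zero
    }
  }
}
{code}
Arrays add a second level of offsets on top of this, and maps multiply it per member column, which is where the hair-pulling starts.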
Now, we might argue that, if we care about performance, users should not use
anything other than flat rows with simple types. Good argument. Yet, we've got
Aman defining implicit join logic precisely because there are users who do, in
fact, have complex nested Parquet structures. The lesson is that Parquet must
not only be fast, it must handle complex structures. This is, in fact, one of
the problems that the result set loader was designed to handle. Tests show that
the code does, in fact, work for these cases (though it is not yet integrated
with Parquet - that was supposed to be the next step.)
The details of the result set loader are spelled out in the link above.
Perhaps, as Parth suggests, we can do the same, at the same level of detail,
for the bulk load alternative so we can compare the details of both approaches.
I would encourage us to think about how bulk load works not just for simple
flat rows, but also for complex structures and for columns that have, say, 1 MB
values, and so on. The solution must work for all data files, no matter how
pathological, not just "well behaved" files. The result set loader handles
these cases. How will the bulk-load alternative do so?
Here's an example of the level of detail we need.
Assume three columns. On the first batch, we read some amount of bulk data for
each vector. We don't know the column width yet, so we read, say, x values into
the first vector, y into the second, and z into the third. Once we have the
data, we can do some math, estimate an average row width, and compute a batch
size n, where n <= x, n <= y, and n <= z. So, we have to create output vectors by
either copying the first n values to new vectors, or copying the remaining
(x - n), (y - n), and (z - n) values to "look ahead" vectors.
The result set loader uses the second approach. But, the result set loader
makes the simplifying assumption that all vectors contain the same number of
rows. The proposal here must instead handle vectors with different row counts.
(Why? Each vector was bulk-loaded independently and so did not coordinate on
the number of values to load.) It was hard enough to get the common-row-count
case to work; the bulk-load case will be far harder.
Why do we have to copy? Because we can't just point to the values we want in
the original vectors. Why not? Consider the second batch.
On the second batch, we already have the overflow values in our vectors, ready to
add more values. If the original vectors were just slices, then on the second
batch, we'd create a larger vector that has first batch values, overflow values
and new values. (That's what it means to have a single block of data with no
copying.) We'd now end up growing that vector indefinitely. Not good. So, we
copied the overflow values.
But, how many new values do we add? The limit is set by the aggregate batch
size across all columns of all rows. We can only guess, based on the average column
width seen so far and the number of "lookahead" values. This could be quite
wrong for data with a large size distribution: too few or too many values. If
too few, we won't fill the available vectors, we'll increase memory
fragmentation, and will slow the sort operator (which must buffer many rows.)
So, perhaps we repeat the bulk-load cycle to "top up" our vectors. (The result
set loader is designed to minimize internal fragmentation to reduce spilling
and increase overall system performance.)
And if we guess too many, the whole overflow-copy cycle repeats to "trim off" the
excess values.
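To make the guessing concrete, here is a rough sketch of the kind of estimate involved; the names and the simple bytes-per-row average are assumptions for illustration, not the actual heuristic:
{code:java}
// Hypothetical sketch: estimate how many more rows can be "topped up" into
// the batch, given its memory budget and the average row width seen so far.
// With a wide size distribution the estimate can come in too low (unused
// space, fragmentation) or too high (forcing another "trim off" pass).
class TopUpEstimator {
  static int rowsToAdd(long batchBudgetBytes, long bytesUsedSoFar,
                       long totalBytesSeen, long totalRowsSeen) {
    if (totalRowsSeen == 0) {
      return 0; // nothing observed yet, so no basis for an estimate
    }
    long avgRowWidth = Math.max(1, totalBytesSeen / totalRowsSeen);
    long remainingBytes = Math.max(0, batchBudgetBytes - bytesUsedSoFar);
    return (int) Math.min(Integer.MAX_VALUE, remainingBytes / avgRowWidth);
  }
}
{code}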
As it turns out, the result set loader does all of this: it works and has many
tests. Parquet must recreate all of this (including the nasty nested maps).
Why? Simply because we want to do special tricks to make the CPU go faster.
Further, when we consider the operation over many batches, with many columns,
and with data with a wide variance, we see that we'll make many copies, make
several "top up" passes and so on. I wonder, after adding all this extra
complexity needed for a production-quality solution, will the result be much
faster than the existing row-by-row result set loader solution?
Perhaps, then, we should first integrate the result set loader mechanism into
Parquet. Make Parquet work for all types, including nested maps (which, as I
understand it, the "fast" Parquet reader does not yet handle). Then, optimize
the result set loader so that the resulting improvements help all readers. If bulk load
can be added, so much the better.
In summary, I confess I'm a bit confused as to why we need to create an
entirely new mechanism simply because we want to play CPU speed-up tricks.
Handling batch size, in the general case, is non-trivial; I worry that we may
be missing this fact in the rush to speed up simple cases and will end up
reinventing work that has already been done.
Then there is the equally thorny topic of projection when maps and arrays are
included; something that also turned out to be fiendishly complex to get right,
but that we seem to be reinventing for Parquet.
Can we address how this will work in the design document?
> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
> Key: DRILL-6147
> URL: https://issues.apache.org/jira/browse/DRILL-6147
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Major
> Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows)
> when creating scan batches; there is no parameter nor any logic for
> controlling the amount of memory used. This enhancement will allow Drill to
> take an extra input parameter to control direct memory usage.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)