Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/1060
This PR is a tough one. We generally like to avoid design discussions in
PRs, but I'm going to violate that rule.
This PR is based on the premise that each vector has all the information
needed to pull values from a `VLBulkEntry`. The bulk entry reads a given number
of values. Presumably this is faster than having the source set each value
one by one.
A separate project is confronting a limitation of Drill: that readers can
create batches of very large sizes. Why? Readers choose to read x records.
Records can be as small as a few bytes or as large as many MB. (Large records
are common in document data sources.) Suppose we decide to read 4K records (a
common size) and the records are 1 MB each. The batch needs 4 GB of data. But,
the "managed" sort and hash agg operators often get only about 40 MB. Since 4
GB batch > 40 MB working memory, queries die.
If we knew that each record was y bytes, we could compute the record count,
x = target batch size / record size. But, for most readers (including Parquet),
Drill cannot know the record size until after the record is read. Hence, the
entire premise of setting a fixed read size is unworkable. In short, unlike a
classic relational DB, Drill cannot predict record sizes; it can only react
to the size of records as they are read.
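To make the arithmetic concrete with the numbers above (purely illustrative;
the helper below is hypothetical):

```java
// Purely illustrative: the row-count computation a reader could do
// *if* it knew the record size before reading.
static long rowsToRead(long targetBatchBytes, long recordBytes) {
  // x = target batch size / record size
  return targetBatchBytes / recordBytes;
}

// E.g. rowsToRead(40L << 20, 1L << 20) == 40 rows, not the usual 4K.
// The catch: the record size is not known until after each record is read,
// so this computation cannot be done up front.
```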
Suppose the design were modified so that vector V read either up to x
values or until it reaches some maximum vector size. The problem then becomes
that, if the record has two columns, V and W, they cannot coordinate.
(Specifically, the inner loop in vector V should not have visibility into the
state of vector W.) V may read, say, 16K values of 1K each and become full.
(For reasons discussed elsewhere, 16 MB is a convenient maximum vector size.)
But if column W is 4X wider, at 4K bytes per value, it must read the same 16K
records already read by V, but will require 64 MB, exceeding its own size
target by 4X.
Suppose the row has 1000 columns. (Unusual, but not unheard of.) Then, we
have 1000 bulk readers all needing to coordinate. Maybe each one stays within
its own size limit, growing to only 8 MB. But, the aggregate total of 1000
columns at 8 MB each is 8 GB, which exceeds the overall batch size target. (The
actual target will be set by the planner, but low 10s of MB is a likely limit.)
To work around this, the bulk load mechanism must know the sizes of all
columns it will read in a batch and compute a row count that causes the
aggregate sum of values to be below both the per-vector and overall batch size
targets. As discussed, since Drill cannot predict the future, and Parquet
provides no per-row size metadata, this approach cannot work in the general
case (though, with good guesses, it can work for some percentage of "typical"
queries -- the approach Drill currently uses).
Maybe the solution can read in small bursts, say 100 rows, sensing whether
any vector has grown too large before reading another 100 values into each
vector. Still, with any burst count larger than 1, it is easy to get
back into the situation of an oversize batch if row sizes are not uniform.
Since Parquet makes no guarantees that row sizes are uniform, this is a
stochastic game. What are the chances that the next 100 (say) rows behave like
the last 100? Any deviation leads to OOM errors and query failures. What
percentage of the time are we willing to let queries fail with OOM errors in
exchange for going fast the rest of the time? A 10% failure rate for a 100%
speed improvement? Is there
any correct failure rate other than 0%? This is a hard question. In fact, this
is the very problem that the "batch size" project was asked to solve.
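To show where the stochastic risk lives, a sketch of the burst idea (all of
the names below are invented for illustration):

```java
// Hypothetical interfaces, only to sketch the burst approach.
interface BulkReader { boolean done(); void readBulk(int rows); }
interface BatchSizer { long totalBytes(); }

static void readWithBursts(BulkReader reader, BatchSizer batch, long batchLimitBytes) {
  final int BURST_ROWS = 100;
  // Sizes are checked only *between* bursts, so a burst of unexpectedly wide
  // rows can overshoot the limit before anyone notices.
  while (!reader.done() && batch.totalBytes() < batchLimitBytes) {
    reader.readBulk(BURST_ROWS);  // fill every vector with up to 100 more values
    // If this burst held a few multi-MB values, the batch is already
    // oversized; there is no way to "un-read" them.
  }
}
```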
We are left with two choices. Either 1) exempt Parquet from row size
limits, or 2) devise a new solution that allows size limits to be enforced, if
only stochastically. Since we are entertaining this PR, clearly we've decided
*not* to address size issues in Parquet. But Parquet is our primary input
format, so that amounts to not addressing size issues for ~90% of queries.
Yet, since we do in fact have a parallel project to limit batch sizes,
perhaps we have not exempted Parquet after all. This is quite the dilemma!
For background, the "batch size" project, to be used for other readers,
solves the problem by monitoring total batch size row by row. When a row causes
the batch (or any individual vector) to exceed its size limit, a "rollover"
occurs to start a new batch, and the full batch is sent downstream. Much of the
complexity comes from the fact that overflow can occur in the middle of a row, perhaps
inside a deeply nested structure. The mechanism makes overflow transparent to
the reader; the reader simply asks if it can write a row, writes it, then loops
to check again.
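In rough sketch form (the names below are illustrative, not the actual API),
the reader-side loop looks something like this:

```java
// Illustrative names only; not the real API of the "batch size" mechanism.
interface RowWriter { boolean isFull(); void startRow(); void saveRow(); }
interface RowSource { boolean hasNext(); void copyNextRowInto(RowWriter w); }

static void readRows(RowSource reader, RowWriter rowWriter) {
  // The reader simply asks if it can write a row, writes it, and loops.
  // Overflow/rollover happens inside the writer, invisibly to the reader.
  while (reader.hasNext() && !rowWriter.isFull()) {
    rowWriter.startRow();
    reader.copyNextRowInto(rowWriter);  // may trigger transparent rollover mid-row
    rowWriter.saveRow();
  }
  // Once isFull() reports true, the completed batch is sent downstream and
  // any overflow row becomes the first row of the next batch.
}
```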
It is understood that Parquet is columnar, and the row-by-row approach is
perhaps inefficient. So, we need a columnar solution to the batch size problem.
Perhaps this project can propose one.
The above description points out another concern. The PR says that this
mechanism is for flat rows. Yet we also have a project to support nested
tables (the so-called "complex types"). Is this PR consistent with that
project, or will we need a new, complex type reader that uses, say, the "batch
size" mechanism, while the "flat" reader uses this mechanism? Perhaps this
project can propose a solution to that issue also.
Would appreciate direction from the PMC and commercial management about
their intent and how these three now-conflicting projects are to be reconciled.
Once that info is available, we can continue to determine if the code offered
here realizes the goals set by project leadership.
---