Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/1060
This PR is a tough one. We generally like to avoid design discussions in
PRs, but I'm going to violate that rule.
This PR is based on the premise that each vector has all the information
needed to pull values from a `VLBulkEntry`. The bulk entry reads a given number
of values. Presumably this is faster than having the source set each value
one by one.
A separate project is confronting a limitation of Drill: that readers can
create batches of very large sizes. Why? Readers choose to read x records.
Records can be as small as a few bytes or as large as many MB. (Large records
are common in document data sources.) Suppose we decide to read 4K records (a
common size) and the records are 1 MB each. The batch needs 4 GB of data. But,
the "managed" sort and hash agg operators often get only about 40 MB. Since 4
GB batch > 40 MB working memory, queries die.
If we knew that each record was y bytes, we could compute the record count,
x = target batch size / record size. But, for most readers (including Parquet),
Drill cannot know the record size until after the record is read. Hence, the
entire premise of setting a fixed read size is unworkable. In short, unlike a
classic relational DB, Drill cannot predict record sizes; it can only react
to the size of records as they are read.
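To make the arithmetic concrete with the numbers above (purely illustrative;
the helper below is hypothetical):

```java
// Purely illustrative: the row-count computation a reader could do
// *if* it knew the record size before reading.
static long rowsToRead(long targetBatchBytes, long recordBytes) {
  // x = target batch size / record size
  return targetBatchBytes / recordBytes;
}

// E.g. rowsToRead(40L << 20, 1L << 20) == 40 rows, not the usual 4K.
// The catch: the record size is not known until after each record is read,
// so this computation cannot be done up front.
```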
Suppose the design were modified so that vector V read either up to x
values or until it reaches some maximum vector size. The problem then becomes
that, if the record has two columns, V and W, they cannot coordinate.
(Specifically, the inner loop in vector V should not have visibility into the
state of vector W.) V may read, say, 16K values of 1K each and become full.
(For reasons discussed elsewhere, 16 MB is a convenient maximum vector size.)
But if column W is 4X wider, at 4K bytes per value, it must read the same 16K
records already read by V, but will require 64 MB, exceeding its own size
target by 4X.
Suppose the row has 1000 columns. (Unusual, but not unheard of.) Then, we
have 1000 bulk readers all needing to coordinate. Maybe each one stays within
its own size limit, growing to only 8 MB. But, the aggregate total of 1000
columns at 8 MB each is 8 GB, which exceeds the overall batch size target. (The
actual target will be set by the planner, but low 10s of MB is a likely limit.)
To work around this, the bulk load mechanism must know the sizes of all
columns it will read in a batch and compute a row count that causes the
aggregate sum of values to be below both the per-vector and overall batch size
targets. As discussed, since Drill cannot predict the future, and Parquet
provides no per-row size metadata, this approach cannot work in the general
case (though, with good guesses, it can work for some percentage of "typical"
queries -- the approach Drill currently uses).
Maybe the solution can read in small bursts, say 100 rows, sensing whether
any vector has grown too large before reading another 100 values into each
vector. Still, with any burst count larger than 1, it is easy to get
back into the situation of an oversize batch if row sizes are not uniform.
Since Parquet makes no guarantees that row sizes are uniform, this is a
stochastic game. What are the chances that the next 100 (say) rows behave like
the last 100? Any deviation leads to OOM errors and query failures. What
percentage of the time are we willing to let queries fail with OOM errors in
exchange for going fast the rest of the time? A 10% failure rate for a 100%
speed improvement? Is there
any correct failure rate other than 0%? This is a hard question. In fact, this
is the very problem that the "batch size" project was asked to solve.
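To show where the stochastic risk lives, a sketch of the burst idea (all of
the names below are invented for illustration):

```java
// Hypothetical interfaces, only to sketch the burst approach.
interface BulkReader { boolean done(); void readBulk(int rows); }
interface BatchSizer { long totalBytes(); }

static void readWithBursts(BulkReader reader, BatchSizer batch, long batchLimitBytes) {
  final int BURST_ROWS = 100;
  // Sizes are checked only *between* bursts, so a burst of unexpectedly wide
  // rows can overshoot the limit before anyone notices.
  while (!reader.done() && batch.totalBytes() < batchLimitBytes) {
    reader.readBulk(BURST_ROWS);  // fill every vector with up to 100 more values
    // If this burst held a few multi-MB values, the batch is already
    // oversized; there is no way to "un-read" them.
  }
}
```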
We are left with two choices. Either 1) exempt Parquet from row size
limits, or 2) devise a new solution that allows size limits to be enforced, if
only stochastically. Since we are entertaining this PR, clearly we've decided
*not* to address size issues in Parquet. But Parquet is our primary input
format, so that amounts to not addressing size issues for ~90% of queries.
Yet, since we do in fact have a parallel project to limit batch sizes,
perhaps we have not exempted Parquet after all. This is quite the dilemma!
For background, the "batch size" project, to be used for other readers,
solves the problem by monitoring total batch size row by row. When a row causes
the batch (or any individual vector) to exceed its size limit, a "rollover"
occurs to start a new batch, and the full batch is sent downstream. Much of the
complexity comes from the fact that overflow can occur in the middle of a row, perhaps
inside a deeply nested structure. The mechanism makes overflow transparent to
the reader; the reader simply asks if it can write a row, writes it, then loops
to check again.
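In rough sketch form (the names below are illustrative, not the actual API),
the reader-side loop looks something like this:

```java
// Illustrative names only; not the real API of the "batch size" mechanism.
interface RowWriter { boolean isFull(); void startRow(); void saveRow(); }
interface RowSource { boolean hasNext(); void copyNextRowInto(RowWriter w); }

static void readRows(RowSource reader, RowWriter rowWriter) {
  // The reader simply asks if it can write a row, writes it, and loops.
  // Overflow/rollover happens inside the writer, invisibly to the reader.
  while (reader.hasNext() && !rowWriter.isFull()) {
    rowWriter.startRow();
    reader.copyNextRowInto(rowWriter);  // may trigger transparent rollover mid-row
    rowWriter.saveRow();
  }
  // Once isFull() reports true, the completed batch is sent downstream and
  // any overflow row becomes the first row of the next batch.
}
```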
It is understood that Parquet is columnar, and the row-by-row approach is
perhaps inefficient. So, we need a columnar solution to the batch size problem.
Perhaps this project can propose one.
The above description points out another concern. The PR says that this
mechanism is for flat rows. Yet we also have a project to support nested
tables (the so-called "complex types"). Is this PR consistent with that
project, or will we need a new, complex type reader that uses, say, the "batch
size" mechanism, while the "flat" reader uses this mechanism? Perhaps this
project can propose a solution to that issue also.
Would appreciate direction from the PMC and commercial management about
their intent and how these three now-conflicting projects are to be reconciled.
Once that info is available, we can continue to determine if the code offered
here realizes the goals set by project leadership.
---