[
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868627#comment-15868627
]
Paul Rogers edited comment on DRILL-5266 at 2/15/17 9:56 PM:
-------------------------------------------------------------
The logic for determining field widths is confusing.
{code}
public int next() {
  ...
  if (allFieldsFixedLength) {
    ...
  } else { // variable length columns
    long fixedRecordsToRead = varLengthReader.readFields(recordsToRead, firstColumnStatus); // Read var
    readAllFixedFields(fixedRecordsToRead); // Read fixed
  }
{code}
The above claims that we call one method to read the variable-length fields, then
another to read the fixed-length fields. Fine, presumably we pack in the
variable-length fields, figure out how many records that yields, then read the
fixed-length data to match. Makes sense. But then:
{code}
public class VarLenBinaryReader {

  public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus) throws IOException {
    ...
    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
    ...
  }

  private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
    ...
    // check that the next record will fit in the batch
    if (exitLengthDeterminingLoop ||
        (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields()
            + totalVariableLengthData + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
{code}
That is, the *variable* length reader makes its decision about when to stop
based, in part, on the width of the *fixed* length fields. This contradicts the
earlier code, rendering the entire operation incoherent.
Given that the variable length width is not returned (see above), the
calculation reduces to dividing the batch size by the fixed-length record width.
This can all be refactored to be simpler.
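For illustration only, here is roughly what the size-determining loop works out to under that assumption. This is a sketch, not a patch; it reuses the names from the snippets above ({{parentReader.getBatchSize()}}, {{parentReader.getBitWidthAllFixedFields()}}, {{recordsToReadInThisPass}}) and assumes the variable-length totals never actually feed back into the limit check:
{code}
// Sketch only: if the variable-length contribution does not influence the decision,
// the check above degenerates to "how many fixed-width records fit in the batch",
// capped by the requested record count.
long fixedWidthPerRecord = parentReader.getBitWidthAllFixedFields();
long recordsThatFit = (fixedWidthPerRecord == 0)
    ? recordsToReadInThisPass
    : parentReader.getBatchSize() / fixedWidthPerRecord;
long recordsReadInCurrentPass = Math.min(recordsToReadInThisPass, recordsThatFit);
{code}
If that is indeed all the loop accomplishes, the per-record length-determining pass adds cost without adding information, which is why a simpler refactoring seems possible.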
> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% of wasted space, using only 5% of available
> memory to hold actual query data. The result is poor performance of the sort
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}