[
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868927#comment-15868927
]
Paul Rogers commented on DRILL-5266:
------------------------------------
More lies in the code:
{code}
public boolean determineSize(long recordsReadInCurrentPass, Integer lengthVarFieldsInCurrentRecord) throws IOException {
  ...
  return checkVectorCapacityReached();
}

protected boolean checkVectorCapacityReached() {
  // Here "bits" means "bytes"
  if (bytesReadInCurrentPass + dataTypeLengthInBits > capacity()) {
    logger.debug("Reached the capacity of the data vector in a variable length value vector.");
    return true;
  }
  return valuesReadInCurrentPass > valueVec.getValueCapacity();
}
{code}
This seems to check if we have filled up the variable-length vector. That would
be fine, but we've already read the value and extended the vector (see above).
Further, the {{bytesReadInCurrentPass}} variable is never actually incremented;
it is always 0. Not only that, the code above already checked vector capacity;
if we exceeded vector capacity then we would not be here, so no need to check
again on the last line.
The {{bytesReadInCurrentPass}} variable is eventually updated, but only *after*
we do all the "check length" work. So, we update it when we don't need it, and
leave it at zero when we do need it.
In short, this entire function is a no-op.
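To make the effect concrete, here is a minimal sketch of the pattern, not the Drill code itself. The class {{CapacityCheckSketch}}, its methods, and the numbers are all hypothetical; only the shape of the comparison ({{bytesReadInCurrentPass + length > capacity}}) is taken from the snippet above. It assumes fixed-length values to keep the arithmetic simple.

```java
// Hypothetical stand-in for the reader state; this is NOT the Drill class.
public class CapacityCheckSketch {
    private final int capacity;          // bytes the data vector can hold (assumed)
    private int bytesReadInCurrentPass;  // running byte count used by the check

    public CapacityCheckSketch(int capacity) {
        this.capacity = capacity;
    }

    // Same shape as the comparison in checkVectorCapacityReached().
    public boolean wouldOverflow(int valueLength) {
        return bytesReadInCurrentPass + valueLength > capacity;
    }

    // Stores a value; the counter moves only if updateCounter is true.
    public void write(int valueLength, boolean updateCounter) {
        if (updateCounter) {
            bytesReadInCurrentPass += valueLength;
        }
    }

    // Counts how many values get through before the check fires,
    // giving up after the given number of attempts.
    public static int valuesAccepted(int capacity, int valueLength,
                                     int attempts, boolean updateCounter) {
        CapacityCheckSketch reader = new CapacityCheckSketch(capacity);
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (reader.wouldOverflow(valueLength)) {
                break;
            }
            reader.write(valueLength, updateCounter);
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Counter kept current: the check stops the loop once the vector is full.
        System.out.println(valuesAccepted(100, 10, 1000, true));
        // Counter left at zero: the check never fires within the attempts given.
        System.out.println(valuesAccepted(100, 10, 1000, false));
    }
}
```

With the counter left at zero, the comparison reduces to "is one value larger than the whole vector?", which is why the check never stops the read loop.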
> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% of wasted space, using only 5% of available
> memory to hold actual query data. The result is poor performance of the sort
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
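For reference, the per-column density figures in the output above are consistent with data size divided by vector size, expressed as a percentage and rounded up. For the fixed-width columns, the data size of 4516 is 1129 records times 4 bytes. The sketch below reproduces that arithmetic. The class name {{DensitySketch}} and the round-up rule are assumptions inferred from the printed numbers, not taken from the sort's actual code.

```java
// Hypothetical helper; the formula is inferred from the log output, not from Drill.
public class DensitySketch {
    // Percentage of the vector that holds real data, rounded up (assumed rule).
    public static int densityPercent(long dataSize, long vectorSize) {
        return (int) ((dataSize * 100 + vectorSize - 1) / vectorSize);
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: 1129 records * 4 bytes = 4516 bytes in a 131072-byte vector
        System.out.println(densityPercent(1129L * 4, 131072));
        // c_email_address: 30327 bytes of data in a 49152-byte vector
        System.out.println(densityPercent(30327, 49152));
    }
}
```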