[ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868835#comment-15868835 ]
Paul Rogers commented on DRILL-5266:
------------------------------------
More silly code:
{code}
public abstract class ColumnReader<V extends ValueVector> {
  ...
  // length of single data value in bits, if the length is fixed
  int dataTypeLengthInBits;
  ...
  protected ColumnReader(ParquetRecordReader parentReader, int allocateSize,
      ColumnDescriptor descriptor, ...
    ...
    if (columnDescriptor.getType() == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY) {
      dataTypeLengthInBits = columnDescriptor.getTypeLength() * 8;
    } else {
      dataTypeLengthInBits =
          ParquetRecordReader.getTypeLengthInBits(columnDescriptor.getType());
    }
    ...
  protected boolean checkVectorCapacityReached() {
    if (bytesReadInCurrentPass + dataTypeLengthInBits > capacity()) {
{code}
Note that the code adds a variable named "bytes" to one named "bits" and
compares the sum to a capacity in bytes. But, that might be OK, because the
variable named "bits" sometimes holds bytes (see the FIXED_LEN_BYTE_ARRAY
branch above). But at other times it holds bits (see the other branch above).
So, we have a variable that holds bits some of the time, bytes at other times,
and is compared to bytes all of the time...
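For contrast, here is a minimal sketch of the same logic carried through in a
single unit. The class and member names are hypothetical, not a patch against
Drill's actual reader:
{code}
// Hypothetical sketch, not Drill's actual class: keep the value width
// in bytes everywhere so the capacity check compares like units.
public abstract class FixedWidthReaderSketch {

  // Width of a single fixed-length value, always in bytes.
  protected final int dataTypeLengthInBytes;

  // Bytes written into the value vector so far in this pass.
  protected int bytesReadInCurrentPass;

  protected FixedWidthReaderSketch(int typeLengthInBits) {
    // Convert once, up front; everything downstream is bytes.
    this.dataTypeLengthInBytes = typeLengthInBits / 8;
  }

  // Capacity of the backing vector, in bytes.
  protected abstract int capacity();

  protected boolean checkVectorCapacityReached() {
    // Bytes plus bytes compared against bytes: the units now agree.
    return bytesReadInCurrentPass + dataTypeLengthInBytes > capacity();
  }
}
{code}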
> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When these batches
> are fed into the sort, 95% of the buffered space is wasted; only 5% of
> available memory holds actual query data. The result is poor sort
> performance, as the sort must spill far more often than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}
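As a sanity check on the report above: the density figure appears to be the
data size as a percentage of the allocated vector size, rounded up. That
reading is inferred from the numbers themselves, not from Drill's source:
{code}
public class DensityCheck {
  public static void main(String[] args) {
    // Figures from the c_email_address line in the report above.
    int dataSize = 30327;    // bytes of actual data in the vector
    int vectorSize = 49152;  // bytes allocated to the vector
    // Prints 62, matching the reported density. The cs_* columns
    // (4516 / 131072, about 3.4%) likewise round up to their reported 4.
    System.out.println((long) Math.ceil(100.0 * dataSize / vectorSize));
  }
}
{code}
At the overall density of 5, roughly 32 MB of buffered batch holds only about
1.6 MB of data, which is why the sort spills so much more often than expected.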