[jira] [Commented] (DRILL-5266) Parquet Reader produces "low density" record batches

Paul Rogers (JIRA) Wed, 15 Feb 2017 17:20:07 -0800

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868927#comment-15868927 ]


Paul Rogers commented on DRILL-5266:
------------------------------------

More lies in the code:

{code}
  public boolean determineSize(long recordsReadInCurrentPass, Integer lengthVarFieldsInCurrentRecord) throws IOException {
    ...
    return checkVectorCapacityReached();
  }

  protected boolean checkVectorCapacityReached() {
    // Here "bits" means "bytes"
    if (bytesReadInCurrentPass + dataTypeLengthInBits > capacity()) {
      logger.debug("Reached the capacity of the data vector in a variable length value vector.");
      return true;
    }
    return valuesReadInCurrentPass > valueVec.getValueCapacity();
  }
{code}

This appears to check whether we have filled the variable-length vector. That 
would be fine, except that we have already read the value and extended the 
vector (see above). Further, the {{bytesReadInCurrentPass}} variable has not 
been incremented when this check runs; at this point it is always 0. Not only 
that, the code above has already checked vector capacity; had we exceeded it, 
we would not be here, so there is no need to check again on the last line.

The {{bytesReadInCurrentPass}} variable is eventually updated, but only *after* 
we do all the "check length" work. So, we update it when we don't need it, and 
leave it at zero when we do need it.

In short, this entire function is a no-op.
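
To illustrate why a byte counter that is only updated after the check makes the check ineffective, here is a minimal, self-contained sketch. It is a simplified model, not Drill's code: the class {{CapacityCheckSketch}}, the method {{firstCapacityHit}} and the sample lengths and capacity are all hypothetical, and it assumes no single value is larger than the capacity.

```java
// Simplified, hypothetical model of the byte-capacity check discussed above.
// It is not Drill's actual reader code; names and numbers are illustrative.
class CapacityCheckSketch {

    /**
     * Returns the index of the first value for which the capacity check
     * fires, or -1 if it never fires.
     *
     * When updateBeforeCheck is false, the byte counter is still 0 every
     * time the check runs, which models the ordering described above.
     */
    static int firstCapacityHit(int[] valueLengths, int capacityBytes,
                                boolean updateBeforeCheck) {
        long bytesReadInCurrentPass = 0;
        for (int i = 0; i < valueLengths.length; i++) {
            // Same shape as the quoted check: bytes so far + this value > capacity
            if (bytesReadInCurrentPass + valueLengths[i] > capacityBytes) {
                return i;
            }
            if (updateBeforeCheck) {
                // Counter kept current, so later checks see the real total
                bytesReadInCurrentPass += valueLengths[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] lengths = {10, 10, 10, 10}; // 40 bytes in total
        int capacity = 25;                // assumed vector capacity in bytes

        // Stale counter: every check compares 0 + 10 against 25
        System.out.println(firstCapacityHit(lengths, capacity, false));
        // Current counter: the third value (index 2) would overflow
        System.out.println(firstCapacityHit(lengths, capacity, true));
    }
}
```

With the stale counter the check never fires even though the values total more than the capacity; with the counter kept current it fires at the third value.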

> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet 
> produces "low-density" batches: batches in which only 5% of each value vector 
> contains actual data, with the rest being unused space. When fed into the 
> sort, we end up buffering 95% of wasted space, using only 5% of available 
> memory to hold actual query data. The result is poor performance of the sort 
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use 
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, 
> vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
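
The per-column figures in the output above can be sanity-checked with simple arithmetic. The sketch below assumes that density is the ratio of data size to vector size, expressed as a percentage; the source does not state the exact formula or rounding rule, so {{DensitySketch}} and {{densityPercent}} are hypothetical names and the result is left unrounded.

```java
// Hypothetical helper for checking the reported densities.
// Assumption: density = data size / vector size, as a percentage.
class DensitySketch {

    /** Returns dataSize as a percentage of vectorSize (unrounded). */
    static double densityPercent(long dataSize, long vectorSize) {
        return 100.0 * dataSize / vectorSize;
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: 1129 records * 4 bytes = 4516 bytes of data
        // in a vector of 32768 rows * 4 bytes = 131072 bytes
        System.out.println(densityPercent(1129L * 4, 32768L * 4));

        // c_email_address: data size 30327, vector size 49152
        System.out.println(densityPercent(30327, 49152));
    }
}
```

Under that assumption the two columns come out at roughly 3.4% and 61.7%, close to the reported values of 4 and 62.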



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
