[
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868627#comment-15868627
]
Paul Rogers edited comment on DRILL-5266 at 2/15/17 9:56 PM:
-------------------------------------------------------------
The logic for determining field widths is confusing.
{code}
public int next() {
  ...
  if (allFieldsFixedLength) {
    ...
  } else { // variable length columns
    long fixedRecordsToRead = varLengthReader.readFields(recordsToRead, firstColumnStatus); // Read var
    readAllFixedFields(fixedRecordsToRead); // Read fixed
  }
{code}
The above claims that we call one method to read the variable-length fields, then
another to read the fixed-length fields. Fine, presumably we pack in the
variable-length fields, figure out how many records that yields, then read the
fixed-length data to match. Makes sense. But then:
{code}
public class VarLenBinaryReader {

  public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus) throws IOException {
    ...
    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
    ...
  }

  private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
    ...
    // check that the next record will fit in the batch
    if (exitLengthDeterminingLoop ||
        (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields()
            + totalVariableLengthData + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
{code}
That is, the *variable* length reader makes its decision about when to stop
based, in part, on the width of the *fixed* length fields. This contradicts the
earlier code, rendering the entire operation incoherent.
Given that the variable length width is not returned (see above), the
calculation reduces to dividing the batch size by the fixed-length record width.
This can all be refactored to be simpler.
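For illustration only, here is roughly what the size-determining loop works out to under that assumption. This is a sketch, not a patch; it reuses the names from the snippets above ({{parentReader.getBatchSize()}}, {{parentReader.getBitWidthAllFixedFields()}}, {{recordsToReadInThisPass}}) and assumes the variable-length totals never actually feed back into the limit check:
{code}
// Sketch only: if the variable-length contribution does not influence the decision,
// the check above degenerates to "how many fixed-width records fit in the batch",
// capped by the requested record count.
long fixedWidthPerRecord = parentReader.getBitWidthAllFixedFields();
long recordsThatFit = (fixedWidthPerRecord == 0)
    ? recordsToReadInThisPass
    : parentReader.getBatchSize() / fixedWidthPerRecord;
long recordsReadInCurrentPass = Math.min(recordsToReadInThisPass, recordsThatFit);
{code}
If that is indeed all the loop accomplishes, the per-record length-determining pass adds cost without adding information, which is why a simpler refactoring seems possible.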
> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% of wasted space, using only 5% of available
> memory to hold actual query data. The result is poor performance of the sort
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}