[ 
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868235#comment-15868235
 ] 

Paul Rogers commented on DRILL-5266:
------------------------------------

Looking at {{ParquetRecordReader}}, the target record count is 32K:

{code}
  private static final char DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH = 32*1024;
  ...
        recordsToRead = DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH;
{code}

But, this is changed to 1129 here:

{code}
      if (allFieldsFixedLength) {
        ...
      } else { // variable length columns
        long fixedRecordsToRead = varLengthReader.readFields(recordsToRead, firstColumnStatus); // Here
        readAllFixedFields(fixedRecordsToRead);
      }
{code}

That, in turn, is set here:

{code}
  public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus) throws IOException {
    ...
    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass); // Here
    ...
  private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
    ...
      // check that the next record will fit in the batch
      if (exitLengthDeterminingLoop ||
          (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields()
              + totalVariableLengthData + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
        break; // Breaks here at 1129
      }
{code}

Here, {{parentReader.getBatchSize()}} returns 2,097,152. 
{{parentReader.getBitWidthAllFixedFields()}} returns 1856.

Strangely, {{lengthVarFieldsInCurrentRecord}} is still zero, even when 
{{recordsReadInCurrentPass}} is 1129.
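Plugging those values into the break condition reproduces the cutoff exactly: with the variable-length terms stuck at zero, the loop admits records while (n + 1) * 1856 <= 2,097,152, which first fails at n = 1129. A standalone sketch of that arithmetic (not Drill code; the constants are the observed values):

{code:java}
public class BatchFit {
  // Mirrors the break condition in determineSizesSerial(), with
  // totalVariableLengthData and lengthVarFieldsInCurrentRecord fixed at 0.
  static long recordsThatFit(long batchSize, long bitWidthAllFixedFields) {
    long records = 0;
    while ((records + 1) * bitWidthAllFixedFields <= batchSize) {
      records++;
    }
    return records;
  }

  public static void main(String[] args) {
    // Observed values: batch size 2,097,152, fixed width 1856 per record.
    System.out.println(recordsThatFit(2_097_152, 1856)); // prints 1129
  }
}
{code}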

Things are seriously wacky. Take that batch size: 2,097,152 bytes (2 MB) spread over 
the 32K target records is only 64 bytes per record. But the above says that just the 
fixed-width fields add up to 1856 bytes per record. This, in turn, does not agree 
with the empirical measurement done by the sort, which computed 335 bytes per record. 
Further, how do we reconcile the 2 MB limit with the batch that Parquet actually 
returns to the sort, which is 32 MB in size? Perhaps the difference is just the 
overhead of null flags and offset vectors? And if we know the fixed-width part is 
1856 bytes, shouldn't we have allocated at least 1856 * 32K, about 61 MB, for the batch?
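The allocation arithmetic is easy to check (a sketch only; 1856 and the 32K target are the values observed above):

{code:java}
public class AllocationCheck {
  // Bytes needed to hold the fixed-width portion of a full target batch.
  static long neededBytes(long fixedWidthPerRecord, long targetRecords) {
    return fixedWidthPerRecord * targetRecords;
  }

  public static void main(String[] args) {
    // 1856 bytes of fixed-width data per record, 32K target records:
    System.out.println(neededBytes(1856, 32 * 1024)); // prints 60817408, about 61 MB
  }
}
{code}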

One mystery can be immediately solved. Someone does not know that Java does not 
support pass-by-reference semantics:

{code}
    long totalVariableLengthData = 0;
...
columnReader.determineSize(recordsReadInCurrentPass, lengthVarFieldsInCurrentRecord);
...
      totalVariableLengthData += lengthVarFieldsInCurrentRecord;
{code}

The intent, clearly, is to update the parameter with the computed length. But since 
Java passes arguments by value, the assignment inside the method never reaches the 
caller. This looks like a bug. But then the plot thickens:

{code}
  public boolean determineSize(long recordsReadInCurrentPass, Integer lengthVarFieldsInCurrentRecord) throws IOException {
    ...
    // Never used in this code path. Hard to remove because the method is overridden by subclasses
    lengthVarFieldsInCurrentRecord = -1;
{code}

Instead, the method just returns false when there is no more data to read. The 
only problem is, it never does so when reading the sample file.
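The pass-by-value point is easy to demonstrate in isolation. A minimal sketch (not Drill code; the names only echo the ones above):

{code:java}
public class PassByValueDemo {
  // Mimics determineSize(): assigning to the parameter rebinds only the
  // method's local copy; the caller's variable is untouched.
  static void determineSizeLike(Integer lengthVarFieldsInCurrentRecord) {
    lengthVarFieldsInCurrentRecord = -1;
  }

  static int demo() {
    Integer lengthVarFieldsInCurrentRecord = 0;
    determineSizeLike(lengthVarFieldsInCurrentRecord);
    return lengthVarFieldsInCurrentRecord; // still 0
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints 0
  }
}
{code}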

This means that the Parquet batch size logic is all bollixed up and needs 
attention.
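One possible shape for a fix, purely as a sketch (hypothetical names, not the actual Drill API): have the size-determining method return the measured length, and let the caller accumulate it, rather than assigning to a parameter:

{code:java}
public class ReturnLengthSketch {
  // Hypothetical replacement: return the variable-length bytes added by the
  // next record, or -1 when there is no more data to read.
  static int varLengthOfNextRecord(int[] recordLengths, int recordIndex) {
    if (recordIndex >= recordLengths.length) {
      return -1;
    }
    return recordLengths[recordIndex];
  }

  // The caller sums the returned values; no pass-by-reference needed.
  static long totalVariableLength(int[] recordLengths) {
    long total = 0;
    int index = 0;
    int length;
    while ((length = varLengthOfNextRecord(recordLengths, index++)) >= 0) {
      total += length;
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(totalVariableLength(new int[] {27, 30, 25})); // prints 82
  }
}
{code}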

> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet 
> produces "low-density" batches: batches in which only 5% of each value vector 
> contains actual data, with the rest being unused space. When fed into the 
> sort, we end up buffering 95% wasted space, using only 5% of available 
> memory to hold actual query data. The result is poor performance of the sort 
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use 
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
