[jira] [Commented] (DRILL-5266) Parquet Reader produces "low density" record batches

Paul Rogers (JIRA) Wed, 15 Feb 2017 10:59:05 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868361#comment-15868361
 ]


Paul Rogers commented on DRILL-5266:
------------------------------------

The Parquet reader code is inconsistent with its constants:

{code}
  public void setup(OperatorContext operatorContext, OutputMutator output) 
throws ExecutionSetupException {
    ...
    if (columnsToScan != 0  && allFieldsFixedLength) {
      recordsPerBatch = (int) Math.min(Math.min(batchSize / 
bitWidthAllFixedFields,
          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), 
65535);
    }
    else {
      recordsPerBatch = DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH;
    }
{code}

There is a constant for variable-length record count, but not fixed length; 
fixed length is hard-coded above.

> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet 
> produces "low-density" batches: batches in which only 5% of each value vector 
> contains actual data, with the rest being unused space. When fed into the 
> sort, we end up buffering 95% of wasted space, using only 5% of available 
> memory to hold actual query data. The result is poor performance of the sort 
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use 
> estimates. The following the the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 
> 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, 
> vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5266) Parquet Reader produces "low density" record batches

Reply via email to