[
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881389#comment-15881389
]
ASF GitHub Bot commented on DRILL-5266:
---------------------------------------
Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/749#discussion_r102830312
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -33,35 +33,52 @@
   ParquetRecordReader parentReader;
   final List<VarLengthColumn<? extends ValueVector>> columns;
   final boolean useAsyncTasks;
+  private final long targetRecordCount;

   public VarLenBinaryReader(ParquetRecordReader parentReader,
                             List<VarLengthColumn<? extends ValueVector>> columns) {
     this.parentReader = parentReader;
     this.columns = columns;
     useAsyncTasks = parentReader.useAsyncColReader;
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    if (totalFixedFieldWidth == 0) {
+      targetRecordCount = 0;
+    } else {
+      targetRecordCount = parentReader.getBatchSize() / totalFixedFieldWidth;
+    }
   }

   /**
    * Reads as many variable length values as possible.
    *
    * @param recordsToReadInThisPass - the number of records recommended for reading from the reader
-   * @param firstColumnStatus - a reference to the first column status in the parquet file to grab metatdata from
+   * @param firstColumnStatus - a reference to the first column status in the Parquet file to grab metadata from
    * @return - the number of fixed length fields that will fit in the batch
    * @throws IOException
    */
   public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus) throws IOException {
-    long recordsReadInCurrentPass = 0;
-
     // write the first 0 offset
     for (VarLengthColumn<?> columnReader : columns) {
       columnReader.reset();
     }
     Stopwatch timer = Stopwatch.createStarted();
-    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
-    if(useAsyncTasks){
+    long recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
+
+    // Can't read any more records than fixed width fields will fit.
+
+    if (targetRecordCount > 0) {
+      recordsToReadInThisPass = Math.min(recordsToReadInThisPass, targetRecordCount);
--- End diff --
I think you mean to update recordsReadInCurrentPass.
recordsToReadInThisPass is not used after this point, so what is the point of
updating it?
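The capping logic under discussion can be sketched in isolation as follows. This is a hypothetical stand-alone version, not the Drill code itself: `batchSizeBytes` and `bitWidthAllFixedFields` are stand-in parameters mirroring `parentReader.getBatchSize()` and `parentReader.getBitWidthAllFixedFields()` from the diff above.

```java
public class TargetRecordCount {
    // Hypothetical sketch of the constructor logic in the diff: cap the
    // number of records per batch so that the fixed-width columns alone
    // do not exceed the batch's byte budget.
    static long targetRecordCount(long batchSizeBytes, int bitWidthAllFixedFields) {
        // Convert the summed per-row bit width of all fixed fields to bytes.
        int totalFixedFieldWidth = bitWidthAllFixedFields / 8;
        if (totalFixedFieldWidth == 0) {
            // No fixed-width columns: no cap can be derived this way.
            return 0;
        }
        return batchSizeBytes / totalFixedFieldWidth;
    }

    public static void main(String[] args) {
        // e.g. a 256 KiB batch with 64 bits (8 bytes) of fixed data per row
        System.out.println(targetRecordCount(262144, 64)); // 32768
    }
}
```

As the review comment notes, the cap only has an effect if it is applied to the value that the caller actually consumes (recordsReadInCurrentPass), not to a local parameter that is no longer read.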
> Parquet Reader produces "low density" record batches - bits vs. bytes
> ---------------------------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Labels: ready-to-commit
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% of wasted space, using only 5% of available
> memory to hold actual query data. The result is poor performance of the sort
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}
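For reference, the "density" figure in the log above appears to be data size expressed as a percentage of the allocated vector size. The sketch below is an assumption, not Drill's actual code; in particular the ceiling rounding is a guess, chosen because it reproduces the logged figures (4516/131072 yields 4, 30327/49152 yields 62).

```java
public class BatchDensity {
    // Hypothetical helper: percentage of a vector's allocated bytes that
    // hold real data. Ceiling rounding is assumed to match the log above.
    static int densityPercent(long dataSize, long vectorSize) {
        return (int) Math.ceil(100.0 * dataSize / vectorSize);
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: data size 4516, vector size 131072 -> density 4
        System.out.println(densityPercent(4516, 131072));
        // c_email_address: data size 30327, vector size 49152 -> density 62
        System.out.println(densityPercent(30327, 49152));
    }
}
```

On this reading, a density of 5 for the whole batch means 95% of the buffered memory is unused space, which is exactly the waste the issue describes feeding into the sort.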
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)