[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527425#comment-15527425 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80796011
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff ---
    
    Are you referring to code here:
    {code}
    // Pick the minimum of recordsPerBatch calculated above, batchSize we got from rowGroupScan (based on limit)
    // and user configured batchSize value.
    recordsPerBatch = (int) Math.min(Math.min(recordsPerBatch, batchSize),
        fragmentContext.getOptions().getOption(ExecConstants.PARQUET_RECORD_BATCH_SIZE).num_val.intValue());
    {code}
    
    If I understand correctly, batchSize in ParquetRecordReader comes from ParquetRowGroupScan, which comes from ParquetGroupScan, which is set to DEFAULT_BATCH_LENGTH. If I have a RG with 512K rows, and I set "store.parquet.record_batch_size" to 512K, will your code honor this 512K batch size, or will it use DEFAULT_BATCH_LENGTH since it's the smallest?
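    
    For concreteness, here is a standalone sketch of that min logic under the scenario above (plain Java; the option lookup is simplified to a local variable, and the 512K recordsPerBatch is just an assumed value):
    
    {code}
    public class BatchSizeMinDemo {
      // Mirrors DEFAULT_BATCH_LENGTH from the diff above: 256K.
      static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    
      public static void main(String[] args) {
        int recordsPerBatch = 512 * 1024;     // assume memory math allowed the full RG
        int batchSize = DEFAULT_BATCH_LENGTH; // value handed down via ParquetRowGroupScan
        int userOption = 512 * 1024;          // store.parquet.record_batch_size = 512K
    
        int effective = Math.min(Math.min(recordsPerBatch, batchSize), userOption);
        System.out.println(effective);        // prints 262144: the 512K setting is not honored
      }
    }
    {code}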
    
    Also, if "store.parquet.record_batch_size" is set to be different from 
DEFAULT_BATCH_LENGTH, why would we still use DEFAULT_BATCH_LENGTH in 
ParquetGroupScan / ParquetRowGroupScan?  People might be confused if they look 
at the serialized physical plan, which shows "batchSize = DEFAULT_BATCH_LENGTH. 
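    
    Purely as a hypothetical sketch of the alternative (not the actual patch, and "optionManager" stands in for whatever option source the group scan has at plan time), batchSize could be seeded from the session option using the same lookup pattern as above, so the serialized plan shows the value actually in effect:
    
    {code}
    // Hypothetical: initialize batchSize from the session option instead of the
    // compile-time constant, so the serialized physical plan reflects the real value.
    this.batchSize = optionManager
        .getOption(ExecConstants.PARQUET_RECORD_BATCH_SIZE).num_val.intValue();
    {code}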



> Push down the LIMIT to the parquet reader scan to limit the number of records read
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to the parquet reader.
> For queries like
> select * from <table> limit N; 
> where N < size of the Parquet row group, we are reading 32K/64K rows or the entire row group. This needs to be optimized to read only N rows.
>  
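
For context, a minimal sketch of the behavior the issue describes (the helpers here are illustrative stubs, not Drill's actual reader API):

{code}
// Hypothetical sketch of the pushdown: stop scanning once N records are
// produced, instead of always reading a full 32K/64K batch or the whole row group.
public class LimitPushdownSketch {
  static final int BATCH = 32 * 1024;

  // Stub standing in for the reader: pretend it reads exactly 'max' rows.
  static int readBatch(int max) { return max; }

  public static void main(String[] args) {
    int remaining = 100;                                // N in "select * ... limit N"
    int totalRead = 0;
    while (remaining > 0) {
      int read = readBatch(Math.min(remaining, BATCH)); // cap each batch at what's left
      totalRead += read;
      remaining -= read;
    }
    System.out.println(totalRead);                      // 100, not 32K
  }
}
{code}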



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
