[ 
https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142269#comment-15142269
 ] 

ASF GitHub Bot commented on DRILL-4363:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/371#discussion_r52565738
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -485,12 +486,14 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S
         private EndpointByteMap byteMap;
         private int rowGroupIndex;
         private String root;
    +    private long rowCount;
     
         @JsonCreator
         public RowGroupInfo(@JsonProperty("path") String path, @JsonProperty("start") long start,
    -        @JsonProperty("length") long length, @JsonProperty("rowGroupIndex") int rowGroupIndex) {
    +        @JsonProperty("length") long length, @JsonProperty("rowGroupIndex") int rowGroupIndex, long rowCount) {
    --- End diff --
    
    Added the comment.
    
    Passed -1 as rowCount in the unused method
    TestAffinityCalculator.buildRowGroups(). That is just to make the code
    compile.


> Apply row count based pruning for parquet table in LIMIT n query
> ----------------------------------------------------------------
>
>                 Key: DRILL-4363
>                 URL: https://issues.apache.org/jira/browse/DRILL-4363
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jinfeng Ni
>            Assignee: Aman Sinha
>             Fix For: 1.6.0
>
>
> In the interactive data exploration use case, a common and probably first
> query that users run is "SELECT * FROM table LIMIT n", where n is a small
> number. Such a query gives the user an idea of the columns in the table.
> Normally, users expect such a query to complete in a very short time, since
> it asks for only a small number of rows, without any sort/aggregation.
> When the table is small, this is not a big problem for Drill. However, when
> the table is extremely large, Drill's response time is not as fast as users
> would expect.
> In the case of a Parquet table, it seems that the query planner could do a
> better job by applying row-count-based pruning to such a LIMIT n query. The
> pruning is similar to what partition pruning does, except that it uses row
> counts instead of partition column values. Since the row count is available
> for a Parquet table, such pruning is possible.
> The benefits of such pruning are clear: 1) for small "n", the pruning
> leaves a few Parquet files to scan, instead of thousands or millions;
> 2) execution probably does not have to split the scan into multiple minor
> fragments and start reading the files concurrently, which would cause big
> IO overhead; 3) the physical plan itself is much smaller, since it does not
> include the long list of Parquet files, which reduces the RPC cost of
> sending the fragment plans to multiple drillbits and the overhead of
> serializing/deserializing the fragment plans.
>  
>  
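
The pruning idea described in the issue can be illustrated with a minimal
sketch. This is not Drill's implementation: the class RowCountPruningSketch,
the RowGroup type, and the method pruneForLimit are hypothetical names
invented for illustration, and the sketch assumes each row group's row count
is already known to the planner. It keeps row groups in their listed order
until their cumulative row count covers the LIMIT.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Drill APIs.
final class RowCountPruningSketch {

    // Hypothetical stand-in for per-row-group metadata (file path + row count).
    static final class RowGroup {
        final String path;
        final long rowCount;

        RowGroup(String path, long rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    // Keeps row groups, in order, until their cumulative row count reaches
    // the limit. Everything after that point is pruned from the scan.
    static List<RowGroup> pruneForLimit(List<RowGroup> rowGroups, long limit) {
        List<RowGroup> kept = new ArrayList<>();
        long rowsSoFar = 0;
        for (RowGroup rg : rowGroups) {
            if (rowsSoFar >= limit) {
                break; // enough rows are already covered
            }
            kept.add(rg);
            rowsSoFar += rg.rowCount;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = new ArrayList<>();
        groups.add(new RowGroup("a.parquet", 100));
        groups.add(new RowGroup("b.parquet", 100));
        groups.add(new RowGroup("c.parquet", 100));

        // LIMIT 150 needs only the first two row groups in this example.
        for (RowGroup rg : pruneForLimit(groups, 150)) {
            System.out.println(rg.path);
        }
    }
}
```

With a small limit the kept list stays short no matter how many row groups
the table has, which is what makes the scan and the serialized plan smaller.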



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
