[
https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139860#comment-15139860
]
ASF GitHub Bot commented on DRILL-4363:
---------------------------------------
GitHub user jinfengni opened a pull request:
https://github.com/apache/drill/pull/371
DRILL-4363: Row count based pruning for parquet table used in Limit n query.
Modify two existing unit test cases:
1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning
is applied after the false condition is transformed into LIMIT 0
2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the
test case to use a JSON source, so that it is not affected by PushLimitIntoScanRule.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jinfengni/incubator-drill DRILL-4363
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/371.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #371
----
commit a84d61fe2b820fe8395e73347dfb0e2986ed9dd0
Author: Jinfeng Ni <[email protected]>
Date: 2016-02-02T23:31:47Z
DRILL-4363: Row count based pruning for parquet table used in Limit n query.
Modify two existing unit test cases:
1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning
is applied after the false condition is transformed into LIMIT 0
2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the
test case to use a JSON source, so that it is not affected by PushLimitIntoScanRule.
----
> Apply row count based pruning for parquet table in LIMIT n query
> ----------------------------------------------------------------
>
> Key: DRILL-4363
> URL: https://issues.apache.org/jira/browse/DRILL-4363
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Fix For: 1.6.0
>
>
> In interactive data exploration, one common query, and probably the first one
> users run, is "SELECT * FROM table LIMIT n", where n is a small
> number. Such a query gives the user an idea of the columns in the table.
> Normally, the user expects such a query to complete in a very short
> time, since it asks for only a small number of rows, without any
> sort/aggregation.
> When the table is small, this is not a big problem for Drill. However, when the
> table is extremely large, Drill's response time is not as fast as the user
> would expect.
> For a parquet table, it seems that the query planner could do a bit better
> by applying row count based pruning to such a LIMIT n query. The
> pruning is similar to what partition pruning does, except that it
> uses row counts instead of partition column values. Since the row count is
> available for a parquet table, such pruning is possible.
> The benefits of such pruning are clear: 1) For small "n", the pruning
> leaves only a few parquet files to scan, instead of thousands or millions.
> 2) Execution probably does not have to split the scan into multiple
> minor fragments and start reading the files concurrently, which causes
> big IO overhead. 3) The physical plan itself is much smaller, since it does
> not include the long list of parquet files; this reduces the RPC cost of
> sending the fragment plans to multiple drillbits, and the overhead of
> serializing/deserializing the fragment plans.
>
>
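The pruning idea described above can be illustrated with a minimal sketch. This is not Drill's implementation (which lives in the Java planner); it only shows the selection logic under stated assumptions. The names `ParquetFile` and `prune_by_row_count` are hypothetical, and keeping at least one file for LIMIT 0 (so that a scan could still report a schema) is an assumption made for this sketch, not something stated in the issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParquetFile:
    """Hypothetical stand-in for a parquet file and its known row count."""
    path: str
    row_count: int


def prune_by_row_count(files: List[ParquetFile], limit: int) -> List[ParquetFile]:
    """Keep the shortest prefix of `files` whose row counts add up to `limit`.

    Assumption for this sketch: at least one file is kept even when
    limit == 0, as long as the input list is not empty.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")

    selected: List[ParquetFile] = []
    rows_so_far = 0
    for f in files:
        selected.append(f)
        rows_so_far += f.row_count
        # Stop as soon as the selected files can satisfy the LIMIT.
        if rows_so_far >= limit:
            break
    return selected


files = [
    ParquetFile("part-0.parquet", 100),
    ParquetFile("part-1.parquet", 50),
    ParquetFile("part-2.parquet", 200),
]
pruned = prune_by_row_count(files, 120)
print([f.path for f in pruned])
```

With the sample row counts above, a LIMIT of 120 needs only the first two files, so the third never has to appear in the plan. If the limit exceeds the total row count, all files are kept and nothing is pruned.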
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)