[
https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141278#comment-15141278
]
ASF GitHub Bot commented on DRILL-4363:
---------------------------------------
Github user jacques-n commented on a diff in the pull request:
https://github.com/apache/drill/pull/371#discussion_r52494516
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java
---
@@ -0,0 +1,107 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.planner.logical;
+
+import com.google.common.collect.ImmutableList;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.util.Pair;
+import org.apache.drill.exec.physical.base.GroupScan;
+import org.apache.drill.exec.planner.logical.partition.PruneScanRule;
+import org.apache.drill.exec.store.parquet.ParquetGroupScan;
+
+import java.io.IOException;
+import java.util.concurrent.TimeUnit;
+
+public abstract class DrillPushLimitToScanRule extends RelOptRule {
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillPushLimitToScanRule.class);
+
+  private DrillPushLimitToScanRule(RelOptRuleOperand operand, String description) {
+    super(operand, description);
+  }
+
+  public static DrillPushLimitToScanRule LIMIT_ON_SCAN = new DrillPushLimitToScanRule(
+      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.any(DrillScanRel.class)), "DrillPushLimitToScanRule_LimitOnScan") {
+    @Override
+    public boolean matches(RelOptRuleCall call) {
+      DrillScanRel scanRel = call.rel(1);
+      return scanRel.getGroupScan() instanceof ParquetGroupScan; // It only applies to Parquet.
+    }
+
+    @Override
+    public void onMatch(RelOptRuleCall call) {
+      DrillLimitRel limitRel = call.rel(0);
+      DrillScanRel scanRel = call.rel(1);
+      doOnMatch(call, limitRel, scanRel, null);
+    }
+  };
+
+  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT = new DrillPushLimitToScanRule(
+      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.some(DrillProjectRel.class, RelOptHelper.any(DrillScanRel.class))), "DrillPushLimitToScanRule_LimitOnProject") {
+    @Override
+    public boolean matches(RelOptRuleCall call) {
+      DrillScanRel scanRel = call.rel(2);
+      return scanRel.getGroupScan() instanceof ParquetGroupScan; // It only applies to Parquet.
+    }
+
+    @Override
+    public void onMatch(RelOptRuleCall call) {
+      DrillLimitRel limitRel = call.rel(0);
+      DrillProjectRel projectRel = call.rel(1);
+      DrillScanRel scanRel = call.rel(2);
+      doOnMatch(call, limitRel, scanRel, projectRel);
+    }
+  };
+
+
+  protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel, DrillScanRel scanRel, DrillProjectRel projectRel){
+    try {
+      final int rowCountRequested = (int) limitRel.getRows();
+
+      final Pair<GroupScan, Boolean> newGroupScanPair =
+          ParquetGroupScan.filterParquetScanByLimit((ParquetGroupScan)(scanRel.getGroupScan()), rowCountRequested);
--- End diff --
Oh wait, GroupScan should be treated as immutable. I shouldn't have
suggested that interface. How about:
GroupScan applyLimit(int maxRecords)
returning a new GroupScan when the limit is applied, and null otherwise?
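A minimal sketch of how that contract could be used, assuming applyLimit is
added to GroupScan as suggested above; the rewrite logic around it is
hypothetical and not the actual code in this PR:

    // GroupScan stays immutable: applyLimit returns a new GroupScan limited to
    // at most maxRecords rows, or null when the limit changes nothing.
    // (Hypothetical signature, taken from the suggestion above.)
    // GroupScan applyLimit(int maxRecords);

    protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel,
        DrillScanRel scanRel, DrillProjectRel projectRel) {
      final int rowCountRequested = (int) limitRel.getRows();
      final GroupScan newGroupScan = scanRel.getGroupScan().applyLimit(rowCountRequested);
      if (newGroupScan == null) {
        return; // the limit did not prune anything; leave the plan unchanged
      }
      // ... build a new DrillScanRel around newGroupScan and call.transformTo(...) ...
    }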
> Apply row count based pruning for parquet table in LIMIT n query
> ----------------------------------------------------------------
>
> Key: DRILL-4363
> URL: https://issues.apache.org/jira/browse/DRILL-4363
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Jinfeng Ni
> Assignee: Aman Sinha
> Fix For: 1.6.0
>
>
> In the interactive data exploration use case, one common and probably first
> query that users run is "SELECT * FROM table LIMIT n", where n is a small
> number. Such a query gives the user an idea of the columns in the table.
> Users normally expect such a query to complete very quickly, since it asks
> for only a small number of rows, with no sort or aggregation.
> When the table is small, this is not a problem for Drill. However, when the
> table is extremely large, Drill's response time is not as fast as users
> would expect.
> For a parquet table, the query planner could do a better job by applying
> row-count-based pruning to such LIMIT n queries. The pruning is similar to
> partition pruning, except that it uses row counts instead of partition
> column values. Since the row count is available for a parquet table, such
> pruning is possible.
> The benefits of such pruning are clear: 1) for a small "n", the pruning
> leaves only a few parquet files to scan, instead of thousands or millions;
> 2) execution probably does not have to split the scan into multiple minor
> fragments that read the files concurrently, which would cause large IO
> overhead; 3) the physical plan itself is much smaller, since it does not
> include the long list of parquet files, which reduces the RPC cost of
> sending the fragment plans to multiple drillbits and the overhead of
> serializing/deserializing them.
>
>
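To make the row-count-based pruning idea above concrete, here is a small,
self-contained sketch (hypothetical, not the Drill implementation): select
parquet files until their cumulative row count covers the requested LIMIT n,
so only that prefix of files ends up in the scan.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class RowCountPruningSketch {

      // Illustrative stand-in for per-file parquet metadata (path + row count).
      static final class FileRowCount {
        final String path;
        final long rowCount;
        FileRowCount(String path, long rowCount) {
          this.path = path;
          this.rowCount = rowCount;
        }
      }

      // Returns the smallest prefix of 'files' whose total row count reaches 'limit'.
      static List<FileRowCount> pruneByLimit(List<FileRowCount> files, long limit) {
        final List<FileRowCount> selected = new ArrayList<>();
        long rows = 0;
        for (FileRowCount f : files) {
          selected.add(f);
          rows += f.rowCount;
          if (rows >= limit) {
            break; // enough rows to satisfy LIMIT n; skip the remaining files
          }
        }
        return selected;
      }

      public static void main(String[] args) {
        List<FileRowCount> files = Arrays.asList(
            new FileRowCount("part-0.parquet", 1000),
            new FileRowCount("part-1.parquet", 1000),
            new FileRowCount("part-2.parquet", 1000));
        // LIMIT 10 needs only the first file instead of all three.
        System.out.println(pruneByLimit(files, 10).size()); // prints 1
      }
    }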
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)