[
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241661#comment-15241661
]
Khurram Faraaz edited comment on DRILL-4589 at 4/14/16 6:15 PM:
----------------------------------------------------------------
The following tests will be executed to verify this change.
{noformat}
There are 25 directories (1990 through 2015), and each directory has 4 sub directories (Q1, Q2, Q3 and Q4)
and each of those sub directories has 2000 parquet files (each being ~2KB in size)
REFRESH TABLE METADATA `DRILL_4589`
will be executed over the root directory and tests similar to those listed below (and more) will be executed.
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NOT NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 53;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 <= 97;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%' AND c7 = '1958-04-24';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 IN (...)
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND LENGTH(c5) >= 1 AND LENGTH(c5) <= 172;
{noformat}
> Reduce planning time for file system partition pruning by reducing filter
> evaluation overhead
> ---------------------------------------------------------------------------------------------
>
> Key: DRILL-4589
> URL: https://issues.apache.org/jira/browse/DRILL-4589
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files
> organized into multi-level directories, users typically provide a
> partition filter such as: dir0 = something and dir1 = something2 and .. .
> For such queries, we saw that the query planning time could be unacceptably
> long, due to three main overheads: 1) expanding and getting the list of files,
> 2) evaluating the partition filter, and 3) getting the metadata, in the case
> of parquet files for which a metadata cache file is not available.
> DRILL-2517 targets the third overhead. As follow-up work after
> DRILL-2517, we plan to reduce the filter evaluation overhead. Currently,
> partition filter evaluation is applied at the file level. In many cases, we saw
> that the number of leaf subdirectories is significantly lower than the number
> of files. Since all the files under the same leaf subdirectory share the same
> directory metadata, we should apply the filter evaluation at the leaf
> subdirectory level. By doing that, we could reduce the CPU overhead of
> evaluating the filter, and the memory overhead as well.
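The idea in the description above can be illustrated with a small Python sketch. It is not Drill's implementation (Drill is written in Java); the function names `partition_values` and `prune_by_leaf_directory`, the root path `/data/DRILL_4589`, and the scaled-down file layout are all assumptions made for illustration. The sketch groups files by their leaf directory and evaluates the partition predicate once per directory instead of once per file.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_values(directory, root):
    """Map a directory path to dir0, dir1, ... values relative to root."""
    parts = PurePosixPath(directory).relative_to(root).parts
    return {f"dir{i}": part for i, part in enumerate(parts)}


def prune_by_leaf_directory(files, root, predicate):
    """Evaluate predicate once per leaf directory.

    Returns the surviving files and the number of predicate evaluations.
    """
    # Group files by their parent (leaf) directory.
    by_dir = defaultdict(list)
    for f in files:
        by_dir[str(PurePosixPath(f).parent)].append(f)

    kept = []
    evaluations = 0
    for directory, members in by_dir.items():
        evaluations += 1
        # All files in a leaf directory share the same dirN values,
        # so one evaluation decides the fate of every file in it.
        if predicate(partition_values(directory, root)):
            kept.extend(members)
    return kept, evaluations


# Hypothetical, scaled-down layout: 3 years x 4 quarters x 5 files.
root = "/data/DRILL_4589"
files = [
    f"{root}/{year}/{quarter}/{n}.parquet"
    for year in range(1990, 1993)
    for quarter in ("Q1", "Q2", "Q3", "Q4")
    for n in range(5)
]


def wanted(dirs):
    # Equivalent in spirit to: dir0 = '1991' AND dir1 = 'Q2'
    return dirs.get("dir0") == "1991" and dirs.get("dir1") == "Q2"


kept, evaluations = prune_by_leaf_directory(files, root, wanted)
print(len(files), evaluations, len(kept))
```

In this sketch the predicate runs 12 times for 60 files. With the test layout described in the comment (25 directories, 4 subdirectories each, 2000 files per subdirectory), the same approach would evaluate the filter 100 times rather than 200,000 times.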