[ 
https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942863#comment-16942863
 ] 

benj commented on DRILL-7004:
-----------------------------

Despite the parallelization, on a system containing several hundred 
thousand files it is far too slow and therefore unusable.

Example:
{code:java}
DIR_ROOT
|--DIR_DAY_x
   |--DIR_HOUR_y
      |--File_MINUTE_z

with x from 1 to 31, y from 0 to 23, z from 0 to 59
{code}
{code:sql}
/* mydfs.myminutes : location = "DIR_ROOT/DIR_DAY_1/DIR_HOUR_0" */
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.myminutes' AND is_file = true;
-- => time ~ 0.65 seconds - 60 files > (time of unix find: 0.042s)

/* mydfs.myhours : location = "DIR_ROOT/DIR_DAY_1" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.myhours' AND is_file = true;
-- => time ~ 9 seconds - 1440 files (60*24) >>> (time of unix find: 0.095s)

/* mydfs.mydays : location = "DIR_ROOT/" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.mydays' AND is_file = true;
-- => time ~ 417 seconds - 44640 files (60*24*31) >>>>>> (time of unix find: 1.5s (with print))
{code}
It is understandable that there is overhead compared to Unix tools, but the 
average time per file is far too high: about 0.01 s here (417 s / 44640 files), 
i.e. roughly 2h30 to scan 1 million files.
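As a rough check of that per-file figure, the timings reported above can be reduced to a cost per file and extrapolated. This is only a sketch: the function names are made up for illustration, and scaling linearly to one million files is an assumption, not something measured.

```python
# Timings copied from the three queries above: (files listed, seconds).
measurements = {
    "mydfs.myminutes": (60, 0.65),
    "mydfs.myhours": (60 * 24, 9.0),
    "mydfs.mydays": (60 * 24 * 31, 417.0),
}


def seconds_per_file(files, seconds):
    """Average listing cost of one file."""
    return seconds / files


def extrapolate_hours(files, seconds, target_files=1_000_000):
    """Hours needed for target_files, assuming the cost scales linearly."""
    return seconds_per_file(files, seconds) * target_files / 3600


for schema, (files, seconds) in measurements.items():
    ms = seconds_per_file(files, seconds) * 1000
    hours = extrapolate_hours(files, seconds)
    print(f"{schema}: {files} files, {ms:.1f} ms/file, "
          f"~{hours:.1f} h per million files")
```

The per-file cost stays in the same range (several milliseconds) at all three depths, so the total time grows roughly in proportion to the number of files.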
It is a pity that it is much more efficient to run 
`find path/ -type f > mytmp.csv` and then `SELECT * FROM mytmp.csv` 
_(with the necessary permissions)_.
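For anyone who wants the same workaround without `find`, here is a minimal sketch that writes the listing file from Python. The function name `write_file_listing` is hypothetical, and the sketch assumes that paths contain no commas or newlines, since it writes one raw path per line.

```python
import os


def write_file_listing(root, out_path):
    """Write every regular file under root to out_path, one path per line.

    Mirrors `find root -type f`: symlinks and special files are skipped.
    Returns the number of paths written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # isfile() follows symlinks, so exclude links explicitly.
                if os.path.isfile(full) and not os.path.islink(full):
                    out.write(full + "\n")
                    count += 1
    return count


# Example call (paths are placeholders):
# write_file_listing("DIR_ROOT", "mytmp.csv")
```

The resulting file can then be queried from Drill like the `mytmp.csv` above, given the necessary permissions.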

> improve show files functionnality
> ---------------------------------
>
>                 Key: DRILL-7004
>                 URL: https://issues.apache.org/jira/browse/DRILL-7004
>             Project: Apache Drill
>          Issue Type: Wish
>          Components: Storage - Other
>    Affects Versions: 1.15.0
>            Reporter: benj
>            Priority: Major
>
> Currently, it is possible to show files/directories in a particular 
> directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to improve this functionality with:
>  * possibility to list recursively
>  * possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
>  * possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
