[
https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942863#comment-16942863
]
benj commented on DRILL-7004:
-----------------------------
Despite the parallelization, for a system containing several hundred thousand
files, listing is far too slow and therefore unusable.
Example:
{code:java}
DIR_root
|--DIR_DAY_x
   |--DIR_HOUR_y
      |--File_MINUTE_z
with x from 1 to 31, y from 0 to 23, z from 0 to 59
{code}
{code:sql}
/* mydfs.myminutes : location = "DIR_ROOT/DIR_DAY_1/DIR_HOUR_0" */
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.myminutes' AND is_file = true;
-- => time ~ 0.65 seconds - 60 files > (time of unix find: 0.042s)

/* mydfs.myhours : location = "DIR_ROOT/DIR_DAY_1" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.myhours' AND is_file = true;
-- => time ~ 9 seconds - 1440 files (60*24) >>> (time of unix find: 0.095s)

/* mydfs.mydays : location = "DIR_ROOT/" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES`
WHERE schema_name = 'mydfs.mydays' AND is_file = true;
-- => time ~ 417 seconds - 44640 files (60*24*31) >>>>>> (time of unix find: 1.5s (with print))
{code}
It's understandable that there is overhead compared to unix tools, but the
average time per file is far too high: here about 0.01s (i.e. roughly 2h30 to
scan 1 million files).
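
For reference, the per-file figures can be recomputed from the three timings above. This is only a sketch of the arithmetic: the numbers are the ones quoted in this comment, the helper name `per_file_seconds` is made up for the illustration, and the extrapolation assumes the cost keeps growing linearly with the number of files.

```python
# Timings quoted above: (workspace, files listed, Drill seconds, unix find seconds)
measurements = [
    ("mydfs.myminutes", 60, 0.65, 0.042),
    ("mydfs.myhours", 60 * 24, 9.0, 0.095),
    ("mydfs.mydays", 60 * 24 * 31, 417.0, 1.5),
]


def per_file_seconds(files, seconds):
    """Average listing cost per file, in seconds."""
    return seconds / files


for name, files, drill_s, find_s in measurements:
    cost_ms = per_file_seconds(files, drill_s) * 1000
    ratio = drill_s / find_s
    print(f"{name}: {files} files, {cost_ms:.2f} ms/file, {ratio:.0f}x slower than find")

# Linear extrapolation of the largest run to one million files (an assumption).
per_file = per_file_seconds(60 * 24 * 31, 417.0)
hours_for_million = per_file * 1_000_000 / 3600
print(f"about {hours_for_million:.1f} hours for 1 million files")
```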
It's a pity that it is actually far more efficient to run `find path/ -type f >
mytmp.csv` and then `SELECT * FROM mytmp.csv` _(with the necessary permissions)_.
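
Where `find` is not available, the same workaround can be sketched in a few lines of Python. This is a rough equivalent under stated assumptions, not anything Drill provides: the function name `list_files_to_csv` is hypothetical, and the output file should be written outside the scanned tree so that it does not list itself.

```python
import csv
import os


def list_files_to_csv(root, out_path):
    """Write one row per regular file under `root`; return the number of rows.

    Rough equivalent of `find root -type f > out_path`.
    """
    count = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Skip symlinks so that only regular files are kept, as with -type f.
                if os.path.isfile(full) and not os.path.islink(full):
                    writer.writerow([full])
                    count += 1
    return count
```

The resulting file can then be queried from Drill like any other CSV file, as in the `SELECT` mentioned above.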
> improve show files functionality
> --------------------------------
>
> Key: DRILL-7004
> URL: https://issues.apache.org/jira/browse/DRILL-7004
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Other
> Affects Versions: 1.15.0
> Reporter: benj
> Priority: Major
>
> For now, it's possible to show the files/directories in a particular
> directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to extend this functionality with:
> * possibility to list recursively
> * possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
> * possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>