[ 
https://issues.apache.org/jira/browse/DRILL-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731080#comment-17731080
 ] 

Philippe Audet commented on DRILL-8425:
---------------------------------------

Hi, I investigated the issue a bit and found that the function 
FileSelection.minusDirectories() is the bottleneck. I'm not sure whether 
that is because it instantiates too many threads at the same time, but it is 
almost one thread per subdirectory. From what I understand, it does not look 
trivial to narrow the search by updating the root dir.

> Directory pruning issue with queries including joins. 
> ------------------------------------------------------
>
>                 Key: DRILL-8425
>                 URL: https://issues.apache.org/jira/browse/DRILL-8425
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.21.0, 1.19.0
>            Reporter: Loy2
>            Priority: Major
>
> Performance degrades based on the number of files present in the directory 
> structure when running the same query on one day of data.
> I'm using partitioned directories
> ./product/year/month/day
> ./command/year/month/day
> each containing a particular parquet file (tested with CSV as well).
> If I query a table for one day, say select * from dfs.root.product where dir0 
> = 2023 and dir1 = 04 and dir2 = 12; then only the file located in  
> ./product/year/month/day/product.parquet is accessed (as expected)
> Now if I do a join query between product and command for a particular day
> {quote}
> SELECT p.field1 , p.field2, c.field2 FROM dfs.root.command as c
> LEFT JOIN dfs.root.product as p
> on p.field1 = c.field1
> where p.dir0 = 2023
> and p.dir1 = 04
> and p.dir2 = 12
> and c.dir0 = 2023
> and c.dir1 = 04
> and c.dir2 = 12;
> {quote}
> I can see in the log (debug mode) that the entire directory structure is 
> scanned, not just the 2 files concerned,
> so the more files (years, months) you have in the DFS, the more heap memory 
> you use and the more time it takes to get the results.
> (Posted in the Slack channel: 
> https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
