Loy2 created DRILL-8425:
---------------------------

             Summary: Directory pruning issue with queries including joins. 
                 Key: DRILL-8425
                 URL: https://issues.apache.org/jira/browse/DRILL-8425
             Project: Apache Drill
          Issue Type: Bug
          Components: Functions - Drill
    Affects Versions: 1.19.0, 1.21.0
            Reporter: Loy2


Performance degradation base on the number of files present in the directory 
structure when using the same query on one day of data

I'm using partitioned directories
./product/year/month/day
./command/year/month/day
 each contain a particular  parquet file. (tested with csv as well)
If I query a table for one day, say select * from dfs.root.product where dir0 = 
2023 and dir1 = 04 and dir2 = 12; then only the file located in  
./product/year/month/day/product.parquet is accessed (as expected)
Now if I do a join query between product and command for a particular day
{quote}
SELECT p.field1 , p.field2, c.field2 FROM dfs.root.command as c
LEFT JOIN dfs.root.product as p
on p.field1 = c.field1
where p.dir0 = 2023
and p.dir1 = 04
and p.dir2 = 12
and c.dir0 = 2023
and c.dir1 = 04
and c.dir2 = 12;
{quote}
I can see in the log (debug mode) that all the directory structures is scanned 
and not just the 2 concerned files
so the more file (year month) you have in the DFS the more heap memory you use 
and the more time it takes to get the results

(posted in slack channel 
(https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to