Loy2 created DRILL-8425:
---------------------------

             Summary: Directory pruning issue with queries including joins.
                 Key: DRILL-8425
                 URL: https://issues.apache.org/jira/browse/DRILL-8425
             Project: Apache Drill
          Issue Type: Bug
          Components: Functions - Drill
    Affects Versions: 1.19.0, 1.21.0
            Reporter: Loy2

Query performance degrades with the number of files present in the directory structure, even when the same query targets a single day of data.

I'm using partitioned directories

./product/year/month/day
./command/year/month/day

Each contains a particular parquet file (tested with csv as well).

If I query one table for one day, say

select * from dfs.root.product where dir0 = 2023 and dir1 = 04 and dir2 = 12;

then only the file located at ./product/year/month/day/product.parquet is accessed (as expected).

Now if I do a join query between product and command for a particular day

{quote}
SELECT p.field1, p.field2, c.field2
FROM dfs.root.command as c
LEFT JOIN dfs.root.product as p on p.field1 = c.field1
where p.dir0 = 2023 and p.dir1 = 04 and p.dir2 = 12
and c.dir0 = 2023 and c.dir1 = 04 and c.dir2 = 12;
{quote}

I can see in the log (debug mode) that the whole directory structure is scanned, not just the 2 files concerned. So the more files (years, months) you have in the DFS, the more heap memory is used and the more time it takes to get the results.

(Posted in the Slack channel: https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
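
One way to check the pruning behaviour described in this report without reading the debug log is to ask Drill for the query plan. The following is only a sketch, assuming the same dfs.root workspace, tables and columns as in the report. The plan text differs between Drill versions, so no expected output is shown. In the plan, the scan for each table typically lists the files it will read, and with working pruning that should be one file per table.

{code:sql}
-- Plan the join without executing it, then inspect the scan
-- entries for dfs.root.command and dfs.root.product.
EXPLAIN PLAN FOR
SELECT p.field1, p.field2, c.field2
FROM dfs.root.command AS c
LEFT JOIN dfs.root.product AS p ON p.field1 = c.field1
WHERE p.dir0 = 2023 AND p.dir1 = 04 AND p.dir2 = 12
  AND c.dir0 = 2023 AND c.dir1 = 04 AND c.dir2 = 12;
{code}

A possible workaround, which is an assumption and has not been verified against 1.19.0 or 1.21.0, is to apply the directory filters inside subqueries so that each table is restricted before the join. Note that this form keeps command rows that have no matching product row, whereas the WHERE conditions on p in the original query discard them.

{code:sql}
-- Hypothetical rewrite: filter each side on dir0/dir1/dir2 first, then join.
SELECT p.field1, p.field2, c.field2
FROM (SELECT field1, field2
      FROM dfs.root.command
      WHERE dir0 = 2023 AND dir1 = 04 AND dir2 = 12) AS c
LEFT JOIN (SELECT field1, field2
           FROM dfs.root.product
           WHERE dir0 = 2023 AND dir1 = 04 AND dir2 = 12) AS p
  ON p.field1 = c.field1;
{code}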