johnnystargazer opened a new issue #2402: URL: https://github.com/apache/drill/issues/2402
**Describe the bug**
Drill scans all Parquet files under the query root for metadata if the query contains an `INNER JOIN`.

**To Reproduce**
Steps to reproduce the behavior:
```
# /data/01/ is the query root in this case
# prepare the directory used in the query
mkdir -p /data/01/2021/11/2021-11-23/
# download a Parquet file for Drill to query
curl https://raw.githubusercontent.com/apache/drill/master/sample-data/nation.parquet -o /data/01/2021/11/2021-11-23/data.parquet
# prepare the inner join directory
mkdir -p /data/PRO/item
# prepare an invalid Parquet file; this file is not supposed to be scanned by our query
mkdir -p /data/01/2010/01/2010-01-01/
echo "abc" > /data/01/2010/01/2010-01-01/data.parquet
# query the Drill endpoint with curl
json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM dfs.\`/data/01\` as t INNER JOIN dfs.\`/data/PRO/item\` item ON t.N_REGIONKEY = item.ID WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30' AND (REPEATED_CONTAINS(item.CATEGORIES,1031) OR REPEATED_CONTAINS(item.CATEGORIES,1047))\", \"autoLimit\":1}"
drill_host="localhost:8047"
curl -XPOST -H "Content-Type: application/json" "$drill_host/query.json" -d "$json"
```

**Expected behavior**
The query only selects `t.dir2 >= '2021-11-23' AND t.dir2 <= '2021-11-30'`, and the invalid file is under `dir2 = "2010-01-01"`, so Drill is expected to run the query without any error. Instead, it reports that data.parquet is not a Parquet file.

**Screenshots**
```
{
  "errorMessage" : "SYSTEM ERROR: RuntimeException: file:/data/01/2010/01/2010-01-01/data.parquet is not a Parquet file (too small length: 4)\n\n\nPlease, refer to logs for more information.\n\n[Error Id: ce4e61af-5df8-440e-81d2-673c89106e5f on drill-0.drill:31010]"
}
```

**Additional context**
Drill returns successfully if the query has no inner join:
```
# query the Drill endpoint with curl
json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM dfs.\`/data/01\` as t WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30'\", \"autoLimit\":1}"
drill_host="localhost:8047"
curl -XPOST -H "Content-Type: application/json" "$drill_host/query.json" -d "$json"
```
```
{
  "queryId" : "1e4ce295-3052-a66c-b68f-96cf4a97806d",
  "columns" : [ "EXPR$0" ],
  "rows" : [ {
    "EXPR$0" : "25"
  } ],
  "metadata" : [ "BIGINT" ],
  "queryState" : "COMPLETED",
  "attemptedAutoLimit" : 1
}
```
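
For illustration only, here is a minimal sketch of which files the `dir2` filter should leave in scope, assuming the year/month/date directory layout from the reproduction steps. `list_in_range` is a hypothetical helper written for this report, not part of Drill; it only mirrors the pruning the report expects and says nothing about how Drill plans the query.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Drill): list the Parquet files whose
# third-level directory (dir2 in the report's query) lies in [from, to].
# Assumes the layout <root>/<dir0>/<dir1>/<dir2>/<file>.parquet.
list_in_range() {
  local root="$1" from="$2" to="$3" f dir2
  find "$root" -mindepth 4 -maxdepth 4 -type f -name '*.parquet' | sort |
  while IFS= read -r f; do
    dir2="$(basename "$(dirname "$f")")"
    # ISO dates (YYYY-MM-DD) order correctly under plain string comparison.
    if [[ ! "$dir2" < "$from" ]] && [[ ! "$dir2" > "$to" ]]; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage with the paths from the report:
#   list_in_range /data/01 2021-11-23 2021-11-30
```

With the files from the reproduction steps, only the file under `2021-11-23` is inside the range, so the invalid file under `2010-01-01` should never need to be opened.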
