johnnystargazer opened a new issue #2402:
URL: https://github.com/apache/drill/issues/2402


   **Describe the bug**
   Drill scan all the parquet file from query root for metadata if there is a 
"inner join " in query. 
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   ```
   # /data/01/ is the query root in this case
   # prepare the directory for used in query
   mkdir -p /data/01/2021/11/2021-11-23/
   # download a parquet file fil for drill to query
   curl 
https://raw.githubusercontent.com/apache/drill/master/sample-data/nation.parquet
 -o /data/01/2021/11/2021-11-23/data.parquet
   
   # prepre inner join directory ,
   mkdir -p /data/PRO/item
   # we prepare a invalid parquet file, this file is not supposed to be scan 
when our query
   mkdir -p /data/01/2010/01/2010-01-01/
   echo "abc" > /data/01/2010/01/2010-01-01/data.parquet
   
   
   
   # query drill endpoint by curl
   json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM  
dfs.\`/data/01\` as t INNER JOIN dfs.\`/data/PRO/item\` item  ON t.N_REGIONKEY 
= item.ID WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30' AND 
(REPEATED_CONTAINS(item.CATEGORIES,1031) OR 
REPEATED_CONTAINS(item.CATEGORIES,1047))\", \"autoLimit\":1}"
   drill_host="localhost:8047"
   curl -XPOST  -H "Content-Type: application/json" "$drill_host/query.json" -d 
"$json"
   
   ```
   **Expected behavior**
   As we only query  t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30'  , and 
invalite file is under dir2="2010-01-01" ,
   the expected behavior is drill perform query without any error, but it it 
return data.parquet is not a Parquet file
   
   **Screenshots**
   ```
   {
     "errorMessage" : "SYSTEM ERROR: RuntimeException: 
file:/data/01/2010/01/2010-01-01/data.parquet is not a Parquet file (too small 
length: 4)\n\n\nPlease, refer to logs for more information.\n\n[Error Id: 
ce4e61af-5df8-440e-81d2-673c89106e5f on drill-0.drill:31010]"
   }
   ```
   
   
   **Additional context**
   Drill return successfully if no inner join in query
   ```
   # query drill endpoint by curl
   json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM  
dfs.\`/data/01\` as t WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30'\", 
\"autoLimit\":1}"
   drill_host="localhost:8047"
   curl -XPOST  -H "Content-Type: application/json" "$drill_host/query.json" -d 
"$json"
   ```
   
   ```
    {
     "queryId" : "1e4ce295-3052-a66c-b68f-96cf4a97806d",
     "columns" : [ "EXPR$0" ],
     "rows" : [ {
       "EXPR$0" : "25"
     } ],
     "metadata" : [ "BIGINT" ],
     "queryState" : "COMPLETED",
     "attemptedAutoLimit" : 1
   }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to