zcattacz opened a new issue, #2982:
URL: https://github.com/apache/drill/issues/2982

   **Describe the bug**
   Possible bug aspects:
   - Drill seems to randomly reject this simple query; if I keep retrying, it eventually succeeds without any change to the source parquet files or the query.
   - The query's `where` clause already excludes the parquet files known to lack the required columns, yet Drill rejects the query on random files, disregarding the `where` clause.
   
   What happened:
   Drill is used in a SCADA data processing pipeline. The SCADA data is a columnar table to which more and more channels (columns with numeric IDs and float data) are added over time. Each 24-hour period is dumped into a separate parquet file (e.g. m250131.parquet), and all files are stored in the same folder `d:/datarepo/fix1/`.
   
   Because later files have more columns than older files, I use an initial query to filter the raw data into a separate temporary table, and then work only on that temporary table.
   
   ```
   drop table if exists dfs.ds.metric_lines_raw;
   create table dfs.ds.metric_lines_raw  as
   select index,
    `107`, `207`, `307`, `407`, `507`, `607`, `707`, `807`, `907`, `1007`,
    `1107`,`1207`,`1307`,`1407`,`1507`,`1607`,`1707`,`1807`,`1907`,`2007`,
    `2107`,`2207`,`2307`,`2407`,
    `10102`, `10202`, `10302`, `10402`, `10502`, `10602`, `10702`, `10802`, 
`10902`, `11002`, 
    `11102`, `11202`, `11302`, `11402`, `11502`, `11602`, `11702`, `11802`, 
`11902`, `12002`, 
    `12202`, `12302`, `12402`
   from (select * from dfs.datarepo.`fix1` where `filename` like 'm25%')
   where `filename` like 'm2501%' or `filename` like 'm2502%' or `filename` 
like 'm2503%'
   ;
   ```
   I try to keep the query simple, but Drill randomly rejects it, with or without the subquery.
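   A sketch of a possible workaround, assuming Drill's support for file globs in the table path (untested against this dataset; the column list is abbreviated here): globbing the file names in the `from` clause should keep the planner from enumerating the older m21* files at all, instead of relying on the `filename` filter.
   ```
   -- Hypothetical workaround: restrict the scan via a glob in the table path
   -- so only m25* files are ever opened.
   create table dfs.ds.metric_lines_raw as
   select index,
    `107`, `207`, `307`  -- ... rest of the column list as above
   from dfs.datarepo.`fix1/m25*.parquet`;
   ```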
   
   **Expected behavior**
   For a simple select query, Drill should respect the `where` clause and not probe irrelevant files.
   For a simple query, even if the parquet files have different numbers of columns, Drill is expected to proceed without issue as long as the required columns exist and are valid.
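   
   The expected pruning behavior can be modeled in a few lines (a toy sketch, not Drill's actual planner code; the file names and `files_to_scan` helper are illustrative assumptions):
   
   ```python
   # Toy model of the expected behavior: files excluded by the filename
   # filter should never be opened, so their missing columns cannot matter.
   import fnmatch
   
   def files_to_scan(files, pattern):
       """Return only the files matching the filename filter."""
       return [f for f in files if fnmatch.fnmatch(f, pattern)]
   
   files = ["m210103.parquet", "m210520.parquet", "m250131.parquet"]
   print(files_to_scan(files, "m25*"))  # the m21* files are never probed
   ```
   
   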
   
   **Error detail, log output or screenshots**
   ```
   An error occurred when executing the SQL command:
   create table dfs.ds.metric_lines_raw  as
   select index,
    `107`, `207`, `307`, `407`, `507`, `607`, `707`, `807`, `907`, `1007`,
    `1107`,`1207`,`1307`,`...
   
   DATA_READ ERROR: Exception occurred while reading from disk.
   
   File: d:/datarepo/fix1/m210520.parquet
   Column: 11102
   Row Group Start: 775564
   File: d:/datarepo/fix1/m210520.parquet
   Column: 11102
   Row Group Start: 775564
   Fragment: 1:2
   
   [Error Id: f75abdf9-dad7-48c3-b047-763d3e2a5edc on TVSSRV02:31010]
   1 statement failed.
   
   Execution time: 9.07s
   
   --- retry ---
   ....
   
   File: d:/datarepo/fix1/m210103.parquet
   Column: 107
   Row Group Start: 292
   File: d:/datarepo/fix1/m210103.parquet
   Column: 107
   Row Group Start: 292
   Fragment: 1:1
   
   [Error Id: b96f48a5-8eaf-46e1-ad26-760946e5aba1 on TVSSRV02:31010]
   1 statement failed.
   
   Execution time: 8.67s
   
   -- retry ---
   ...
   File: d:/datarepo/fix1/m210212.parquet
   Column: 10502
   Row Group Start: 49816
   File: d:/datarepo/fix1/m210212.parquet
   Column: 10502
   Row Group Start: 49816
   Fragment: 1:1
   
   [Error Id: a38bd17c-9ee5-4fef-8dd9-9673e95eae1b on TVSSRV02:31010]
   1 statement failed.
   
   Execution time: 8.81s
   
   ...
   ```
   
   **Drill version**
   Drill 1.21.1
   LibreJDK jdk-17.0.14
   Windows 2019
   SQLWorkbench/J with org.apache.drill.jdbc driver
   
   **Additional context**
   Server with >10 GB RAM, >80 GB disk, and 4 cores; a resource issue is unlikely.
   Total parquet data: 3.5 GB
   Parquet data expected in the query (m25*): 256 MB
   

