ychernysh commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2321094956

   @paul-rogers 
   I have tested the fix as follows. I have a Hadoop/Drill cluster with 2 nodes, each running a DataNode and a Drillbit, and a `dfs.tmp.people` table consisting of 100 parquet files, each with 10 or more row groups, with the following schemas:
   ```
   # case 1:
   /tmp/people/{0..49}.parquet:    id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
   /tmp/people/{50..99}.parquet:   id<INT(REQUIRED)>
   # case 2:
   /tmp/people/{50..99}.parquet:   id<INT(REQUIRED)>
   /tmp/people/{100..149}.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
   ```
   The files are spread evenly across both DataNodes, and when I run a query I can see (in each Drillbit's logs and in the query profile) that Drill reads them in parallel on the 2 Drillbits. I ran the following queries:
   ```
   SELECT age FROM dfs.tmp.people ORDER BY age;
   SELECT name FROM dfs.tmp.people ORDER BY name;
   SELECT age FROM dfs.tmp.people UNION ALL (VALUES(1));
   SELECT age FROM dfs.tmp.people UNION (VALUES(1));
   SELECT name FROM dfs.tmp.people UNION (VALUES ('Bob'));
   SELECT name FROM dfs.tmp.people UNION ALL (VALUES ('Bob'));
   ```
   They all succeed; without the fix, Drill would fail. Note that all 3 of the solutions provided in this PR are needed to make all of them pass:
   1. Solve the naming problem
   2. Set the correct minor type
   3. Set the correct data mode
   
   The main idea of the [DRILL-8508](https://issues.apache.org/jira/browse/DRILL-8508) solution is that, since we scan the footers of all the parquet files to be read back at the planning phase in the Foreman, we already know at that point which columns are (partially) missing and which are not. Knowing at planning time that `file1.parquet` contains `(a: VARCHAR:REQUIRED)` and `file2.parquet` has no `a` column at all, we can tell reader 1 to put column `a` into a nullable vector (rather than a `REQUIRED` one) and reader 2 to create a missing-column vector of type `VARCHAR` (rather than defaulting to `INT`). And since we have this information before any of the readers starts to actually read, it does not matter which file is read first, so this solution is order-agnostic.
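
   To make the idea concrete, here is a minimal, self-contained sketch of the planning-time merge and the per-reader instructions it yields. This is not Drill's actual code; all class and field names are illustrative, and Drill's real minor types and data modes are reduced to plain strings and booleans.
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class FooterSchemaMergeSketch {

     // Simplified column descriptor: minor type (e.g. VARCHAR, INT) plus
     // whether the column is nullable (OPTIONAL) or required (REQUIRED).
     record Column(String name, String type, boolean nullable) {}

     public static void main(String[] args) {
       // file1.parquet declares `a` as a REQUIRED VARCHAR; file2.parquet has no `a` at all.
       Map<String, Column> file1 = Map.of("a", new Column("a", "VARCHAR", false));
       Map<String, Column> file2 = Map.of();
       List<Map<String, Column>> footers = List.of(file1, file2);

       // Planning-time merge: a column that is OPTIONAL in any file, or absent
       // from any file, must be OPTIONAL in the merged ("table") schema.
       // Real type promotion/conflicts are out of scope for this sketch.
       Map<String, Column> merged = new LinkedHashMap<>();
       for (Map<String, Column> footer : footers) {
         for (Column c : footer.values()) {
           merged.merge(c.name(), c, (oldC, newC) ->
               new Column(oldC.name(), oldC.type(), oldC.nullable() || newC.nullable()));
         }
       }
       for (Map<String, Column> footer : footers) {
         for (Map.Entry<String, Column> e : merged.entrySet()) {
           if (!footer.containsKey(e.getKey())) {
             Column c = e.getValue();
             e.setValue(new Column(c.name(), c.type(), true));
           }
         }
       }

       // Per-reader instructions derived from the merged schema: each reader
       // knows the type and mode to use before it reads a single row, so the
       // result does not depend on which file happens to be read first.
       for (int i = 0; i < footers.size(); i++) {
         Map<String, Column> footer = footers.get(i);
         for (Column c : merged.values()) {
           if (!footer.containsKey(c.name())) {
             System.out.printf("reader %d: create missing-column vector `%s` as OPTIONAL %s%n",
                 i + 1, c.name(), c.type());
           } else if (c.nullable() && !footer.get(c.name()).nullable()) {
             System.out.printf("reader %d: read `%s` into an OPTIONAL %s vector%n",
                 i + 1, c.name(), c.type());
           }
         }
       }
     }
   }
   ```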
   
   **Note about tests**: Unit tests for the fix, however, require a specific file read order. This is because some operators, such as `UNION ALL`, build their own output schema based on [the first prefetched batches from the left and right inputs](https://github.com/apache/drill/blob/11aaa3f89cb85f7ef63e6cc5416c6b2f90e8322c/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/union/UnionAllRecordBatch.java#L89). Here it does matter which file is read first, since that defines the `UNION ALL` output schema. So the unit tests aim to enforce a read order in which the query would have failed without the fix but succeeds with it.
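
   To make that order dependence concrete, here is a toy illustration (not Drill code; the class and method names are made up) of an operator that, like `UNION ALL` with its prefetched batches, freezes its output schema from the first batch it sees and then rejects later batches whose column types disagree:
   ```java
   import java.util.Map;

   public class FirstBatchSchemaSketch {

     // column name -> type, frozen by the first received batch
     private static Map<String, String> outputSchema;

     static void receiveBatch(Map<String, String> batchSchema) {
       if (outputSchema == null) {
         outputSchema = batchSchema;   // first prefetched batch defines the output schema
       } else if (!outputSchema.equals(batchSchema)) {
         throw new IllegalStateException(
             "schema change not supported: " + outputSchema + " vs " + batchSchema);
       }
     }

     public static void main(String[] args) {
       // Without the fix, a reader that finds no `name` column in its file would
       // default it to a nullable INT; if that batch happens to arrive first,
       // the later VARCHAR batch triggers the failure.
       receiveBatch(Map.of("name", "INT(OPTIONAL)"));
       receiveBatch(Map.of("name", "VARCHAR(OPTIONAL)"));  // throws
     }
   }
   ```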

