paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2334853752

   @ychernysh, thank you for the detailed explanation. It is impressive how 
well you understand this complex code. 
   
   **Scope**: We're clear now that this PR is not trying to handle conflicting 
types. Yes, in my example, I was suggesting that the person who creates the 
Parquet files manage column types themselves. There is, of course, another 
workaround that I did not mention: the SQL user can (I believe) insert the 
needed casts. Suppose that we have a set of files where the types differ for
three columns, but we know a common type. We can coerce the types manually:
   
   ```sql
   SELECT a, b, c
   FROM (SELECT CAST(a AS DOUBLE) AS a, CAST(b AS VARCHAR) AS b, CAST(c AS INT) AS c
         FROM myParquet)
   ORDER BY a, b
   ```
   
   The above should insert that PROJECT I mentioned. In an ideal world, Drill 
would figure this out from the Parquet metadata. As we said, this can be left 
as a future project.
   
   **Wildcard Query**: Ideally, if I have a Parquet file with columns `a` and 
`b`, and `b` is missing in some files, the following two queries should work 
identically:
   
   ```sql
   SELECT a, b FROM (SELECT * FROM myParquet) ORDER BY a, b
   
   SELECT a, b FROM myParquet ORDER BY a, b
   ```
   
   That is, it should not matter how we learn we will scan column `b`: if it is 
missing, it should have a consistent type after this fix.
   
   The code you mentioned reflects the original schema-on-read design: the 
wildcard is expanded for each row group at run time. This is one reason I was 
surprised that we gather the schema at plan time. Now that it is clear that 
Parquet does have the schema at plan time, we can work out the union of all 
columns from all files at plan time. We can sort out the types of missing 
columns. We can then tell readers that `SELECT *` expands to not just all the 
columns in that particular row group, but to the union of all the columns.
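
   For instance, a rough sketch of that plan-time union (the helper name and 
the `List<TupleMetadata>` input are illustrative, not the actual `GroupScan` 
API):
   
   ```java
   import java.util.List;
   import org.apache.drill.exec.record.metadata.ColumnMetadata;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   import org.apache.drill.exec.record.metadata.TupleSchema;
   
   // Hypothetical helper: merge per-file schemas into one plan-time schema.
   static TupleMetadata unionOfFileSchemas(List<TupleMetadata> fileSchemas) {
     TupleSchema union = new TupleSchema();
     for (TupleMetadata fileSchema : fileSchemas) {
       for (ColumnMetadata col : fileSchema) {
         if (union.index(col.name()) == -1) {
           union.addColumn(col); // first file to mention the column wins the type
         }
         // else: conflicting types across files are out of scope for this PR
       }
     }
     return union;
   }
   ```
   
   `SELECT *` would then expand to this union, so a row group that lacks a 
column still creates a vector of the agreed-upon type.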
   
   It is clear that we've got a mess. Drill started as schema-on-read. Then it 
was found that, for Parquet (only), we need the schema at plan time. But the 
folks who added that code didn't fully think through the design. The result is 
a huge muddle that you are now starting to sort out.
   
   _Suggestion_: let's leave proper wildcard expansion to another time. You are 
fixing this bug for a reason: for some use case. If your use case does not use 
wildcard queries, then it is safe to defer this issue until someone actually 
needs a fix.
   
   **Reading a non-nullable Parquet column into a nullable vector**: Thanks for 
ensuring we set the null vector correctly. Sounds like this part is good.
   
   **Passing type information to readers**: I saw your fix. That is why I 
mentioned that we now have two lists of columns given to the reader:
   
   ```java
        rowGroupScan.getColumns(), // columns as a list of SchemaPath
        ...
        // each Parquet SubScan shares the same table schema constructed by a GroupScan
        rowGroupScan.getSchema()); // same list of columns as above, but as TupleMetadata?
   ```
   
   As a reader, I have to try to understand: are the two column lists the same? 
Is the order the same? Does the `TupleMetadata` version have a 1:1 relationship 
with the `getColumns()` list? If not, what are the differences?
   
   You are adding code to an existing implementation, and so you want to avoid 
changing things any more than necessary. Having redundant lists is messy, but 
probably the simplest fix.
   
   _Suggestion_: Maybe just add a comment about the assumed relationship 
between the two lists.
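
   For example, something along these lines at that call site (the wording is 
only a sketch; the actual invariant is yours to confirm):
   
   ```java
   // NOTE: the two arguments below describe the same projection list.
   // getColumns() holds the projected columns as SchemaPath; getSchema() is
   // the table schema built by the GroupScan from the union of file footers.
   // Every SchemaPath is assumed to resolve to exactly one column in the
   // schema, though the order of the two lists need not match.
   rowGroupScan.getColumns(),
   ...
   rowGroupScan.getSchema());
   ```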
   
   **UntypedNull (Part 1)**: Thanks for the detailed explanation. I appreciate 
the time you've put into fully understanding the convoluted logic.
   
   When faced with complex legacy code, I find it helpful to ask, _what is this 
trying to do_? The code itself is the ultimate truth, and we have to start 
there. But, to sort out what should be happening, we have to work out the 
developer's intent, and figure out if they made a mistake or omitted some 
important condition.
   
   You pointed out that we do two checks. In the [first one](https://github.com/apache/drill/blob/11aaa3f89cb85f7ef63e6cc5416c6b2f90e8322c/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206-L208):
   
   ```java
     public static TypedFieldId getFieldId(ValueVector vector, int id,
         SchemaPath expectedPath, boolean hyper) {
       if (!expectedPath.getRootSegment().getPath()
           .equalsIgnoreCase(vector.getField().getName())) {
         return null;
       }
   
   This code says: here is a `ValueVector` and an expected `SchemaPath`. Let's 
make sure that the vector actually is for the given schema path by checking the 
`MaterializedField` for the vector. In the case where the name was corrupted 
with backticks, this check failed: we have a vector with the name `foo`, but 
the expected path is `` `foo` ``. So, no match.
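
   To make the failure concrete, a minimal sketch of the comparison (the 
values are illustrative):
   
   ```java
   String vectorName = "foo";     // vector.getField().getName(): the clean name
   String expectedName = "`foo`"; // root segment path that kept its backticks
   // The guard above then rejects the match, so getFieldId() returns null:
   boolean matches = expectedName.equalsIgnoreCase(vectorName); // false
   ```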
   
   The core question is: what use case is this meant to handle? I can speculate 
that there are two possible cases.
   
   The first is top-level fields. For top-level fields, the names should always 
match. By the time we get here, we should have created any needed "dummy" 
top-level vectors. You correctly fixed a case where the top level did not match.
   
   I speculate that this code is meant to handle a second case: a column within 
a `MAP` column. Suppose the Parquet file has a map field `m` that contains two 
columns, `a` and `b`. The query, however, is asking to project `m.c` which does 
not exist. Perhaps this bit of code handles that map case.
   
   _Suggestion_: your fix is probably fine. Please check if we have a test for 
the map case above. If we don't, consider adding one, just so we verify that 
this PR doesn't break anything.
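
   If no such test exists, it might look roughly like this (the dataset path 
and expected baseline are hypothetical; this assumes the usual `testBuilder()` 
harness and a Parquet map `m` holding only `a` and `b`):
   
   ```java
   @Test // hypothetical test: project a map member that exists in no file
   public void testProjectMissingMapMember() throws Exception {
     testBuilder()
         .sqlQuery("SELECT t.m.c FROM dfs.`parquet/map_ab` t")
         .unOrdered()
         .baselineColumns("EXPR$0")
         .baselineValues((Object) null) // expect a null (dummy) column
         .go();
   }
   ```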
   
   **UntypedNull (Part 2)**: Next, let's understand what _should_ happen if a 
column is missing. We have four cases:
   
   1. The column `c` exists in _none_ of the Parquet files.
   2. The column `c` exists in _some_ of the Parquet files.
   3. The column `m.c` (a column within a map) exists in _some_ of the Parquet files.
   4. The column `m.c` exists in _none_ of the Parquet files.
    
   Your fix handles case 2: we will now correctly use the type of existing 
columns. So, we're good here.
   
   Case 1 is where the Nullable `INT`/Untyped NULL question arises. I agree 
with you that we should default the column type to Untyped NULL. It is odd that 
we used Nullable `INT` in one case, and Untyped NULL in another case. Doing so 
makes no sense to a user. Can we modify the code so we always use Untyped NULL 
in this case? We would detect case 1 in the planner, and mark the corresponding 
column with the Untyped Null type. (I hope `TupleSchema` handles this type! I 
can't recall ever testing it.) The Parquet reader would then know to use the 
Untyped Null when creating the dummy vector. This is based on what _should_ 
happen; the devil is in the details, so please check if this suggestion can 
actually work.
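
   Roughly, the planner-side marking might look like this (a sketch only; 
whether `TupleSchema` round-trips `MinorType.NULL` cleanly is exactly the 
thing to verify):
   
   ```java
   import org.apache.drill.common.types.TypeProtos.MinorType;
   import org.apache.drill.common.types.Types;
   import org.apache.drill.exec.record.MaterializedField;
   
   // Case 1: column `c` appears in no file, so mark it as Untyped Null in the
   // plan-time schema handed to the readers. MinorType.NULL is the type behind
   // UntypedNullVector.
   MaterializedField missing =
       MaterializedField.create("c", Types.optional(MinorType.NULL));
   tupleSchema.add(missing); // tupleSchema: the plan-time TupleSchema (assumed)
   ```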
   
   Case 3 is rather special. EVF has some rather complex logic to handle 
missing map columns (to any depth). Your fix relies on code in the planner to 
work out the type for case 2 (top-level columns). Does that code handle nested 
columns? If so, we just do the same as we do for case 2. However, if the 
planner code treats all maps as a single column (does not look inside), then 
maybe we just leave the existing code to do whatever it already does in this 
case.
   
   Case 4 depends on what we do in cases 1 and 3. The "right" answer is for the 
missing column to be of type Untyped NULL. But, this case is obscure, so we can 
leave it to do whatever it currently does.
   
   _Suggestion_: handle case 1 as described above, and just test that cases 3 
and 4 still work however they used to work.

