paul-rogers commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2334853752
@ychernysh, thank you for the detailed explanation. It is impressive how well you understand this complex code.

**Scope**: We're clear now that this PR is not trying to handle conflicting types. Yes, in my example, I was suggesting that the person who creates the Parquet files manage column types themselves. There is, of course, another workaround that I did not mention: the SQL user can (I believe) insert the needed casts. Suppose that we have a set of files where the types differ for three columns, but we know a common type. We can coerce the types manually:

```sql
SELECT a, b, c
FROM (SELECT CAST(a AS DOUBLE) AS a, CAST(b AS VARCHAR) AS b, CAST(c AS INT) AS c
      FROM myParquet)
ORDER BY a, b
```

The above should insert that PROJECT I mentioned. In an ideal world, Drill would figure this out from the Parquet metadata. As we said, this can be left as a future project for another time.

**Wildcard Query**: Ideally, if I have a Parquet file with columns `a` and `b`, and `b` is missing in some files, the following two queries should work identically:

```sql
SELECT a, b FROM (SELECT * FROM myParquet) ORDER BY a, b
SELECT a, b FROM myParquet ORDER BY a, b
```

That is, it should not matter how we learn that we will scan column `b`: if it is missing, it should have a consistent type after this fix.

The code you mentioned reflects the original schema-on-read design: the wildcard is expanded for each row group at run time. This is one reason I was surprised that we gather the schema at plan time. Now that it is clear that Parquet does have the schema at plan time, we can work out the union of all columns from all files at plan time, sort out the types of missing columns, and then tell readers that `SELECT *` expands not just to all the columns in that particular row group, but to the union of all the columns (a rough sketch of the merge step follows below).

It is clear that we've got a mess. Drill started as schema-on-read. Then, it was found that, for Parquet (only), we need the schema at plan time. But the folks who added that code didn't fully think through the design. The result is a huge muddle that you are now starting to sort out.

_Suggestion_: let's leave proper wildcard expansion to another time. You are fixing this bug for a reason: for some use case. If your use case does not use wildcard queries, then it is safe to defer this issue until someone actually needs a fix.
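For whoever does eventually pick up the wildcard work, the plan-time merge could look roughly like the following. This is only a minimal sketch against Drill's `TupleMetadata` API, not a proposed implementation; the `fileSchemas` input and the first-type-wins conflict handling are my assumptions:

```java
import java.util.List;

import org.apache.drill.exec.record.metadata.ColumnMetadata;
import org.apache.drill.exec.record.metadata.TupleMetadata;
import org.apache.drill.exec.record.metadata.TupleSchema;

public class WildcardSchemaUnion {

  /**
   * Sketch: merge the per-file Parquet schemas into the union of all
   * columns, so that SELECT * expands to the same column set for every
   * reader. A real implementation would reconcile conflicting types
   * (or fail cleanly); here we simply keep the first type seen.
   */
  public static TupleMetadata unionOf(List<TupleMetadata> fileSchemas) {
    TupleSchema union = new TupleSchema();
    for (TupleMetadata fileSchema : fileSchemas) {
      for (ColumnMetadata col : fileSchema) {
        if (union.metadata(col.name()) == null) {
          union.addColumn(col.copy());
        }
      }
    }
    return union;
  }
}
```

Each reader's missing columns would then be the union minus that file's own schema, which ties into the Untyped Null discussion below.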
**Reading non-nullable parquet column into a nullable vector**: Thanks for ensuring we set the null vector correctly. Sounds like this part is good.

**Passing type information to readers**: I saw your fix. That is why I mentioned that we now have two lists of columns given to the reader:

```java
rowGroupScan.getColumns(), // Columns as SchemaPath
... // each parquet SubScan shares the same table schema constructed by a GroupScan
rowGroupScan.getSchema()); // Same list of columns as above, but as TupleMetadata?
```

As a reader, I have to try to understand: are the two column lists the same? Is the order the same? Is the `TupleMetadata` version in a 1:1 relationship with the `getColumns()` list? If not, what are the differences? You are adding code to an existing implementation, and so you want to avoid changing things any more than necessary. Having redundant lists is messy, but probably the simplest fix.

_Suggestion_: maybe just add a comment about the assumed relationship between the two lists.

**UntypedNull (Part 1)**: Thanks for the detailed explanation. I appreciate the time you've put into fully understanding the convoluted logic.

When faced with complex legacy code, I find it helpful to ask, _what is this trying to do?_ The code itself is the ultimate truth, and we have to start there. But, to sort out what should be happening, we have to work out the developer's intent, and figure out whether they made a mistake or omitted some important condition. You pointed out that we do two checks. In the [first one](https://github.com/apache/drill/blob/11aaa3f89cb85f7ef63e6cc5416c6b2f90e8322c/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206-L208):

```java
public static TypedFieldId getFieldId(ValueVector vector, int id, SchemaPath expectedPath, boolean hyper) {
  if (!expectedPath.getRootSegment().getPath().equalsIgnoreCase(vector.getField().getName())) {
    return null;
  }
```

This code says: here is a `ValueVector` and an expected `SchemaPath`; let's make sure that the vector actually is for the given schema path by checking the `MaterializedField` for the vector. In the case where the name was corrupted with backticks, this check failed: we have a vector with the name `foo`, but the expected path is `` `foo` ``. So, no match.

The core question is: what use case is this meant to handle? I can speculate that there are two possible cases. First is the top-level fields. For top-level fields, the names should always match: by the time we get here, we should have created any needed "dummy" top-level vectors. You correctly fixed a case where the top level did not match. I speculate that this code is meant to handle a second case: a column within a `MAP` column. Suppose the Parquet file has a map field `m` that contains two columns, `a` and `b`. The query, however, asks to project `m.c`, which does not exist. Perhaps this bit of code handles that map case.

_Suggestion_: your fix is probably fine. Please check whether we have a test for the map case above. If we don't, consider adding one, just so we verify that this PR doesn't break anything. (A rough test sketch appears at the end of this comment.)

**UntypedNull (Part 2)**: Next, let's understand what _should_ happen if a column is missing. We have four cases:

1. The column `c` exists in _none_ of the Parquet files.
2. The column `c` exists in _some_ of the Parquet files.
3. The column `m.c` (a column within a map) exists in _some_ of the Parquet files.
4. The column `m.c` exists in _none_ of the Parquet files.

Your fix handles case 2: we will now correctly use the type of existing columns. So, we're good here.

Case 1 is where the Nullable `INT`/Untyped NULL question arises. I agree with you that we should default the column type to Untyped NULL. It is odd that we used Nullable `INT` in one case, and Untyped NULL in another case. Doing so makes no sense to a user. Can we modify the code so we always use Untyped NULL in this case? We would detect case 1 in the planner, and mark the corresponding column with the Untyped Null type. (I hope `TupleSchema` handles this type! I can't recall ever testing it.) The Parquet reader would then know to use Untyped Null when creating the dummy vector (a sketch follows below). This is based on what _should_ happen; the devil is in the details, so please check whether this suggestion can actually work.
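To make the case-1 idea concrete, here is roughly what the reader side might do with a plan-time Untyped Null column. The `createDummyVector` helper is hypothetical; the point is only that Drill already has `MinorType.NULL` and `UntypedNullVector` to express this:

```java
import org.apache.drill.common.types.TypeProtos.MajorType;
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.vector.UntypedNullVector;
import org.apache.drill.exec.vector.ValueVector;

public class MissingColumnSketch {

  // Hypothetical helper: the planner determined that column `name` exists
  // in no file (case 1), so the reader creates an Untyped Null dummy
  // vector instead of the historical nullable-INT one.
  static ValueVector createDummyVector(String name, BufferAllocator allocator) {
    MajorType untypedNull = Types.optional(MinorType.NULL);
    MaterializedField field = MaterializedField.create(name, untypedNull);
    return new UntypedNullVector(field, allocator);
  }
}
```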
Case 3 is rather special. EVF has some rather complex logic to handle missing map columns (to any depth). Your fix relies on code in the planner to work out the type for case 2 (top-level columns). Does that code handle nested columns? If so, we just do the same as we do for case 2. However, if the planner code treats each map as a single column (does not look inside), then maybe we just leave the existing code to do whatever it already does in this case.

Case 4 depends on what we do in cases 1 and 3. The "right" answer is for the missing column to be of type Untyped NULL. But this case is obscure, so we can leave it to do whatever it currently does.

_Suggestion_: handle case 1 as described above, and just test that cases 3 and 4 still work however they used to work (a rough test sketch follows).
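For the map test mentioned under Part 1 (and for checking cases 3 and 4 here), something along these lines might do, using Drill's `ClusterTest` fixture. The test file path, its row count, and the assertion are placeholders I made up; the point is only to pin down that projecting a missing map member neither fails nor silently changes behavior:

```java
import static org.junit.Assert.assertEquals;

import org.apache.drill.test.ClusterFixture;
import org.apache.drill.test.ClusterTest;
import org.apache.drill.test.QueryBuilder.QuerySummary;
import org.junit.BeforeClass;
import org.junit.Test;

public class TestMissingMapColumn extends ClusterTest {

  // Placeholder: row count of the hypothetical test file.
  private static final long EXPECTED_ROWS = 4;

  @BeforeClass
  public static void setup() throws Exception {
    startCluster(ClusterFixture.builder(dirTestWatcher));
  }

  // Hypothetical test file: a Parquet file whose map `m` contains only
  // members `a` and `b`; the query projects the non-existent `m.c`.
  @Test
  public void testMissingMapMember() throws Exception {
    String sql = "SELECT t.m.c FROM cp.`parquet/map_ab.parquet` t";
    QuerySummary summary = client.queryBuilder().sql(sql).run();
    // For now, only assert that the query succeeds and returns every row;
    // m.c should come back as all NULLs with whatever type the current
    // code assigns.
    assertEquals(EXPECTED_ROWS, summary.recordCount());
  }
}
```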