[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877893#comment-17877893 ]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2318972159

@rymarm, thanks for this fix. I'm a bit rusty on Drill, but let's see what I can sort out. This stuff is complex, so I'm going to throw a wall of text at you so we get on the same page.

First, some background. If memory serves, the Parquet reader was written quickly by an intern early on: it is so complex that few have summoned the effort to understand or improve it. So, digging into it is a classic "legacy code" experience.

In the early days, every single format reader created its own way to write data to vectors. This led to all manner of bugs (getting the code right once is hard, doing it a dozen times is impossible). It also resulted in inconsistent behavior. To solve these problems (and to avoid Drill's memory fragmentation issues of the time), we created EVF (the extended vector framework) to unify how we write to vectors and how we solve schema consistency issues for readers. EVF replaced the older column writer mechanisms. By now, all Drill readers _except Parquet_ are based on EVF.

However, the Parquet reader is special: it is the only one that still uses the original, fragile mechanisms. Because of that, it has its own way to solve schema issues, its own way to handle unprojected columns, and it still suffers from the bugs in those original mechanisms. It looks like your PR tries to fix some of these. In particular, it may be that you're trying to fix a bug in Parquet that EVF already solves for other readers. It would be great if your fix is consistent with EVF; I'll try to check this when I review the code. What we really need is for someone to take on a "weekend project" to rewrite the Parquet reader to use EVF so we have one complete schema mechanism rather than two inconsistent versions. (I can dream.)

Now let's look at the bug in question. You received a schema change error. This means that some operator in the DAG saw two different schemas for the same table. In particular, the SORT operator can't handle the case of, say, `(a: INT, b: VARCHAR, c: DOUBLE)` and `(a: INT, b: VARCHAR)` or `(a: INT, b: INT)`, even if our query only has `...ORDER BY a`. There was discussion about using `UNION` columns, or dynamically reformatting data, but all of that was far too complex (and slow) to actually build.

Most DB and query tools impose a schema at read time. That is, if we have files that have undergone schema evolution, as above, we work out at plan time that the common schema is, say, `(a: INT, b: VARCHAR, c: DOUBLE)`, with `c` being `OPTIONAL` (`NULLABLE` in standard DB parlance) since it appears in only some of the files. For example, Presto and Impala avoid the issue by reading into the correct common schema in the first place, rather than cleaning up the mess later in each downstream operator.

Of course, Drill's claim to fame is that it is schema-free: it works out the schema on the fly. That's all cool, except that Drill also supports operators that only work for a single schema. (Joins have the same issue.) That is, we promote a really cool feature, but we can't actually make it work. This is great marketing, but causes frustration in reality.
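To make that plan-time "common schema" idea concrete, here is a minimal, hypothetical sketch in plain Java (not Drill's planner API; the class, record, and method names are invented for illustration) of how a common schema could be computed from per-file schemas: column types must agree across files, and any column absent from at least one file is demoted to `OPTIONAL`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plan-time schema resolution, in the spirit of what
// Presto/Impala-style engines do before any file is read.
class CommonSchemaSketch {
  enum Mode { REQUIRED, OPTIONAL }
  record Column(String name, String type, Mode mode) {}

  static Map<String, Column> resolve(List<Map<String, Column>> fileSchemas) {
    Map<String, Column> common = new LinkedHashMap<>();
    // Union of all columns; a type conflict has no automatic resolution here.
    for (Map<String, Column> schema : fileSchemas) {
      for (Column col : schema.values()) {
        Column prior = common.putIfAbsent(col.name(), col);
        if (prior != null && !prior.type().equals(col.type())) {
          throw new IllegalStateException("Type conflict for column " + col.name());
        }
      }
    }
    // A column missing from any one file must become OPTIONAL (nullable),
    // since that file will have to fill it with NULLs.
    for (Map.Entry<String, Column> entry : common.entrySet()) {
      boolean everywhere =
          fileSchemas.stream().allMatch(s -> s.containsKey(entry.getKey()));
      if (!everywhere) {
        Column c = entry.getValue();
        entry.setValue(new Column(c.name(), c.type(), Mode.OPTIONAL));
      }
    }
    return common;
  }
}
```

Drill, lacking a plan-time schema, has to make the equivalent decision on the fly inside each reader, which is where the trouble described below starts.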
And, as befits any bit of software that has a few miles on its tires, Drill has multiple ways to work around the schema issue. There is a "provided schema" feature that tells readers the common schema; the reader's job is then to map the actual file columns into that schema. The provided schema is supported in EVF, but not, alas, by the Parquet reader. Also, the planner never got around to supporting schemas, so provided schemas only work in rather special cases. There is also a Drill Metastore feature that emulates the Hive Metastore (HMS), but it does not seem to be in common use.

The original way to manage schemas, for Parquet only, is to scan all the files and build up a JSON file that contains the union of all the schemas. I can't recall if this mechanism was supposed to provide type consistency, but it certainly could: we just look at the `c` column in all schemas of all Parquet files in a directory tree and work out a common type. We do this at plan time and send that type along to Parquet in the physical plan. I didn't work in that area, so my knowledge here is a bit light. It's worth a check.

Now, one might say, hold on, isn't there an easy answer? If we read file A and get one schema, then read file B and get another schema, can't we blend them on the fly? Indeed, in some cases this is possible, and EVF implements those cases. However, as I'm fond of saying, "Drill can't predict the future", so there are cases where we get it wrong. For example, the query asks for columns `(a, b, c)`. The first file has `(a, b)`, so we create a dummy column for `c` that will hold all `NULL`s. But of what type? Drill traditionally chooses `INT`. Then we read file 2, which has `(a, b, c: VARCHAR)`. Now we realize that `c` should have been `VARCHAR`, but we've already sent a batch of rows downstream with the "wrong" `INT` type. Plus, remember, Drill is distributed. Those two files might have been read by different fragments running on different machines. They can't communicate until the rows come together in the sort, after which it is too late to fix the problem.

One other tricky item to remember: Drill is fastest when columns are `REQUIRED` (i.e. non-nullable, or `NOT NULL`). This avoids the `NULL` check on each column operation. **BUT** Drill also allows the schema to vary, and we fill in missing columns with `NULL` values. So, unless we have some guarantee that column `c` actually exists in all files read by a query, we are pretty much forced to make `c` be `OPTIONAL` (nullable), even when reading a file that contains `c` as a `NOT NULL` column. Again, Drill can't predict the future, so we can't know whether some future reader (in, perhaps, some other fragment on some other machine) will read a file that doesn't have `c`.

There is a reason that SQL DBs, for 50+ years, have required a schema. It isn't that all those folks were backward, it is that SQL doesn't work without a common schema. (Apache Druid has a similar problem, but it solves it by treating all values as, essentially, untyped: the values change type on the fly as needed.) When faced with bugs of the kind seen here, it is important to sort out which are just "bad code" bugs and which are the "this design just can't work" bugs.

Now, with that background, we can try to sort out the problem you are trying to solve. Does your test case have Parquet files with differing column sets or types? If so, let's identify exactly how the schemas differ, then we can discuss possible solutions.
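To pin down the "dummy column" step in code, here is a hedged, self-contained sketch (plain Java; none of these names are actual Drill classes) of the decision a single reader has to make when a projected column is absent from the file it is reading. With no other information available, it fabricates a null-filled placeholder, and Drill's traditional guess for that placeholder's type is nullable `INT`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-file missing-column handling; illustrative only.
class MissingColumnSketch {
  record ColumnType(String sqlType, boolean nullable) {}

  /** Schema a reader would emit for one file, given the projected column names. */
  static Map<String, ColumnType> schemaForFile(Iterable<String> projected,
                                               Map<String, ColumnType> fileColumns) {
    Map<String, ColumnType> out = new LinkedHashMap<>();
    for (String name : projected) {
      ColumnType actual = fileColumns.get(name);
      if (actual != null) {
        // The column exists in this file: use its real type. A scan spanning
        // several files may still have to relax it to nullable, as noted above.
        out.put(name, actual);
      } else {
        // The column is missing: fabricate a null-filled placeholder.
        // Guessing nullable INT here is the "predicting the future" problem:
        // a later file may reveal the real type to be something else.
        out.put(name, new ColumnType("INT", true));
      }
    }
    return out;
  }
}
```

The guess is harmless only if the column, when it finally appears, really is an `INT`; otherwise downstream operators see a schema change.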
Looking at the specific line you pointed out, I'll hazard a guess as to what it is doing: it is the case we discussed above. Suppose our SQL is `SELECT a, b FROM 'my-dir'`, where `my-dir` is a directory that contains two files: `file1.parquet` contains column `(a: VARCHAR)` and `file2.parquet` contains `(a: VARCHAR, b: VARCHAR)`. Both are read by the same reader (the scan is not distributed in this case). File read order is random.

Suppose we read them in the order `(file2, file1)`. We can set up a schema of `(a: VARCHAR, b: VARCHAR)`. When we read `file1`, we notice that `b` is missing, but that, when it appeared previously, it was `VARCHAR`, so we keep that type and fill with nulls. (This is what EVF does; I'll speculate that Parquet's reader does something similar.) Great.

What if we read the other order? We see we want `(a, b)`, but we have only `(a: VARCHAR)`, so we create that vector. What about `b`? Well, we helpfully create `(b: OPTIONAL INT)`. That is exactly what the code that you pointed to does. When we read `file2`, we see that `b` has "changed schema" to `VARCHAR`, so we throw away the `(a: VARCHAR, b: INT)` schema and start reading with `(a: VARCHAR, b: VARCHAR)`. This then blows up the SORT.

Given this, I don't think the problem is with the column name (which is what that referenced code change handles). The code change in question handles a column name of the form `foo.a`, where `foo` is not a MAP with a member `a`, but where the dotted string is simply the (poorly chosen) name of a single column. That is, the problem is probably not that the test Parquet file columns are actually named `foo.a` and `foo.b` (with dots). You can try changing the code and rerunning the test, but I suspect that the problem won't go away unless you happen to run a test that reverses the file read order. Instead, the problem may be with the part of that code that did not change: Drill is trying to predict the future, and predicting that when the missing column appears, it will be `INT`. It may be that Drill is predicting incorrectly. We need the "crystal ball" module to improve the prediction. Sadly, however, we never got around to writing that module. Hence the multiple attempts to manage schemas globally.

Where does this leave us? If you can pin things down to one very specific case, we can sort out whether it is a "bad code" bug or a "that just won't work given Drill's design" bug. In particular, reading Parquet files with inconsistent schemas, projecting the inconsistent columns, and adding a SORT won't work unless the missing columns turn out to be of type `INT` when they appear. You can get lucky with file read order and work distribution so that, sometimes, it does work. Relying on luck produces flaky tests, however. On the other hand, you can have as much inconsistency as you want as long as the columns you project appear in all files and the types of those columns stay the same. Feel free to add as many other, inconsistent, columns as you like: just don't project them in queries with a SORT.

I'd suggest that, since Drill doesn't handle Parquet schema changes well (even though schema flexibility is supposed to be Drill's compelling feature), we maybe should not test stuff that can't actually work. Test with files with a consistent schema instead. Or, if files have an inconsistent schema, test with the Parquet schema feature enabled and after doing a scan of all the files. (I forget the command: `ANALYZE` maybe?)
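As a sanity check on the two read orders above, here is a small, hypothetical simulation (plain Java, mirroring the reasoning in this comment rather than any actual reader code) that applies the EVF-style rule of reusing the last type seen for a column that later goes missing, and falls back to the `INT` guess otherwise. The `(file2, file1)` order yields a stable schema; the `(file1, file2)` order produces the INT-then-VARCHAR change that blows up the SORT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of how the per-batch schema depends on file read order.
class ReadOrderSketch {
  static List<Map<String, String>> schemasSeen(List<Map<String, String>> files,
                                               List<String> projected) {
    Map<String, String> lastSeenType = new LinkedHashMap<>();  // column -> type
    List<Map<String, String>> emitted = new ArrayList<>();
    for (Map<String, String> file : files) {
      Map<String, String> batchSchema = new LinkedHashMap<>();
      for (String col : projected) {
        String type = file.get(col);
        if (type == null) {
          // Missing column: reuse a previously seen type if any, else guess INT.
          type = lastSeenType.getOrDefault(col, "INT");
        }
        lastSeenType.put(col, type);
        batchSchema.put(col, type);
      }
      emitted.add(batchSchema);
    }
    return emitted;  // Two differing entries mean a schema change downstream.
  }

  public static void main(String[] args) {
    Map<String, String> file1 = Map.of("a", "VARCHAR");                  // no column b
    Map<String, String> file2 = Map.of("a", "VARCHAR", "b", "VARCHAR");
    List<String> projected = List.of("a", "b");
    System.out.println(schemasSeen(List.of(file2, file1), projected));   // stable schema
    System.out.println(schemasSeen(List.of(file1, file2), projected));   // b: INT -> VARCHAR
  }
}
```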
It _may_ be that the Parquet schema feature unifies column types, but then, given the haste with which it was written way back when, maybe it doesn't. Or, to be more modern, test with the Drill Metastore enabled.

This stuff is horribly complex, and I may have missed the mark on this particular bug. But at least we're now on the same page.

> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
> *NOTE*: I worked on this issue along with DRILL-8508. It turned out that it required this bug to get fixed first. And since these two issues are about slightly different things, it was decided to report them separately. So, I'm going to link this issue as a requirement for that issue and open one PR for both (if it's allowed). I think a single PR would make it easier to review the code since the issues are quite related anyway.
>
> h3. Prerequisites
>
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. The field name for the vector is derived from the {{SchemaPath#toExpr}} method, which always encloses the resulting string in backticks.
>
> h3. Problems
>
> This causes some field name equality checks to go wrong (comparing the non-quoted and quoted forms of a field name returns false, but essentially is supposed to return true), which leads to errors.
>
> For example, the errors occur when you select a column from a table where some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
>
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
>
> Now let's try to use an operator that doesn't support schema change. For example, {{ORDER BY}}:
>
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] (state=,code=0)
> {code}
>
> The ORDER BY error clearly shows us that ``age`` is an extra column here and the incoming schema should only have the unquoted field to match the previous schema.
>
> Another example is with the {{UNION ALL}} operator:
>
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] (state=,code=0)
> {code}
>
> Again, the "different field counts" issue is caused by an extra quoted column that counts as a different field.
>
> h3. Solution
> The solution for these errors would be to replace the {{SchemaPath#toExpr}} call with {{SchemaPath#getAsUnescapedPath}}, which doesn't enclose the resulting string in backticks. Simple enough, but note that we used to use {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to {{SchemaPath#toExpr}}. The author even left a comment:
>
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
>
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, honestly, I don't really understand the motivation for this, even with the comment. I don't really understand what "existing maps" are here. Maybe someone from the community can help, but I will further consider it a mistake, simply because it causes the above problems.
>
> h3. Regressions
>
> The change brings some regressions detected by unit tests. The tests I found fail because they changed their execution flow after the fix in these places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
>
> Before the fix, the field name equality checks returned false when comparing quoted and unquoted field names. The failing tests relied on that behavior and worked only under that condition. But after the fix, when the two names compare equal and the check returns true, we fall into a different branch and the tests aren't ready for that.
>
> But obviously the change _is fixing the problem_, so the tests that relied on that problem should now be adjusted.
>
> Please see more technical details on each failed unit test in the linked PR.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)