[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877893#comment-17877893 ]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2318972159

@rymarm, thanks for this fix. I'm a bit rusty on Drill, but let's see what I can sort out. This stuff is complex, so I'm going to throw a wall of text at you so we get on the same page.

First, some background. If memory serves, the Parquet reader was written quickly by an intern early on: it is so complex that few have summoned the effort to understand or improve it. So, digging into it is a classic "legacy code" experience.

In the early days, every single format reader created its own way to write data to vectors. This led to all manner of bugs (getting the code right once is hard, doing it a dozen times is impossible). It also resulted in inconsistent behavior. To solve these problems (and to avoid Drill's memory fragmentation issues of the time), we created EVF (the extended vector framework) to unify how we write to vectors and how we solve schema consistency issues for readers. EVF replaced the older column writer mechanisms. By now, all Drill readers _except Parquet_ are based on EVF.

However, the Parquet reader is special: it is the only one that still uses the original, fragile mechanisms. Because of that, it has its own way to solve schema issues, its own way to handle unprojected columns, and it still suffers from the bugs in those original mechanisms. It looks like your PR tries to fix some of these. In particular, it may be that you're trying to fix a bug in Parquet that EVF already solves for other readers. It would be great if your fix is consistent with EVF; I'll try to check this when I review the code. What we really need is for someone to take on a "weekend project" to rewrite the Parquet reader to use EVF so we have one complete schema mechanism rather than two inconsistent versions. (I can dream.)

Now let's look at the bug in question. You received a schema change error. This means that some operator in the DAG saw two different schemas for the same table. In particular, the SORT operator can't handle the case of, say, `(a: INT, b: VARCHAR, c: DOUBLE)` and `(a: INT, b: VARCHAR)` or `(a: INT, b: INT)`, even if our query only has `...ORDER BY a`. There was discussion about using `UNION` columns, or dynamically reformatting data, but all of that was far too complex (and slow) to actually build.

Most DB and query tools impose a schema at read time. That is, if we have files that have undergone schema evolution, as above, we work out at plan time that the common schema is, say, `(a: INT, b: VARCHAR, c: DOUBLE)`, with `c` being `OPTIONAL` (`NULLABLE` in standard DB parlance) since it appears in only some of the files. For example, Presto and Impala avoid the issue by reading into the correct common schema in the first place, rather than cleaning up the mess later in each downstream operator.

Of course, Drill's claim to fame is that it is schema-free: it works out the schema on the fly. That's all cool, except that Drill also supports operators that only work for a single schema. (Joins have the same issue.) That is, we promote a really cool feature, but we can't actually make it work. This is great marketing, but causes frustration in reality.
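To make that plan-time "common schema" idea concrete, here is a minimal, hypothetical sketch in plain Java (not Drill's planner API; the class, record, and method names are invented for illustration) of how a common schema could be computed from per-file schemas: column types must agree across files, and any column absent from at least one file is demoted to `OPTIONAL`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plan-time schema resolution, in the spirit of what
// Presto/Impala-style engines do before any file is read.
class CommonSchemaSketch {
  enum Mode { REQUIRED, OPTIONAL }
  record Column(String name, String type, Mode mode) {}

  static Map<String, Column> resolve(List<Map<String, Column>> fileSchemas) {
    Map<String, Column> common = new LinkedHashMap<>();
    // Union of all columns; a type conflict has no automatic resolution here.
    for (Map<String, Column> schema : fileSchemas) {
      for (Column col : schema.values()) {
        Column prior = common.putIfAbsent(col.name(), col);
        if (prior != null && !prior.type().equals(col.type())) {
          throw new IllegalStateException("Type conflict for column " + col.name());
        }
      }
    }
    // A column missing from any one file must become OPTIONAL (nullable),
    // since that file will have to fill it with NULLs.
    for (Map.Entry<String, Column> entry : common.entrySet()) {
      boolean everywhere =
          fileSchemas.stream().allMatch(s -> s.containsKey(entry.getKey()));
      if (!everywhere) {
        Column c = entry.getValue();
        entry.setValue(new Column(c.name(), c.type(), Mode.OPTIONAL));
      }
    }
    return common;
  }
}
```

Drill, lacking a plan-time schema, has to make the equivalent decision on the fly inside each reader, which is where the trouble described below starts.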
And, as befits any bit of software that has a few miles on its tires, Drill has multiple ways to work around the schema issue. There is a "provided schema" feature that tells readers the common schema; the reader's job is then to map the actual file columns into that schema. The provided schema is supported in EVF, but not, alas, by the Parquet reader. Also, the planner never got around to supporting schemas, so provided schemas only work in rather special cases. There is also a Drill Metastore feature that emulates the Hive Metastore (HMS), but it does not seem to be in common use.

The original way to manage schemas, for Parquet only, is to scan all the files and build up a JSON file that contains the union of all the schemas. I can't recall if this mechanism was supposed to provide type consistency, but it certainly could: we just look at the `c` column in all schemas of all Parquet files in a directory tree and work out a common type. We do this at plan time and send that type along to Parquet in the physical plan. I didn't work in that area, so my knowledge here is a bit light. It's worth a check.

Now, one might say, hold on, isn't there an easy answer? If we read file A and get one schema, then read file B and get another schema, can't we blend them on the fly? Indeed, in some cases this is possible, and EVF implements those cases. However, as I'm fond of saying, "Drill can't predict the future", so there are cases where we get it wrong. For example, the query asks for columns `(a, b, c)`. The first file has `(a, b)`, so we create a dummy column for `c` that will hold all `NULL`s. But of what type? Drill traditionally chooses `INT`. Then we read file 2, which has `(a, b, c: VARCHAR)`. Now we realize that `c` should have been `VARCHAR`, but we've already sent a batch of rows downstream with the "wrong" `INT` type. Plus, remember, Drill is distributed. Those two files might have been read by different fragments running on different machines. They can't communicate until the rows come together in the sort, after which it is too late to fix the problem.

One other tricky item to remember: Drill is fastest when columns are `REQUIRED` (i.e. non-nullable, or `NOT NULL`). This avoids the `NULL` check on each column operation. **BUT** Drill also allows the schema to vary, and we fill in missing columns with `NULL` values. So, unless we have some guarantee that column `c` actually exists in all files read by a query, we are pretty much forced to make `c` be `OPTIONAL` (nullable), even when reading a file that contains `c` as a `NOT NULL` column. Again, Drill can't predict the future, so we can't know whether some future reader (in, perhaps, some other fragment on some other machine) will read a file that doesn't have `c`.

There is a reason that SQL DBs, for 50+ years, have required a schema. It isn't that all those folks were backward, it is that SQL doesn't work without a common schema. (Apache Druid has a similar problem, but it solves it by treating all values as, essentially, untyped: the values change type on the fly as needed.) When faced with bugs of the kind seen here, it is important to sort out which are just "bad code" bugs and which are the "this design just can't work" bugs.

Now, with that background, we can try to sort out the problem you are trying to solve. Does your test case have Parquet files with differing column sets or types? If so, let's identify exactly how the schemas differ, then we can discuss possible solutions.
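To pin down the "dummy column" step in code, here is a hedged, self-contained sketch (plain Java; none of these names are actual Drill classes) of the decision a single reader has to make when a projected column is absent from the file it is reading. With no other information available, it fabricates a null-filled placeholder, and Drill's traditional guess for that placeholder's type is nullable `INT`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-file missing-column handling; illustrative only.
class MissingColumnSketch {
  record ColumnType(String sqlType, boolean nullable) {}

  /** Schema a reader would emit for one file, given the projected column names. */
  static Map<String, ColumnType> schemaForFile(Iterable<String> projected,
                                               Map<String, ColumnType> fileColumns) {
    Map<String, ColumnType> out = new LinkedHashMap<>();
    for (String name : projected) {
      ColumnType actual = fileColumns.get(name);
      if (actual != null) {
        // The column exists in this file: use its real type. A scan spanning
        // several files may still have to relax it to nullable, as noted above.
        out.put(name, actual);
      } else {
        // The column is missing: fabricate a null-filled placeholder.
        // Guessing nullable INT here is the "predicting the future" problem:
        // a later file may reveal the real type to be something else.
        out.put(name, new ColumnType("INT", true));
      }
    }
    return out;
  }
}
```

The guess is harmless only if the column, when it finally appears, really is an `INT`; otherwise downstream operators see a schema change.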
Looking at the specific line you pointed out, I'll hazard a guess as to what it is doing: it is the case we discussed above. Suppose our SQL is `SELECT a, b FROM 'my-dir'`, where `my-dir` is a directory that contains two files: `file1.parquet` contains column `(a: VARCHAR)` and `file2.parquet` contains `(a: VARCHAR, b: VARCHAR)`. Both are read by the same reader (the scan is not distributed in this case). File read order is random.

Suppose we read them in the order `(file2, file1)`. We can set up a schema of `(a: VARCHAR, b: VARCHAR)`. When we read `file1`, we notice that `b` is missing, but that, when it appeared previously, it was `VARCHAR`, so we keep that type and fill with nulls. (This is what EVF does; I'll speculate that Parquet's reader does something similar.) Great.

What if we read the other order? We see we want `(a, b)`, but we have only `(a: VARCHAR)`, so we create that vector. What about `b`? Well, we helpfully create `(b: OPTIONAL INT)`. That is exactly what the code that you pointed to does. When we read `file2`, we see that `b` has "changed schema" to `VARCHAR`, so we throw away the `(a: VARCHAR, b: INT)` schema and start reading with `(a: VARCHAR, b: VARCHAR)`. This then blows up the SORT.

Given this, I don't think the problem is with the column name (which is what that referenced code change handles). The code change in question handles a column name of the form `foo.a`, where `foo` is not a MAP with a member `a`, but where the dotted string is simply the (poorly chosen) name of a single column. That is, the problem is probably not that the test Parquet file columns are actually named `foo.a` and `foo.b` (with dots). You can try changing the code and rerunning the test, but I suspect that the problem won't go away unless you happen to run a test that reverses the file read order. Instead, the problem may be with the part of that code that did not change: Drill is trying to predict the future, and predicting that when the missing column appears, it will be `INT`. It may be that Drill is predicting incorrectly. We need the "crystal ball" module to improve the prediction. Sadly, however, we never got around to writing that module. Hence the multiple attempts to manage schemas globally.

Where does this leave us? If you can pin things down to one very specific case, we can sort out whether it is a "bad code" bug or a "that just won't work given Drill's design" bug. In particular, reading Parquet files with inconsistent schemas, projecting the inconsistent columns, and adding a SORT won't work unless the missing columns turn out to be of type `INT` when they appear. You can get lucky with file read order and work distribution so that, sometimes, it does work. Relying on luck produces flaky tests, however. On the other hand, you can have as much inconsistency as you want as long as the columns you project appear in all files and the types of those columns stay the same. Feel free to add as many other, inconsistent, columns as you like: just don't project them in queries with a SORT.

I'd suggest that, since Drill doesn't handle Parquet schema changes well (even though schema flexibility is supposed to be Drill's compelling feature), we maybe should not test stuff that can't actually work. Test with files with a consistent schema instead. Or, if files have an inconsistent schema, test with the Parquet schema feature enabled and after doing a scan of all the files. (I forget the command: `ANALYZE` maybe?)
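As a sanity check on the two read orders above, here is a small, hypothetical simulation (plain Java, mirroring the reasoning in this comment rather than any actual reader code) that applies the EVF-style rule of reusing the last type seen for a column that later goes missing, and falls back to the `INT` guess otherwise. The `(file2, file1)` order yields a stable schema; the `(file1, file2)` order produces the INT-then-VARCHAR change that blows up the SORT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of how the per-batch schema depends on file read order.
class ReadOrderSketch {
  static List<Map<String, String>> schemasSeen(List<Map<String, String>> files,
                                               List<String> projected) {
    Map<String, String> lastSeenType = new LinkedHashMap<>();  // column -> type
    List<Map<String, String>> emitted = new ArrayList<>();
    for (Map<String, String> file : files) {
      Map<String, String> batchSchema = new LinkedHashMap<>();
      for (String col : projected) {
        String type = file.get(col);
        if (type == null) {
          // Missing column: reuse a previously seen type if any, else guess INT.
          type = lastSeenType.getOrDefault(col, "INT");
        }
        lastSeenType.put(col, type);
        batchSchema.put(col, type);
      }
      emitted.add(batchSchema);
    }
    return emitted;  // Two differing entries mean a schema change downstream.
  }

  public static void main(String[] args) {
    Map<String, String> file1 = Map.of("a", "VARCHAR");                  // no column b
    Map<String, String> file2 = Map.of("a", "VARCHAR", "b", "VARCHAR");
    List<String> projected = List.of("a", "b");
    System.out.println(schemasSeen(List.of(file2, file1), projected));   // stable schema
    System.out.println(schemasSeen(List.of(file1, file2), projected));   // b: INT -> VARCHAR
  }
}
```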
It _may_ be that the Parquet schema feature unifies column types, but then, given the haste with which it was written way back when, maybe it doesn't. Or, to be more modern, test with the Drill Metastore enabled.

This stuff is horribly complex, and I may have missed the mark on this particular bug. But at least we're now on the same page.

> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
> *NOTE*: I worked on this issue along with DRILL-8508. It turned out that it required this bug to get fixed first. And since these two issues are about slightly different things, it was decided to report them separately. So, I'm going to link this issue as a requirement for that issue and open one PR for both (if it's allowed). I think a single PR would make it easier to review the code since the issues are quite related anyway.
>
> h3. Prerequisites
>
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. The field name for the vector is derived from the {{SchemaPath#toExpr}} method, which always encloses the resulting string in backticks.
>
> h3. Problems
>
> This causes some field name equality checks to go wrong (comparing the non-quoted and quoted forms of a field name returns false, but essentially is supposed to return true), which leads to errors.
>
> For example, the errors occur when you select a column from a table where some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
>
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
>
> Now let's try to use an operator that doesn't support schema change. For example, {{ORDER BY}}:
>
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] (state=,code=0)
> {code}
>
> The ORDER BY error clearly shows us that ``age`` is an extra column here and the incoming schema should only have the unquoted field to match the previous schema.
>
> Another example is with the {{UNION ALL}} operator:
>
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] (state=,code=0)
> {code}
>
> Again, the "different field counts" issue is caused by an extra quoted column that counts as a different field.
>
> h3. Solution
> The solution for these errors would be to replace the {{SchemaPath#toExpr}} call with {{SchemaPath#getAsUnescapedPath}}, which doesn't enclose the resulting string in backticks. Simple enough, but note that we used to use {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to {{SchemaPath#toExpr}}. The author even left a comment:
>
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
>
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, honestly, I don't really understand the motivation for this, even with the comment. I don't really understand what "existing maps" are here. Maybe someone from the community can help, but I will further consider it a mistake, simply because it causes the above problems.
>
> h3. Regressions
>
> The change brings some regressions detected by unit tests. The tests I found fail because they changed their execution flow after the fix in these places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
>
> Before the fix, the field name equality checks returned false when comparing quoted and unquoted field names. The failing tests relied on that behavior and worked only under that condition. But after the fix, when the two names compare equal and the check returns true, we fall into a different branch and the tests aren't ready for that.
>
> But obviously the change _is fixing the problem_, so the tests that relied on that problem should now be adjusted.
>
> Please see more technical details on each failed unit test in the linked PR.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)