[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878660#comment-17878660 ]
ASF GitHub Bot commented on DRILL-8507: --------------------------------------- ychernysh commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2325315052 @paul-rogers > though, it seems, without using Calcite for the type propagation Sorry, I don't really understand the question regarding Calcite (I'm not familiar with it). I didn't consider Calcite in my work, so, probably, your statement is correct. > Missing parquet columns quoted with backticks conflict with existing ones > ------------------------------------------------------------------------- > > Key: DRILL-8507 > URL: https://issues.apache.org/jira/browse/DRILL-8507 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.21.2 > Reporter: Yaroslav > Priority: Major > Attachments: people.tar.gz > > > {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that > it required this bug to get fixed first. And since these 2 are about a little > bit different things it was decided to report them as separate issues. So, > I'm going to link this issue as a requirement for that issue and open one PR > for both (if it's allowed..). I think single PR would make it easier to > review the code since the issues are quite related anyway. > h3. Prerequisites > If a {{ParquetRecordReader}} doesn't find a selected column, it creates a > null-filled {{NullableIntVector}} with the column's name and the correct > value count set. The field name for the vector is derived from > {{SchemaPath#toExpr}} method, which always enquotes the outcome string with > backticks. > h3. Problems > This causes some wrong field name equality checks (comparing two strings of > field names, non-quoted and quoted, returns false, but essentially is > supposed to return true) that lead to some errors. > For example, the errors occur when you select a column from a table where > some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} > table with such parquet files and their schemas: > {code:java} > /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | > age<INT(OPTIONAL)> > /tmp/people/1.parquet: id<INT(REQUIRED)>{code} > Now let's try to use an operator that doesn't support schema change. For > example, {{{}ORDER BY{}}}: > {code:java} > apache drill> SELECT age FROM dfs.tmp.people ORDER BY age; > Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External > Sort. Please enable Union type. > Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], > selectionVector=NONE] > Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` > (INT:OPTIONAL)]], selectionVector=NONE] > Fragment: 0:0 > [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] > (state=,code=0) > {code} > ORDER BY error clearly shows us that ``age`` is an extra column here and the > incoming schema should only have the unquoted field to match the previous > schema. > Another example is in {{UNION ALL}} operator: > {code:java} > apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1)); > Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch > have different field counthas! > Fragment: 0:0 > Please, refer to logs for more information. > [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] > (state=,code=0) > {code} > Again, "different field counts" issue is caused by an extra quoted column > that counts as different field. > h3. Solution > The solution for these errors would be to replace {{SchemaPath#toExpr}} call > with {{{}SchemaPath#getAsUnescapedPath{}}}, which doesn't enquote the outcome > string. Simply enough, but note that we used to use > {{SchemaPath#getAsUnescapedPath}} before DRILL-4264 where we switched to > {{{}SchemaPath#toExpr{}}}. The author even put a comment: > {code:java} > // col.toExpr() is used here as field name since we don't want to see these > fields in the existing maps{code} > So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. > But, honestly, I don't really understand the motivation for this, even with > the comment. I don't really understand what "existing maps" are here. Maybe > someone from the community can help here, but I will further consider it as a > mistake, simply because it causes the above problems. > h3. Regressions > The change brings some regressions detected by unit tests. Those I found fail > because they changed their execution flow after the fix in these places: > * > [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206] > * > [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524] > Before the fix, the field name equality checks returned false from comparing > quoted and unquoted field names. The failing tests relied on that behavior > and worked only with this condition. But after the fix, when we compare two > quoted names and return true, we fall to a different branch and the tests > aren't ready for that. > But obviously the change _is fixing the problem_ so the tests that relied on > that problem should now be adjusted. > Please see more technical details on each failed unit test in the linked PR. > > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)