[ https://issues.apache.org/jira/browse/HIVE-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Soumyakanti Das updated HIVE-28938: ----------------------------------- Description: To reproduce, run: {noformat} create external table test(id int, arr array<string>) stored by iceberg; insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, array("e", "f")); select * from test lateral view explode(arr) tbl1 as name lateral view explode(arr) tbl2 as name1; {noformat} Error: {noformat} org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column reference 'BLOCK__OFFSET__INSIDE__FILE' at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324) at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227) at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat} BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native tables, but it is present because of the misclassification of ICEBERG tables as TableType.NATIVE in the method [obtainTableType(Table tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272]. This returns NATIVE for ICEBERG tables and subsequently adds incorrect virtual columns in [genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079]. Virtual column adding logic should be similar to [here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990]. P.S. This only occurs for multiple lateral views. My guess is that when we just have one lateral view, unnecessary fields are trimmed from the HiveProject on top of the HiveTableFunctionScan. But when we have two or more lateral views, HiveProjects which are not at the top of the plan has all fields as the lateral function on top of the Project (in HiveTableFunctionScan) refers to them. For example, this is a plan with two lateral views: {noformat} query: explain cbo select * from test lateral view explode(arr) tbl1 as name lateral view explode(arr) tbl2 as name1 CBO PLAN: HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) tbl2.name1)]) HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat} We can see the second Project has all columns because the lateral function on top of it refers to them. Whereas, this is a plan with just one lateral view: {noformat} query: explain cbo select * from test lateral view explode(arr) tbl1 as name CBO PLAN: HiveProject(id=[$0], arr=[$1], name=[$8]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) HiveTableScan(table=[[default, test]], table:alias=[test]){noformat} P.P.S: The plans above are after applying the fix, so we only see correct virtual columns, but if the fix was not present, we would also see virtual columns like BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE. was: To reproduce, run: {noformat} create external table test(id int, arr array<string>) stored by iceberg; insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, array("e", "f")); select * from test lateral view explode(arr) tbl1 as name lateral view explode(arr) tbl2 as name1; {noformat} Error: {noformat} org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column reference 'BLOCK__OFFSET__INSIDE__FILE' at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324) at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227) at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat} BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native tables, but it is present because of the misclassification of ICEBERG tables as TableType.NATIVE in the method [obtainTableType(Table tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272]. This returns NATIVE for ICEBERG tables and subsequently adds incorrect virtual columns in [genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079]. Virtual column adding logic should be similar to [here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990]. P.S. This only occurs for multiple lateral views. My guess is that when we just have one lateral view, unnecessary fields are trimmed from the HiveProject on top of the HiveTableFunctionScan. But when we have two or more lateral views, HiveProjects which are not at the top of the plan has all fields as the lateral function on top of the Project (in HiveTableFunctionScan) refers to them. For example, this is a plan with two lateral views: {noformat} query: explain cbo select * from test lateral view explode(arr) tbl1 as name lateral view explode(arr) tbl2 as name1 CBO PLAN: HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) tbl2.name1)]) HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat} We can see the second Project has all columns because the lateral function on top of it refers to them. Whereas, this is a plan with just one lateral view: {noformat} query: explain cbo select * from test lateral view explode(arr) tbl1 as name CBO PLAN: HiveProject(id=[$0], arr=[$1], name=[$8]) HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) HiveTableScan(table=[[default, test]], table:alias=[test]){noformat} > Error in LATERAL VIEW with non native tables due to presence of incorrect > virtual columns in RowResolver > -------------------------------------------------------------------------------------------------------- > > Key: HIVE-28938 > URL: https://issues.apache.org/jira/browse/HIVE-28938 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 4.0.1 > Reporter: Soumyakanti Das > Assignee: Soumyakanti Das > Priority: Major > > To reproduce, run: > {noformat} > create external table test(id int, arr array<string>) stored by iceberg; > insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, > array("e", "f")); > select * from test > lateral view explode(arr) tbl1 as name > lateral view explode(arr) tbl2 as name1; > {noformat} > Error: > {noformat} > org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column > reference 'BLOCK__OFFSET__INSIDE__FILE' > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678) > at > org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688) > at > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572) > at > org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489) > at > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324) > at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227) > at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat} > > BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native > tables, but it is present because of the misclassification of ICEBERG tables > as TableType.NATIVE in the method [obtainTableType(Table > tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272]. > This returns NATIVE for ICEBERG tables and subsequently adds incorrect > virtual columns in > [genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079]. > Virtual column adding logic should be similar to > [here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990]. > P.S. This only occurs for multiple lateral views. My guess is that when we > just have one lateral view, unnecessary fields are trimmed from the > HiveProject on top of the HiveTableFunctionScan. But when we have two or more > lateral views, HiveProjects which are not at the top of the plan has all > fields as the lateral function on top of the Project (in > HiveTableFunctionScan) refers to them. > For example, this is a plan with two lateral views: > {noformat} > query: explain cbo > select * from test > lateral view explode(arr) tbl1 as name > lateral view explode(arr) tbl2 as name1 > CBO PLAN: > HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9]) > HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, > $5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY > arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) > FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, > BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) > tbl2.name1)]) > HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], > PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], > PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8]) > HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, > $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY > arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) > FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, > BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) > HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat} > We can see the second Project has all columns because the lateral function on > top of it refers to them. Whereas, this is a plan with just one lateral view: > {noformat} > query: explain cbo > select * from test > lateral view explode(arr) tbl1 as name > CBO PLAN: > HiveProject(id=[$0], arr=[$1], name=[$8]) > HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, > $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, > INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) > FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, > BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)]) > HiveTableScan(table=[[default, test]], table:alias=[test]){noformat} > > P.P.S: The plans above are after applying the fix, so we only see correct > virtual columns, but if the fix was not present, we would also see virtual > columns like BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE. -- This message was sent by Atlassian Jira (v8.20.10#820010)