[ 
https://issues.apache.org/jira/browse/HIVE-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das updated HIVE-28938:
-----------------------------------
    Description: 
To reproduce, run:
{noformat}
create external table test(id int, arr array<string>) stored by iceberg;
insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, 
array("e", "f"));

select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1;
{noformat}
Error:
{noformat}
org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column 
reference 'BLOCK__OFFSET__INSIDE__FILE'
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat}
 

BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native 
tables, but it is present because of the misclassification of ICEBERG tables as 
TableType.NATIVE in the method [obtainTableType(Table 
tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272].
 This returns NATIVE for ICEBERG tables and subsequently adds incorrect virtual 
columns in 
[genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079].

Virtual column adding logic should be similar to 
[here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990].

P.S. This only occurs for multiple lateral views. My guess is that when we just 
have one lateral view, unnecessary fields are trimmed from the HiveProject on 
top of the HiveTableFunctionScan. But when we have two or more lateral views, 
HiveProjects which are not at the top of the plan has all fields as the lateral 
function on top of the Project (in HiveTableFunctionScan) refers to them.

For example, this is a plan with two lateral views:
{noformat}
query: explain cbo
select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1

CBO PLAN:
HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9])
  HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
$5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) 
tbl2.name1)])
    HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], 
PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], 
PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8])
      HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, 
$4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
        HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat}
We can see the second Project has all columns because the lateral function on 
top of it refers to them. Whereas, this is a plan with just one lateral view:
{noformat}
query: explain cbo
select * from test
lateral view explode(arr) tbl1 as name

CBO PLAN:
HiveProject(id=[$0], arr=[$1], name=[$8])
  HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
$5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, 
INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
    HiveTableScan(table=[[default, test]], table:alias=[test]){noformat}
 

P.P.S: The plans above are after applying the fix, so we only see correct 
virtual columns, but if the fix was not present, we would also see virtual 
columns like BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE.

  was:
To reproduce, run:
{noformat}
create external table test(id int, arr array<string>) stored by iceberg;
insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, 
array("e", "f"));

select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1;
{noformat}
Error:
{noformat}
org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column 
reference 'BLOCK__OFFSET__INSIDE__FILE'
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat}
 

BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native 
tables, but it is present because of the misclassification of ICEBERG tables as 
TableType.NATIVE in the method [obtainTableType(Table 
tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272].
 This returns NATIVE for ICEBERG tables and subsequently adds incorrect virtual 
columns in 
[genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079].

Virtual column adding logic should be similar to 
[here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990].

P.S. This only occurs for multiple lateral views. My guess is that when we just 
have one lateral view, unnecessary fields are trimmed from the HiveProject on 
top of the HiveTableFunctionScan. But when we have two or more lateral views, 
HiveProjects which are not at the top of the plan has all fields as the lateral 
function on top of the Project (in HiveTableFunctionScan) refers to them.

For example, this is a plan with two lateral views:
{noformat}
query: explain cbo
select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1

CBO PLAN:
HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9])
  HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
$5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) 
tbl2.name1)])
    HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], 
PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], 
PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8])
      HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, 
$4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
        HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat}
We can see the second Project has all columns because the lateral function on 
top of it refers to them. Whereas, this is a plan with just one lateral view:
{noformat}
query: explain cbo
select * from test
lateral view explode(arr) tbl1 as name

CBO PLAN:
HiveProject(id=[$0], arr=[$1], name=[$8])
  HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
$5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, 
INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
    HiveTableScan(table=[[default, test]], table:alias=[test]){noformat}


> Error in LATERAL VIEW with non native tables due to presence of incorrect 
> virtual columns in RowResolver
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-28938
>                 URL: https://issues.apache.org/jira/browse/HIVE-28938
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.1
>            Reporter: Soumyakanti Das
>            Assignee: Soumyakanti Das
>            Priority: Major
>
> To reproduce, run:
> {noformat}
> create external table test(id int, arr array<string>) stored by iceberg;
> insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, 
> array("e", "f"));
> select * from test
> lateral view explode(arr) tbl1 as name
> lateral view explode(arr) tbl2 as name1;
> {noformat}
> Error:
> {noformat}
> org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid column 
> reference 'BLOCK__OFFSET__INSIDE__FILE'
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:14032)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13971)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:5027)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4801)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:11888)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11827)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12812)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12665)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12696)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12678)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:688)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13572)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:489)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:324)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:227)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:108){noformat}
>  
> BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE should not be present for non native 
> tables, but it is present because of the misclassification of ICEBERG tables 
> as TableType.NATIVE in the method [obtainTableType(Table 
> tabMetaData)|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3272].
>  This returns NATIVE for ICEBERG tables and subsequently adds incorrect 
> virtual columns in 
> [genTableLogicalPlan|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L3079].
> Virtual column adding logic should be similar to 
> [here|https://github.com/apache/hive/blob/e7cdc0cac935b8e2c573edad9d6e790d8beeee40/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L11990].
> P.S. This only occurs for multiple lateral views. My guess is that when we 
> just have one lateral view, unnecessary fields are trimmed from the 
> HiveProject on top of the HiveTableFunctionScan. But when we have two or more 
> lateral views, HiveProjects which are not at the top of the plan has all 
> fields as the lateral function on top of the Project (in 
> HiveTableFunctionScan) refers to them.
> For example, this is a plan with two lateral views:
> {noformat}
> query: explain cbo
> select * from test
> lateral view explode(arr) tbl1 as name
> lateral view explode(arr) tbl2 as name1
> CBO PLAN:
> HiveProject(id=[$0], arr=[$1], name=[$8], name1=[$9])
>   HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
> $5, $6, $7, $8)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
> arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
> FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
> BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name, VARCHAR(2147483647) 
> tbl2.name1)])
>     HiveProject(id=[$0], arr=[$1], PARTITION__SPEC__ID=[$2], 
> PARTITION__HASH=[$3], FILE__PATH=[$4], ROW__POSITION=[$5], 
> PARTITION__PROJECTION=[$6], SNAPSHOT__ID=[$7], tbl1.name=[$8])
>       HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, 
> $4, $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY 
> arr, INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
> FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
> BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
>         HiveTableScan(table=[[default, test]], table:alias=[test]) {noformat}
> We can see the second Project has all columns because the lateral function on 
> top of it refers to them. Whereas, this is a plan with just one lateral view:
> {noformat}
> query: explain cbo
> select * from test
> lateral view explode(arr) tbl1 as name
> CBO PLAN:
> HiveProject(id=[$0], arr=[$1], name=[$8])
>   HiveTableFunctionScan(invocation=[LATERAL(explode($1), $0, $1, $2, $3, $4, 
> $5, $6, $7)], rowType=[RecordType(INTEGER id, VARCHAR(2147483647) ARRAY arr, 
> INTEGER PARTITION__SPEC__ID, BIGINT PARTITION__HASH, VARCHAR(2147483647) 
> FILE__PATH, BIGINT ROW__POSITION, VARCHAR(2147483647) PARTITION__PROJECTION, 
> BIGINT SNAPSHOT__ID, VARCHAR(2147483647) tbl1.name)])
>     HiveTableScan(table=[[default, test]], table:alias=[test]){noformat}
>  
> P.P.S: The plans above are after applying the fix, so we only see correct 
> virtual columns, but if the fix was not present, we would also see virtual 
> columns like BLOCK_{_}OFFSET{_}{_}INSIDE{_}_FILE.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to