Z1Wu commented on issue #6673:
URL: https://github.com/apache/incubator-gluten/issues/6673#issuecomment-2262441003

   > @Z1Wu It looks like table schema is same.(DESCRIBE FORMATTED <table_name>) 
old table:
   
   A Hive ORC table has a table schema, and each of its ORC data files should
   also contain the schema. However, ORC data files written by some older
   engines (such as hive-1.x) contain an incomplete schema: the column names
   are missing.
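
To illustrate why missing column names matter, here is a minimal Python sketch. It assumes a reader that resolves table columns by name against the file schema; that is an assumption for illustration, not Gluten's actual code. The helper names `resolve_by_name` and `resolve_by_position` and the sample row are hypothetical.

```python
def resolve_by_name(table_columns, file_columns, row):
    """Look up each table column by name in the file schema.

    A table column whose name is absent from the file schema yields None (NULL).
    """
    index = {name: i for i, name in enumerate(file_columns)}
    return {col: (row[index[col]] if col in index else None) for col in table_columns}


def resolve_by_position(table_columns, row):
    """Pair the i-th table column with the i-th value of the file row."""
    return dict(zip(table_columns, row))


# Illustrative values only: table schema vs. placeholder names in the data file.
table_columns = ["id", "name"]
file_columns = ["_col0", "_col1"]
row = (1, "a")

print(resolve_by_name(table_columns, file_columns, row))
print(resolve_by_position(table_columns, row))
```

Under this assumption, name-based lookup finds neither `id` nor `name` in the file schema, so every value comes back as NULL, while positional lookup still recovers the data.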
   
   For a Hive ORC table created by:
   ```
   CREATE TABLE `test_orc_table_hive_gluten`(
     `id` int,
     `name` string)
   PARTITIONED BY (
     `dt` string)
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
   STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
   ```
   
   You can get the schema of an ORC data file with this command:
   
   ```
   # hive --orcfiledump <your orc data file>
   hive --orcfiledump hdfs:///data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0
   ```
   
   The output for a malformed ORC schema looks like the example below. An ORC
   file whose schema is `Type: struct<_col0:int,_col1:string>` can't be read
   by Gluten: the result is always NULL. The expected ORC file schema is
   `Type: struct<id:int,name:string>`.
   
   ```
   File Version: 0.12 with HIVE_8732
   24/08/01 15:28:01 INFO orc.ReaderImpl: Reading ORC rows from hdfs://data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
   Rows: 2
   Compression: SNAPPY
   Compression size: 262144
   Type: struct<_col0:int,_col1:string>
   
   Stripe Statistics:
     Stripe 1:
       Column 0: count: 2 hasNull: false
       Column 1: count: 2 hasNull: false min: 1 max: 2 sum: 3
       Column 2: count: 2 hasNull: false min: a max: b sum: 2
   ```
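
The check on the `Type:` line can be automated. The following is a rough Python sketch that scans saved `hive --orcfiledump` output for the `Type: struct<...>` line and reports whether every top-level column name is a placeholder such as `_col0`. The function names are hypothetical, and the sketch assumes the dump prints the schema on a single `Type:` line, as in the output above.

```python
import re

# Matches the "Type: struct<...>" line printed by `hive --orcfiledump`.
TYPE_LINE = re.compile(r"^[ \t]*Type:[ \t]*struct<(.*)>[ \t]*$", re.MULTILINE)
# Placeholder names such as _col0, _col1 written in place of real column names.
PLACEHOLDER = re.compile(r"^_col\d+$")


def top_level_fields(struct_body):
    """Split 'a:int,b:struct<x:int>' on commas that are not nested in <...> or (...)."""
    fields, depth, start = [], 0, 0
    for i, ch in enumerate(struct_body):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(struct_body[start:i])
            start = i + 1
    if struct_body[start:]:
        fields.append(struct_body[start:])
    return fields


def column_names(dump_text):
    """Return the top-level column names from the dump's Type line."""
    match = TYPE_LINE.search(dump_text)
    if match is None:
        raise ValueError("no 'Type: struct<...>' line found")
    return [field.split(":", 1)[0].strip() for field in top_level_fields(match.group(1))]


def has_placeholder_names(dump_text):
    """True when every top-level column name looks like _colN."""
    names = column_names(dump_text)
    return bool(names) and all(PLACEHOLDER.match(name) for name in names)
```

For example, passing the dump text above to `has_placeholder_names` flags the file, while a dump containing `Type: struct<id:int,name:string>` does not.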
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

