liucongjy opened a new issue, #11609: URL: https://github.com/apache/hudi/issues/11609
**Describe the problem you faced**

Hive SQL cannot read a COPY_ON_WRITE table written by Flink SQL: the query fails with `java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable`, and the partition column's value is returned in place of a `DOUBLE` column.

**To Reproduce**

Steps to reproduce the behavior:

1. Create the table with Flink SQL and insert data:

   ```sql
   CREATE CATALOG hoodie_catalog WITH (
     'type' = 'hudi',
     'catalog.path' = '/tmp/hudi',
     'hive.conf.dir' = '/opt/hive-2.3.9/conf',
     'mode' = 'hms'
   );

   USE CATALOG hoodie_catalog;

   CREATE DATABASE hoodie_catalog.flink_hudi;

   CREATE CATALOG myhive WITH (
     'type' = 'hive',
     'hive-conf-dir' = '/opt/hive-2.3.9/conf'
   );

   CREATE TABLE hoodie_catalog.flink_hudi.TEST_COW (
     jllsh VARCHAR(100),
     syh VARCHAR(50) PRIMARY KEY NOT ENFORCED,
     jzlsh VARCHAR(40),
     grbsh VARCHAR(20),
     dalx VARCHAR(10),
     dzjkkkh VARCHAR(40),
     yzzh VARCHAR(40),
     yzbs VARCHAR(40),
     fqsj VARCHAR(10),
     dj DOUBLE,
     ze DOUBLE,
     zzsj TIMESTAMP
   ) PARTITIONED BY (`fqsj`) WITH (
     'connector' = 'hudi',
     'path' = 'hdfs://bigdata01:8020/tmp/hudi/flink_hudi/TEST_COW',
     'table.type' = 'COPY_ON_WRITE',
     'precombine.field' = 'syh',
     'write.operation' = 'upsert',
     'hoodie.datasource.hive_sync.support_timestamp' = 'true'
   );

   INSERT INTO hoodie_catalog.flink_hudi.TEST_COW
   SELECT jllsh, syh, jzlsh, grbsh, dalx, dzjkkkh, yzzh, yzbs, fqsj, dj, ze, zzsj
   FROM myhive.ods.prdata
   WHERE pch = '202407090102000'
   LIMIT 100;
   ```

2. Read the table with Flink SQL; the data is read back correctly:

   ```sql
   SELECT * FROM hoodie_catalog.flink_hudi.EHR_ZYYZMX_PRE_COW;
   ```

3. Read the data with Hive SQL:

   ```sql
   USE flink_hudi;
   SELECT dj, ze, zzsj FROM test_cow;
   ```

**Expected behavior**

The Hive query should return the `dj`, `ze`, and `zzsj` columns. Instead it fails with:

```
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable
```

**Environment Description**

* Hudi version : 0.14
* Spark version : 3.3
* Hive version : 2.3.9
* Hadoop version : 3.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Additional context**

Debugging the Hive log output, I can see that the row read out is `[2014-06-06, 0.0, 0.0]`. In my Hive query (`select dj,ze,zzsj from test_cow;`) the first column is `dj`, a `DOUBLE` field, but the value read out in that position is the partition column's value.

The class that logs the `[2014-06-06, 0.0, 0.0]` row is `org.apache.hadoop.hive.ql.exec.ListSinkOperator`, and the method invoked is:

```java
public void process(Object row, int tag) throws HiveException {
  try {
    LOG.info("row data: {}, tag: {}, class: {}", row, tag, row.getClass().getName());
    LOG.info("row object inspectors: {}", inputObjInspectors);
    ClassLoader classLoader = fetcher.getClass().getClassLoader();
    if (classLoader != null) {
      // Try to locate the resource URLs for the fetcher's class file
      try {
        Enumeration<URL> urls = classLoader.getResources(
            fetcher.getClass().getName().replace('.', '/') + ".class");
        while (urls.hasMoreElements()) {
          URL url = urls.nextElement();
          LOG.info("Class " + fetcher.getClass().getName() + " is loaded from: " + url);
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    } else {
      LOG.info("Class " + fetcher.getClass().getName() + " is loaded by the bootstrap class loader.");
    }
    res.add(fetcher.convert(row, inputObjInspectors[0]));
    numRows++;
    runTimeNumRows++;
  } catch (Exception e) {
    throw new HiveException(e);
  }
}
```

**Stacktrace**

```
2024-07-08T14:42:03,814 ERROR [62dfdf81-d91c-40f8-91a2-d84f761a3671 main] CliDriver: Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:165)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:407)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:825)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:763)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:690)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```
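As a plain-Java illustration (not Hudi or Hive code; the class and method names below are hypothetical), the symptom above is what a column-order mismatch between the reader's expected schema and the materialized row looks like: if the partition value `fqsj` is delivered in the slot the reader believes holds the `DOUBLE` column `dj`, the first cast fails, analogous to the `Text` → `DoubleWritable` cast in the stacktrace:

```java
public class ColumnOrderMismatchDemo {
    // Hypothetical row layout: the partition string landed first,
    // i.e. (fqsj, dj, ze) instead of the expected (dj, ze, zzsj).
    static Object[] misalignedRow() {
        return new Object[] { "2014-06-06", 0.0d, 0.0d };
    }

    // The reader assumes index 0 is the DOUBLE column `dj` and casts blindly,
    // just as Hive casts the Writable it fetched to DoubleWritable.
    static double readDj(Object[] row) {
        return (Double) row[0]; // throws ClassCastException: String -> Double
    }

    public static void main(String[] args) {
        try {
            readDj(misalignedRow());
        } catch (ClassCastException e) {
            // Analogous to: org.apache.hadoop.io.Text cannot be cast to
            // org.apache.hadoop.io.DoubleWritable
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

Given this, one thing worth checking is whether the column order that Hive's metastore holds for `test_cow` (`DESCRIBE FORMATTED test_cow;`) matches the column order in the Parquet files written by Flink.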

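To make the shift concrete, here is a small sketch (hypothetical helper, not Hive's actual projection logic) resolving `SELECT dj, ze, zzsj` against the table's declared schema: the projected columns should map to physical positions 9–11, well clear of the partition column `fqsj` at position 8, so seeing the partition value in the first result column suggests the reader resolved the projection against a differently ordered schema:

```java
import java.util.Arrays;
import java.util.List;

public class ProjectionIndexDemo {
    // Full schema of TEST_COW in declared order, partition column included.
    static final List<String> SCHEMA = Arrays.asList(
        "jllsh", "syh", "jzlsh", "grbsh", "dalx", "dzjkkkh",
        "yzzh", "yzbs", "fqsj", "dj", "ze", "zzsj");

    // Map projected column names to their physical positions in the schema.
    static int[] projectionIndices(List<String> projected) {
        return projected.stream().mapToInt(SCHEMA::indexOf).toArray();
    }

    public static void main(String[] args) {
        // `SELECT dj, ze, zzsj` resolves to positions 9, 10, 11.
        System.out.println(Arrays.toString(
            projectionIndices(Arrays.asList("dj", "ze", "zzsj")))); // [9, 10, 11]
    }
}
```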