affei opened a new issue #3478:
URL: https://github.com/apache/hudi/issues/3478


   **Describe the problem you faced**
   
   Using `set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` gives an error:
   
   `Error: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:128)
    at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:41)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:69)
    at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:216)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
   Caused by: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedGroupFields(DataWritableReadSupport.java:121)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getSchemaByName(DataWritableReadSupport.java:181)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:369)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:122)`
   
   At the same time, if I use `hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat`, there is no error, but the result includes data from all files (not only the latest commits).
   
   The problem appears only with aggregate queries. A query like `select * from orders` doesn't give an error. But if I try something like:
   `select order_id, count(*) as co
   from orders
   group by order_id
   order by co desc`
   or just
   `select count(*) as co from orders`
   the error appears.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. `set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat`
   2. Try a select with `id, count(*)` on the table: you will get an error (at least I do).
   3. `set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat`
   4. Try a select with `id, count(*)` on the table: you will get duplicated records, even though id is a primary key, so there should be no duplicates. It looks like the query runs over all Parquet files, not just the latest commits.
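
   The steps above can be collected into a single Hive session, as a minimal sketch. It assumes the `orders` table and `order_id` column from the queries earlier in this report (the steps call the key column `id`). The comments restate the behavior I observed; they are not captured output.

   ```sql
   -- Steps 1-2: with the Hudi input format, the aggregate query fails
   -- with the IndexOutOfBoundsException shown above.
   set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
   select order_id, count(*) as co
   from orders
   group by order_id
   order by co desc;

   -- Steps 3-4: with CombineHiveInputFormat, the same query runs, but
   -- counts greater than 1 show up for a key that should be unique.
   set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
   select order_id, count(*) as co
   from orders
   group by order_id
   order by co desc;
   ```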
   
   **Expected behavior**
   
   `hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` should not raise an error and should return values without duplicates.
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.8
   
   * Hadoop version : 2.7.3.2.6.5.1175-1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   

