affei opened a new issue #3478: URL: https://github.com/apache/hudi/issues/3478
**Describe the problem you faced**

Using `set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` gives an error:

    Error: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:128)
        at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:41)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:69)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:216)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
    Caused by: java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
        at java.util.ArrayList.rangeCheck(ArrayList.java:659)
        at java.util.ArrayList.get(ArrayList.java:435)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedGroupFields(DataWritableReadSupport.java:121)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getSchemaByName(DataWritableReadSupport.java:181)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:369)
        at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
        at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:122)

At the same time, if I use `hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat`, there is no error, but data is included from all files (not only the last commits).

The problem appears only on aggregates. A query like `select * from orders` doesn't give an error. But if I try something like `select order_id, count(*) as co from orders group by order_id order by co desc`, or just `select count(*) as co from orders`, the error appears.

**To Reproduce**

Steps to reproduce the behavior:

1. `set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat`
2. Try a select with `id, count(*)` on the table - you will get an error (at least I do).
3. `set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat`
4. Try a select with `id, count(*)` on the table - you will get duplicated records, but id is a primary key, so there should be no duplicates. It looks like the query runs on all parquet files, not just the last commits.

**Expected behavior**

`hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat` should not raise an error and should return values without duplicates.

**Environment Description**

* Hudi version : master
* Spark version : 2.4.7
* Hive version : 2.3.8
* Hadoop version : 2.7.3.2.6.5.1175-1
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
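
For convenience, the reproduction steps above can be condensed into a single Hive session, roughly as sketched below. This is only an illustration built from the queries quoted in the report: it assumes `orders` is the Hudi table in question and that `order_id` is the primary-key column the steps refer to as `id`.

```sql
-- Case 1: Hudi input format.
-- Per the report, the aggregate query fails with
-- java.lang.IndexOutOfBoundsException: Index: 3, Size: 3.
SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;

SELECT order_id, count(*) AS co
FROM orders
GROUP BY order_id
ORDER BY co DESC;

-- Case 2: default combine input format.
-- Per the report, the same query succeeds but returns co > 1 for some keys,
-- which suggests that older file versions are being read as well.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

SELECT order_id, count(*) AS co
FROM orders
GROUP BY order_id
ORDER BY co DESC;
```

Since `order_id` is assumed to be unique, any row with `co` greater than 1 in the second result would indicate that more than the latest commit is being scanned.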
