[
https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-313:
--------------------------------
Labels: pull-request-available (was: )
> Unable to SELECT COUNT(*) from a MOR realtime table
> ---------------------------------------------------
>
> Key: HUDI-313
> URL: https://issues.apache.org/jira/browse/HUDI-313
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
>
> When I run either of these queries in Hive:
> {code:java}
> SELECT COUNT(*) FROM hudi_test_rt;
> -- or:
> SELECT COUNT(1) FROM hudi_test_rt;{code}
> It fails with:
> {code:java}
> 2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|:
> java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
> at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
> ... 19 more
> Caused by: java.lang.NumberFormatException: For input string: ""
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:592)
> at java.lang.Integer.parseInt(Integer.java:615)
> at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
> at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
> at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
> ... 20 more
> {code}
> Some investigation:
> Hive updates the projection column ids each time getRecordReader is
> called. COUNT(\*) and COUNT(1) do not need any projected column, so for
> these queries the projection column ids are an empty string.
> To support compaction on MOR tables, Hudi manually adds its three required
> columns to the projection column ids, producing ids like "2,0,3".
> When Hive then updates the projection column ids, it combines the empty
> string with the ids of the Hudi required columns and ends up with column
> ids like ",2,0,3". The leading comma causes an error at the parsing stage.
>
> One possible solution is to add a method that checks whether the projection
> column ids start with a comma and, if so, removes that leading comma.
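
The failure and the proposed cleanup can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names (ProjectionIdsSketch, parseColumnIds, stripLeadingComma) are hypothetical and are not part of Hive or Hudi, and parseColumnIds merely imitates a comma-split followed by Integer.parseInt rather than reproducing the real ColumnProjectionUtils code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of Hive or Hudi.
final class ProjectionIdsSketch {

    // Imitates the parsing step seen in the trace: split the id string on
    // commas and pass every piece to Integer.parseInt.
    static List<Integer> parseColumnIds(String ids) {
        List<Integer> result = new ArrayList<>();
        if (ids.isEmpty()) {
            return result; // no projected columns, e.g. COUNT(*)
        }
        for (String piece : ids.split(",")) {
            // A leading comma yields an empty first piece, and
            // Integer.parseInt("") throws NumberFormatException.
            result.add(Integer.parseInt(piece));
        }
        return result;
    }

    // Proposed cleanup: remove a single leading comma if there is one.
    static String stripLeadingComma(String ids) {
        if (ids != null && ids.startsWith(",")) {
            return ids.substring(1);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Empty projection from COUNT(*) combined with Hudi's required ids.
        String combined = "" + "," + "2,0,3";

        try {
            parseColumnIds(combined);
        } catch (NumberFormatException e) {
            System.out.println("Parsing the raw ids failed: " + e.getMessage());
        }

        // With the leading comma removed, parsing succeeds.
        System.out.println(parseColumnIds(stripLeadingComma(combined)));
    }
}
```

Stripping only one leading comma keeps the change narrow: ids that are already well formed, such as "2,0,3", pass through unchanged, and an empty id string stays empty.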