[jira] [Created] (HUDI-313) Unable SELECT COUNT(*) from a MOR realtime table

Wenning Ding (Jira) Wed, 23 Oct 2019 10:55:16 -0700

Wenning Ding created HUDI-313:
---------------------------------

             Summary: Unable SELECT COUNT(*) from a MOR realtime table
                 Key: HUDI-313
                 URL: https://issues.apache.org/jira/browse/HUDI-313
             Project: Apache Hudi (incubating)
          Issue Type: Bug
            Reporter: Wenning Ding



When I run either of these queries in Hive:
{code:sql}
SELECT COUNT(*) FROM hudi_test_rt;
-- or:
SELECT COUNT(1) FROM hudi_test_rt;
{code}
It fails with:
{code:java}
2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|:
java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    ... 19 more
Caused by: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:592)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
    at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
    ... 20 more
{code}
Some investigation:

Hive updates the projection column ids on every getRecordReader call. COUNT(*)
and COUNT(1) do not need any projected columns, so for these queries the
projection column ids are an empty string.
To support compaction on MOR tables, Hudi manually adds its three required
columns to the projection column ids, which gives column ids such as "2,0,3".
When Hive then updates the projection column ids, it joins the empty string
with the Hudi required column ids and ends up with ",2,0,3". The leading comma
causes the NumberFormatException above when the ids are parsed.
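
The failure can be reproduced outside Hive with a small stand-alone sketch.
It is not Hive or Hudi source code: the class ProjectionIdsDemo and the methods
appendIds and parseIds are hypothetical names, and the sketch assumes only what
the stack trace shows, namely that the id string is split on commas and each
token is passed to Integer.parseInt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the failure mode; not Hive or Hudi code.
final class ProjectionIdsDemo {

    // Naively joins two comma-separated id strings, even if the first is empty.
    static String appendIds(String existing, String extra) {
        return existing + "," + extra;
    }

    // Splits on commas and parses each token with Integer.parseInt.
    // Assumption: this mirrors the parsing step seen in the stack trace.
    static List<Integer> parseIds(String ids) {
        List<Integer> result = new ArrayList<>();
        for (String token : ids.split(",")) {
            result.add(Integer.parseInt(token)); // throws on an empty token
        }
        return result;
    }

    public static void main(String[] args) {
        String fromHive = "";       // COUNT(*) needs no projected columns
        String fromHudi = "2,0,3";  // example ids from the report
        String combined = appendIds(fromHive, fromHudi);
        System.out.println("combined ids: " + combined);
        try {
            parseIds(combined);
        } catch (NumberFormatException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```

The leading comma yields an empty first token after the split, and
Integer.parseInt rejects the empty string.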

 

One possible solution is to add a method that checks whether the projection
column ids start with a comma and, if they do, removes it.
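
A minimal sketch of such a check follows. The class LeadingCommaFix and the
method stripLeadingComma are hypothetical names chosen for illustration, and
where the call would go inside Hudi's input format is not decided here.

```java
// Hypothetical helper illustrating the proposed check; not an actual Hudi API.
final class LeadingCommaFix {

    // Returns the id string without a single leading comma, if one is present.
    // Null and strings that do not start with a comma are returned unchanged.
    static String stripLeadingComma(String ids) {
        if (ids != null && ids.startsWith(",")) {
            return ids.substring(1);
        }
        return ids;
    }

    public static void main(String[] args) {
        // The malformed value from the report and an already valid value.
        System.out.println(stripLeadingComma(",2,0,3"));
        System.out.println(stripLeadingComma("2,0,3"));
    }
}
```

This removes only one leading comma, which matches the case described in this
report. Other malformed inputs, such as repeated or trailing commas, are out
of scope for this sketch.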



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
