xiarixiaoyao commented on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-806721657
Test steps:
before patch:
step1:
val df = spark.range(0, 100000).toDF("keyid")
  .withColumn("col3", expr("keyid"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// create hoodie table hive_14b
merge(df, 4, "default", "hive_14b",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
Notice: bulk_insert produces four base files in the Hudi table.
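For context, merge() is a test helper that is not reproduced in this comment. A plausible sketch, assuming the standard Hudi 0.x DataSource write path, a complex key generator for the three partition fields, and a hypothetical base path (all of these are assumptions, not taken from the PR):

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical reconstruction of the test helper; option keys follow the
// Hudi 0.x DataSourceWriteOptions API, the base path is made up.
def merge(df: DataFrame, parallelism: Int, database: String,
          tableName: String, tableType: String, op: String): Unit = {
  df.write.format("hudi")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, op)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2")
    .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
    .option("hoodie.upsert.shuffle.parallelism", parallelism.toString)
    .mode(SaveMode.Append)
    .save(s"/tmp/$database/$tableName") // hypothetical base path
}
```

With `hoodie.bulkinsert.shuffle.parallelism = 4`, the bulk_insert writes four base files, matching the layout described above.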
step2:
val df = spark.range(99999, 100002).toDF("keyid")
  .withColumn("col3", expr("keyid"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert table
merge(df, 4, "default", "hive_14b",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
Now the Hudi table contains four base files and one log file.
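The four-base-files-plus-one-log-file layout can be confirmed by listing the partition directory. The partition path below is hypothetical; the actual layout depends on the key generator and base path used when the table was created:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical base path and partition path; substitute the table's
// actual location before running.
val partitionPath = new Path("/tmp/default/hive_14b/0/0/7")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(partitionPath)
  .map(_.getPath.getName)
  .sorted
  .foreach(println)
// Expect four *.parquet base files and one .*.log.* delta log file
```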
step3:
In spark-sql/beeline:
select count(col3) from hive_14b_rt;
The query fails with a NullPointerException:
2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_000000_0:
Error: java.lang.NullPointerException
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
    at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
after patch:
In spark-sql/beeline:
select count(col3) from hive_14b_rt;
+---------+
|   _c0   |
+---------+
| 100002  |
+---------+
This is the expected count: the bulk insert wrote 100,000 rows (keyid 0 to 99999), and the upsert added keyids 100000 and 100001 while updating the existing keyid 99999, giving 100,002 rows in total.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]