Vinoth Chandar created HUDI-468:
-----------------------------------

             Summary: Not an avro data file - error while archiving post rename() change
                 Key: HUDI-468
                 URL: https://issues.apache.org/jira/browse/HUDI-468
             Project: Apache Hudi (incubating)
          Issue Type: Bug
          Components: Write Client
            Reporter: Vinoth Chandar
            Assignee: Balaji Varadarajan
             Fix For: 0.5.1
Here are the steps to reproduce.

1. Build from the latest source:

    mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true

2. Write data:

    export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
    ${SPARK_HOME}/bin/spark-shell \
      --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

    import org.apache.spark.sql.SaveMode._

    var datas = List("{ \"name\": \"kenken\", \"ts\": 1574297893836, \"age\": 12, \"location\": \"latitude\"}")
    val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
    df.write.format("org.apache.hudi").
      option("hoodie.insert.shuffle.parallelism", "10").
      option("hoodie.upsert.shuffle.parallelism", "10").
      option("hoodie.delete.shuffle.parallelism", "10").
      option("hoodie.bulkinsert.shuffle.parallelism", "10").
      option("hoodie.datasource.write.recordkey.field", "name").
      option("hoodie.datasource.write.partitionpath.field", "location").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.table.name", "hudi_mor_table").
      mode(Overwrite).
      save("file:///tmp/hudi_mor_table")

3. Append data:

    df.write.format("org.apache.hudi").
      option("hoodie.insert.shuffle.parallelism", "10").
      option("hoodie.upsert.shuffle.parallelism", "10").
      option("hoodie.delete.shuffle.parallelism", "10").
      option("hoodie.bulkinsert.shuffle.parallelism", "10").
      option("hoodie.datasource.write.recordkey.field", "name").
      option("hoodie.datasource.write.partitionpath.field", "location").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.keep.max.commits", "5").
      option("hoodie.keep.min.commits", "4").
      option("hoodie.cleaner.commits.retained", "3").
      option("hoodie.table.name", "hudi_mor_table").
      mode(Append).
save("file:///tmp/hudi_mor_table") 4, Repeat about six times Append Data operation(above), will get the stackstrace 19/12/24 13:34:09 ERROR HoodieCommitArchiveLog: Failed to archive commits, .commit file: 20191224132942.clean.requested java.io.IOException: Not an Avro data file at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147) at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88) at [org.apache.hudi.io|http://org.apache.hudi.io/].HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294) at [org.apache.hudi.io|http://org.apache.hudi.io/].HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253) at [org.apache.hudi.io|http://org.apache.hudi.io/].HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122) at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562) at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523) at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) -- This message was sent by Atlassian Jira (v8.3.4#803005)