[
https://issues.apache.org/jira/browse/HUDI-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar reassigned HUDI-4597:
------------------------------------
Assignee: sivabalan narayanan (was: Alexey Kudinkin)
> [GCP] 0 byte files appearing on GCS
> -----------------------------------
>
> Key: HUDI-4597
> URL: https://issues.apache.org/jira/browse/HUDI-4597
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Affects Versions: 0.10.1
> Reporter: Alexey Kudinkin
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 0.14.0
>
>
> During a recent troubleshooting session w/ Walmart folks we identified an
> issue where a spurious 0-byte file had appeared, dated {*}6m ago{*}.
> What's more, the behavior GCS exhibited around this file was very unusual:
> # GCS reported in file-listing RPC calls that the file exists (and it was
> visible from the UI Dashboard as well)
> # But any attempt to read the file would fail in the GCS client w/ a
> FileNotFoundException (below)
>
> Hudi 0.10.1
> Spark 2.4.8
> {code:java}
> java.io.FileNotFoundException: Item not found: 'gs://castar_audit_prod/storage/svccatl/castar_time_series/.hoodie/20210926193508.rollback.requested'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
> 2022-08-08 18:02:47,311 ERROR SplunkStreamListener gg-castar-audit-kafka2hdfs : |exception=java.io.FileNotFoundException
> java.io.FileNotFoundException: Item not found: 'gs://castar_audit_prod/storage/svccatl/castar_time_series/.hoodie/20210926193508.rollback.requested'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.getFileNotFoundException(GoogleCloudStorageExceptions.java:38)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:631)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.open(GoogleCloudStorageFileSystem.java:322)
> at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:77)
> at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:740)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:914)
> at org.apache.hudi.common.fs.HoodieWrapperFileSystem.open(HoodieWrapperFileSystem.java:459)
> at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:621)
> at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readRollbackInfoAsBytes(HoodieActiveTimeline.java:251)
> at org.apache.hudi.table.action.rollback.RollbackUtils.getRollbackPlan(RollbackUtils.java:70)
> at org.apache.hudi.client.AbstractHoodieWriteClient.getPendingRollbackInfos(AbstractHoodieWriteClient.java:911)
> at org.apache.hudi.client.AbstractHoodieWriteClient.rollbackFailedWrites(AbstractHoodieWriteClient.java:942)
> at org.apache.hudi.client.AbstractHoodieWriteClient.rollbackFailedWrites(AbstractHoodieWriteClient.java:932)
> at org.apache.hudi.client.AbstractHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(AbstractHoodieWriteClient.java:816)
> at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:143)
> at org.apache.hudi.client.AbstractHoodieWriteClient.startCommitWithTime(AbstractHoodieWriteClient.java:815)
> at org.apache.hudi.client.AbstractHoodieWriteClient.startCommitWithTime(AbstractHoodieWriteClient.java:808)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:276)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
> at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
> at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
> at com.jet.mystique.coldstorage.ColdStorageWriter.writeRecords(ColdStorageWriter.scala:92)
> at com.jet.mystique.gorillagrodd.GorillaGrodd$.writeHDFS(GorillaGrodd.scala:130)
> at com.jet.mystique.gorillagrodd.martlistener.HDFSDataWriter.write(MystiqueDataWriter.scala:69)
> at com.jet.mystique.gorillagrodd.martlistener.MartListener.process(MartListener.scala:76)
> at com.jet.mystique.gorillagrodd.martlistener.JobProcessor.runProjector(JobProcessor.scala:46)
> at com.jet.mystique.gorillagrodd.GorillaGroddProjectorSet.processRdd(GorillaGroddProjectorSet.scala:36)
> at com.jet.mystique.gorillagrodd.GorillaGroddProjectorSet$$anonfun$processDStream$1.apply(GorillaGroddProjectorSet.scala:56)
> at com.jet.mystique.gorillagrodd.GorillaGroddProjectorSet$$anonfun$processDStream$1.apply(GorillaGroddProjectorSet.scala:49)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
> at scala.util.Try$.apply(Try.scala:192)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
> at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
> at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
> at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2022-08-08 18:02:47,311 ERROR SplunkStreamListener gg-castar-audit-kafka2hdfs : There was an notConfiguredException while streaming. No failover configured. Failing hard{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)