RushabhK opened a new issue, #9801:
URL: https://github.com/apache/incubator-gluten/issues/9801
### Backend
VL (Velox)
### Bug description
Hello team, I am using Gluten 1.3.0 compiled with JDK 17 on Ubuntu 20.04.
I ingested Parquet files with a Spark job running Gluten. The job succeeded overall, but one task failed and was retried successfully.
However, when reading the output folder, I found one file that is corrupt, with the following exception:
`OSError: Could not open parquet input source
'gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet': Invalid:
Parquet magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.`
This is a 300 MB file, and I verified with parquet-tools that it is not a valid Parquet file.
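The footer check that produced this error can be reproduced outside Spark. Below is a minimal sketch (not the Parquet library's actual code) of the invariant the reader enforces: a valid Parquet file must end with the 4-byte magic `PAR1`, whereas the corrupt file here ends with `[2, 0, 0, 0]`.

```python
# Minimal sketch: check whether a file carries the Parquet footer magic.
# A valid Parquet file ends (and also begins) with the 4 bytes b"PAR1".
import os

PARQUET_MAGIC = b"PAR1"

def has_parquet_footer(path: str) -> bool:
    """Return True if the last 4 bytes of `path` are the Parquet magic."""
    if os.path.getsize(path) < len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

Running this against the suspect file (after copying it locally) would confirm whether it is merely truncated, which is what a partially written task-attempt file would look like.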
However, when I read the path with `spark.read.option("ignoreCorruptFiles", "true").parquet(path)`, I get exactly the data that is expected (I validated the checksum of the resulting dataframe against the expectation, and it passed).
I suspect that the failed task attempt copied the partial Parquet file it had written in the temp location into the final location as part of the commit, which should not have happened.
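To make the suspicion above concrete, here is a toy simulation (hypothetical names; not Hadoop, Gluten, or ManifestCommitter code) of the general hazard: if a failed attempt's partially written file is ever promoted to the final directory, whether through a direct-write path or incomplete cleanup, it ends up sitting next to the retry's complete file, which is exactly the shape of the reported corruption.

```python
# Toy simulation (not Hadoop/Gluten code): a partially written file from a
# failed task attempt survives alongside the retry's output if the commit
# path promotes attempt files without cleaning up after failures.
import os
import shutil
import tempfile

def run_attempt(attempt_dir, final_dir, data, fail_midway):
    """Write task output into attempt_dir, then 'commit' it to final_dir."""
    os.makedirs(attempt_dir, exist_ok=True)
    part = os.path.join(attempt_dir, "part-0.parquet")
    with open(part, "wb") as f:
        # A failed attempt only manages to write half of its data.
        f.write(data[: len(data) // 2] if fail_midway else data)
    if fail_midway:
        # Buggy behavior being simulated: the partial file is promoted to
        # the final directory even though the attempt failed.
        shutil.move(part, os.path.join(final_dir, "attempt1-part-0.parquet"))
        return False
    shutil.move(part, os.path.join(final_dir, "attempt2-part-0.parquet"))
    return True

final_dir = tempfile.mkdtemp()
data = b"row-data" * 4 + b"PAR1"  # pretend footer magic at the tail
run_attempt(tempfile.mkdtemp(), final_dir, data, fail_midway=True)
run_attempt(tempfile.mkdtemp(), final_dir, data, fail_midway=False)
# final_dir now holds a truncated file (no footer magic) plus the complete
# retry output, so a footer-validating reader rejects one of the two files.
```

This also explains why `ignoreCorruptFiles` recovers the full dataset: the retry's complete file carries all the rows, so dropping the truncated leftover loses nothing.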
I am using the Manifest committer with file output commit algorithm version 2. The relevant configs:
```bash
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true \
--conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```
Am I missing some Gluten configs during Parquet ingestion that would prevent this copy? Also, does Gluten honor the committer specified in the Spark conf?
I do not face any such issues with vanilla Spark for the same runs.
### Gluten version
Gluten-1.3
### Spark version
Spark-3.5.x
### Spark configurations
```bash
--conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
--conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true \
--conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=10G \
--conf spark.executor.memoryOverhead=6G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=35G \
--conf spark.driver.core=4 \
--conf spark.driver.memoryOverhead=1500M \
--conf spark.network.timeout=900s \
--conf spark.driver.maxResultSize=4G \
--conf spark.executor.instances=12 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.hadoop.hive.exec.dynamic.partition=true \
--conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
--conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
--conf spark.gluten.sql.complexType.scan.fallback.enabled=false \
--conf spark.gluten.sql.columnar.maxBatchSize=4096 \
--conf spark.gluten.sql.columnar.tableCache=true \
--conf spark.gluten.memory.isolation=true \
--conf spark.cleaner.periodicGC.interval="3min" \
--conf spark.gluten.sql.columnar.shuffle.codec="zstd" \
--conf spark.executorEnv.LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
```
### System information
_No response_
### Relevant logs
```bash
Caused by: org.apache.spark.SparkException: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: gs://some_path/gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet. Please ensure that the file is in either ORC or Parquet format. If not, please convert it to a valid format. If the file is in the valid format, please check if it is corrupt. If it is, you can choose to either ignore it or fix the corruption.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFooterForFileError(QueryExecutionErrors.scala:1057)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:456)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:384)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.lang.RuntimeException: gs://some_path/gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet is not a Parquet file. Expected magic number at tail, but found [2, 0, 0, 0]
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
```