RushabhK opened a new issue, #9801:
URL: https://github.com/apache/incubator-gluten/issues/9801
### Backend
VL (Velox)
### Bug description
Hello team, I am using Gluten 1.3.0 compiled with JDK 17 on Ubuntu 20.04.
I ingested Parquet files with a Spark job running Gluten. The job succeeded overall, but one task failed and was retried successfully.
However, when reading the output folder, I found one file that is corrupt, with the following exception:
`OSError: Could not open parquet input source
'gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet': Invalid:
Parquet magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.`
This is a 300 MB file, and I verified with parquet-tools that it is not a valid Parquet file.
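The footer check that produced this error can be reproduced outside Spark. Below is a minimal sketch (not the Parquet library's actual code) of the invariant the reader enforces: a valid Parquet file must end with the 4-byte magic `PAR1`, whereas the corrupt file here ends with `[2, 0, 0, 0]`.

```python
# Minimal sketch: check whether a file carries the Parquet footer magic.
# A valid Parquet file ends (and also begins) with the 4 bytes b"PAR1".
import os

PARQUET_MAGIC = b"PAR1"

def has_parquet_footer(path: str) -> bool:
    """Return True if the last 4 bytes of `path` are the Parquet magic."""
    if os.path.getsize(path) < len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

Running this against the suspect file (after copying it locally) would confirm whether it is merely truncated, which is what a partially written task-attempt file would look like.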
However, when I read the path with `spark.read.option("ignoreCorruptFiles", "true").parquet(path)`, I get exactly the data that is expected (I validated the checksum of the resulting dataframe against the expectation, and it passed).
I suspect that the failed task attempt copied the partial Parquet file it had written in the temp location into the final location as part of the commit, which should not have happened.
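To make the suspicion above concrete, here is a toy simulation (hypothetical names; not Hadoop, Gluten, or ManifestCommitter code) of the general hazard: if a failed attempt's partially written file is ever promoted to the final directory, whether through a direct-write path or incomplete cleanup, it ends up sitting next to the retry's complete file, which is exactly the shape of the reported corruption.

```python
# Toy simulation (not Hadoop/Gluten code): a partially written file from a
# failed task attempt survives alongside the retry's output if the commit
# path promotes attempt files without cleaning up after failures.
import os
import shutil
import tempfile

def run_attempt(attempt_dir, final_dir, data, fail_midway):
    """Write task output into attempt_dir, then 'commit' it to final_dir."""
    os.makedirs(attempt_dir, exist_ok=True)
    part = os.path.join(attempt_dir, "part-0.parquet")
    with open(part, "wb") as f:
        # A failed attempt only manages to write half of its data.
        f.write(data[: len(data) // 2] if fail_midway else data)
    if fail_midway:
        # Buggy behavior being simulated: the partial file is promoted to
        # the final directory even though the attempt failed.
        shutil.move(part, os.path.join(final_dir, "attempt1-part-0.parquet"))
        return False
    shutil.move(part, os.path.join(final_dir, "attempt2-part-0.parquet"))
    return True

final_dir = tempfile.mkdtemp()
data = b"row-data" * 4 + b"PAR1"  # pretend footer magic at the tail
run_attempt(tempfile.mkdtemp(), final_dir, data, fail_midway=True)
run_attempt(tempfile.mkdtemp(), final_dir, data, fail_midway=False)
# final_dir now holds a truncated file (no footer magic) plus the complete
# retry output, so a footer-validating reader rejects one of the two files.
```

This also explains why `ignoreCorruptFiles` recovers the full dataset: the retry's complete file carries all the rows, so dropping the truncated leftover loses nothing.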
I am using the Manifest committer with file output commit algorithm version 2. The relevant configs:
```bash
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true \
--conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```
Am I missing some Gluten configs during Parquet ingestion that would prevent this copy? Also, does Gluten honor the committer specified in the Spark conf?
I do not face any such issues with vanilla Spark for the same runs.
### Gluten version
Gluten-1.3
### Spark version
Spark-3.5.x
### Spark configurations
```bash
--conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
--conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true \
--conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=10G \
--conf spark.executor.memoryOverhead=6G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=35G \
--conf spark.driver.core=4 \
--conf spark.driver.memoryOverhead=1500M \
--conf spark.network.timeout=900s \
--conf spark.driver.maxResultSize=4G \
--conf spark.executor.instances=12 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.hadoop.hive.exec.dynamic.partition=true \
--conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
--conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
--conf spark.gluten.sql.complexType.scan.fallback.enabled=false \
--conf spark.gluten.sql.columnar.maxBatchSize=4096 \
--conf spark.gluten.sql.columnar.tableCache=true \
--conf spark.gluten.memory.isolation=true \
--conf spark.cleaner.periodicGC.interval="3min" \
--conf spark.gluten.sql.columnar.shuffle.codec="zstd" \
--conf spark.executorEnv.LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
```
### System information
_No response_
### Relevant logs
```bash
Caused by: org.apache.spark.SparkException: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: gs://some_path/gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet. Please ensure that the file is in either ORC or Parquet format. If not, please convert it to a valid format. If the file is in the valid format, please check if it is corrupt. If it is, you can choose to either ignore it or fix the corruption.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFooterForFileError(QueryExecutionErrors.scala:1057)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:456)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:384)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.lang.RuntimeException: gs://some_path/gluten-part-239f94b0-2c93-4b5d-a8a7-3be9a9e05dac.zstd.parquet is not a Parquet file. Expected magic number at tail, but found [2, 0, 0, 0]
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:76)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:450)
```