[ https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634320#comment-17634320 ]
Emil Kleszcz commented on PARQUET-632:
--------------------------------------

The same issue was observed in spark 3.2.1 when writing a parquet file to HDFS using parquet v. 1.12.2.

Some of the logs from the server side in HDFS, NN:
{code:java}
<2022-11-14T08:12:31.941+0100> <INFO> <org.apache.hadoop.hdfs.StateChange>: <BLOCK* allocate blk_5112663052_4039631919, replicas=xxx:1004, xxx:1004, xxx:1004 for /project//part-00003-a1dc7884-5736-437f-a573-35c8d0ea922f.c000.snappy.parquet>
<2022-11-14T08:13:43.029+0100> <INFO> <org.apache.hadoop.hdfs.StateChange>: <BLOCK* allocate blk_5112664776_4039633643, replicas=xxx:1004, xxx:1004, xxx:1004 for /project/437f-a573-35c8d0ea922f.c000.snappy.parquet>
<2022-11-14T08:14:51.025+0100> <INFO> <org.apache.hadoop.ipc.Server>: <IPC Server handler 124 on default port 8020, call Call#64349 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:45140: java.io.FileNotFoundException: File does not exist: /project/part-00003-a1dc7884-5736-437f-a573-35c8d0ea922f.c000.snappy.parquet (inode 7450217277) Holder DFSClient_NONMAPREDUCE_-885269500_210 does not have any open files.>
{code}

DN logs for one of the file blocks:
{code:java}
<2022-11-14T08:12:31.945+0100> <INFO> <org.apache.hadoop.hdfs.server.datanode.DataNode>: <Receiving BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919 src: /xxx:43856 dest: /xxx:1004>
<2022-11-14T08:13:43.028+0100> <INFO> <org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace>: <src: /xxx:43856, dest: /xxx:1004, bytes: 268435456, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-885269500_210, offset: 0, srvID: e20a1db9-5297-41b3-a49a-a6dc6a2169b6, blockid: BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919, duration(ns): 71075002589>
<2022-11-14T08:13:43.028+0100> <INFO> <org.apache.hadoop.hdfs.server.datanode.DataNode>: <PacketResponder: BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[xxx:1004, xxx:1004] terminating>
<2022-11-14T08:15:01.095+0100> <INFO> <org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService>: <Scheduling blk_5112663052_4039631919 replica FinalizedReplica, blk_5112663052_4039631919, FINALIZED
<2022-11-14T08:15:01.157+0100> <INFO> <org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService>: <Deleted BP-1877419787-xxx-1507645851253 blk_5112663052_4039631919 URI file:/data18/hdfs/data/current/BP-1877419787-xxx-1507645851253/current/finalized/subdir29/subdir12/blk_5112663052>
{code}

> Parquet file in invalid state while writing to S3 from EMR
> ----------------------------------------------------------
>
>                 Key: PARQUET-632
>                 URL: https://issues.apache.org/jira/browse/PARQUET-632
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.7.0
>            Reporter: Peter Halliday
>            Priority: Blocker
>
> I'm writing parquet to S3 from Spark 1.6.1 on EMR. And when it got to the
> last few files to write to S3, I received this stacktrace in the log with no
> other errors before or after it. It's very consistent. This particular
> batch keeps erroring the same way.
> {noformat}
> 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed while writing rows.
>     at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
>     at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
>     at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
>     at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
>     at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>     at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
>     at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
>     ... 8 more
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)