[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR

Emil Kleszcz (Jira) Tue, 15 Nov 2022 03:35:35 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634320#comment-17634320
 ]


Emil Kleszcz commented on PARQUET-632:
--------------------------------------

The same issue was observed with Spark 3.2.1 when writing a Parquet file to 
HDFS using Parquet 1.12.2. 

Some of the server-side HDFS logs from the NameNode (NN):
{code:java}
<2022-11-14T08:12:31.941+0100> <INFO> <org.apache.hadoop.hdfs.StateChange>: 
<BLOCK* allocate blk_5112663052_4039631919, replicas=xxx:1004, xxx:1004, 
xxx:1004 for 
/project//part-00003-a1dc7884-5736-437f-a573-35c8d0ea922f.c000.snappy.parquet>
<2022-11-14T08:13:43.029+0100> <INFO> <org.apache.hadoop.hdfs.StateChange>: 
<BLOCK* allocate blk_5112664776_4039633643, replicas=xxx:1004, xxx:1004, 
xxx:1004 for /project/437f-a573-35c8d0ea922f.c000.snappy.parquet>
<2022-11-14T08:14:51.025+0100> <INFO> <org.apache.hadoop.ipc.Server>: <IPC 
Server handler 124 on default port 8020, call Call#64349 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:45140: 
java.io.FileNotFoundException: File does not exist: 
/project/part-00003-a1dc7884-5736-437f-a573-35c8d0ea922f.c000.snappy.parquet 
(inode 7450217277) Holder DFSClient_NONMAPREDUCE_-885269500_210 does not have 
any open files.>{code}

DataNode (DN) logs for one of the file's blocks:
 
{code:java}
<2022-11-14T08:12:31.945+0100> <INFO> 
<org.apache.hadoop.hdfs.server.datanode.DataNode>: <Receiving 
BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919 src: /xxx:43856 dest: 
/xxx:1004>
<2022-11-14T08:13:43.028+0100> <INFO> 
<org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace>: <src: 
/xxx:43856, dest: /xxx:1004, bytes: 268435456, op: HDFS_WRITE, cliID: 
DFSClient_NONMAPREDUCE_-885269500_210, offset: 0, srvID: 
e20a1db9-5297-41b3-a49a-a6dc6a2169b6, blockid: 
BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919, duration(ns): 
71075002589>
<2022-11-14T08:13:43.028+0100> <INFO> 
<org.apache.hadoop.hdfs.server.datanode.DataNode>: <PacketResponder: 
BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[xxx:1004, xxx:1004] 
terminating>
<2022-11-14T08:15:01.095+0100> <INFO> 
<org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService>:
 <Scheduling blk_5112663052_4039631919 replica FinalizedReplica, 
blk_5112663052_4039631919, FINALIZED
<2022-11-14T08:15:01.157+0100> <INFO> 
<org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService>:
 <Deleted BP-1877419787-xxx-1507645851253 blk_5112663052_4039631919 URI 
file:/data18/hdfs/data/current/BP-1877419787-xxx-1507645851253/current/finalized/subdir29/subdir12/blk_5112663052>{code}
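
As a rough cross-check of the timings in the logs above, here is a minimal Python sketch that pulls the byte count and duration out of the DN clienttrace entry and measures the gap between the first block allocation and the failed addBlock call on the NN. The helper names (parse_clienttrace, write_stats, seconds_between) are made up for illustration and are not part of any Hadoop tooling; the values are copied from the logs as posted, and the parsing assumes the "key: value, key: value" layout seen in this one entry.

```python
from datetime import datetime

# Clienttrace payload copied from the DN log above (hosts are redacted as "xxx").
CLIENTTRACE = (
    "src: /xxx:43856, dest: /xxx:1004, bytes: 268435456, op: HDFS_WRITE, "
    "cliID: DFSClient_NONMAPREDUCE_-885269500_210, offset: 0, "
    "srvID: e20a1db9-5297-41b3-a49a-a6dc6a2169b6, "
    "blockid: BP-1877419787-xxx-1507645851253:blk_5112663052_4039631919, "
    "duration(ns): 71075002589"
)

# Timestamp layout used by the NN and DN log lines in this comment.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_clienttrace(entry):
    """Split a 'key: value, key: value' clienttrace payload into a dict."""
    fields = {}
    for part in entry.split(", "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def write_stats(fields):
    """Return (size in MiB, duration in seconds, MiB per second)."""
    mib = int(fields["bytes"]) / 2**20
    seconds = int(fields["duration(ns)"]) / 1e9
    return mib, seconds, mib / seconds


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return delta.total_seconds()


if __name__ == "__main__":
    mib, seconds, rate = write_stats(parse_clienttrace(CLIENTTRACE))
    print(f"block write: {mib:.0f} MiB in {seconds:.1f} s ({rate:.2f} MiB/s)")
    # First NN "allocate" entry vs. the NN addBlock FileNotFoundException entry.
    gap = seconds_between("2022-11-14T08:12:31.941+0100",
                          "2022-11-14T08:14:51.025+0100")
    print(f"first allocate -> addBlock failure: {gap:.3f} s")
```

This only restates what the log lines already contain: the first block took about 71 seconds to write, and the addBlock failure came a little over two minutes after the first allocation. It does not by itself explain why the file no longer existed on the NN at that point.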

> Parquet file in invalid state while writing to S3 from EMR
> ----------------------------------------------------------
>
>                 Key: PARQUET-632
>                 URL: https://issues.apache.org/jira/browse/PARQUET-632
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.7.0
>            Reporter: Peter Halliday
>            Priority: Blocker
>
> I'm writing parquet to S3 from Spark 1.6.1 on EMR.  And when it got to the 
> last few files to write to S3, I received this stacktrace in the log with no 
> other errors before or after it.  It's very consistent.  This particular 
> batch keeps erroring the same way.
> {noformat}
> 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager 
> [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, 
> ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed 
> while writing rows.
>       at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>       at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:89)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid 
> state. Probably caused by an error thrown previously. Current state: COLUMN
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>       at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
>       at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
>       at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
>       ... 8 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
