[ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693537#comment-14693537 ]
Simeon Simeonov commented on SPARK-9345:
----------------------------------------
[~marmbrus] Yes, Michael: {{kill -9}} works some of the time. However, there
are types of OOM exceptions that keep {{spark-shell}} running but leave side
effects behind. One example I've discovered recently is that temporary folders
in the Hive managed table space in HDFS do not get cleaned up, which causes
exceptions when, say, {{saveAsTable}} with the same table name runs later.
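One possible stopgap, sketched below, is to remove the leftover {{_temporary}}
directory under the table's warehouse path before retrying the write. The table
name and warehouse location are placeholders; adjust them to your environment.
{code}
// Run from spark-shell; sc is the shell's SparkContext.
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder table location; substitute the actual warehouse path for your table.
val tableDir = new Path("/user/hive/warehouse/test")
val stale = new Path(tableDir, "_temporary")

// Resolve the file system (local or HDFS, depending on fs.defaultFS).
val fs = stale.getFileSystem(sc.hadoopConfiguration)

// Recursively remove the leftover attempt directories, if any.
if (fs.exists(stale)) fs.delete(stale, true)
{code}
This is only a workaround sketch; the underlying issue is that the failed write
should clean up after itself.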
> Failure to clean up on exceptions causes persistent I/O problems later on
> ------------------------------------------------------------------------
>
> Key: SPARK-9345
> URL: https://issues.apache.org/jira/browse/SPARK-9345
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 1.3.1, 1.4.0, 1.4.1
> Environment: Ubuntu on AWS
> Reporter: Simeon Simeonov
> Priority: Minor
>
> When using spark-shell in local mode, I've observed the following behavior on
> a number of nodes:
> # Some operation generates an exception related to Spark SQL processing via
> {{HiveContext}}.
> # From that point on, nothing can be written to Hive with {{saveAsTable}}.
> # Another identically configured version of Spark on the same machine may not
> exhibit the problem initially, but with enough exceptions it starts exhibiting
> it as well.
> # A new, identically configured installation of the same version on the same
> machine also exhibits the problem.
> The error is always related to an inability to create a temporary folder on HDFS:
> {code}
> 15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_000001_0 (exists=false, cwd=file:/home/ubuntu)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
>   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
>   at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
>   at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
>   at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ...
> {code}
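> For reference, the write that hits this is an ordinary {{saveAsTable}}. A
> minimal, hypothetical example (Spark 1.4.x API; the table name matches the
> trace above):
> {code}
> // Run from spark-shell; sqlContext is a HiveContext in a Hive-enabled build.
> val df = sqlContext.range(0, 10)
> // Fails with the Mkdirs IOException above once a stale _temporary dir is left behind.
> df.write.format("parquet").saveAsTable("test")
> {code}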
> The behavior does not seem related to HDFS as it persists even if the HDFS
> volume is reformatted.
> The behavior is difficult to reproduce on demand, but it shows up consistently
> with sufficient Spark SQL experimentation (dozens of exceptions arising from
> Spark SQL processing).
> The likelihood of this happening goes up substantially if some Spark SQL
> operation runs out of memory, which suggests
> that the problem is related to cleanup.
> This gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) shows how,
> on the same machine, identically configured 1.3.1 and 1.4.1 installations
> sharing the same HDFS file system and Hive metastore behave differently:
> 1.3.1 can write to Hive, while 1.4.1 cannot. The behavior started on 1.4.1
> after an out-of-memory exception on a large job.