[ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693537#comment-14693537 ]
Simeon Simeonov commented on SPARK-9345:
----------------------------------------
[~marmbrus] Yes, Michael: {{kill -9}} works some of the time. However, there
are types of OOM exceptions that keep {{spark-shell}} running but leave side
effects behind. One example I've discovered recently is that temporary folders
in the Hive managed table space in HDFS do not get cleaned up, which causes
exceptions when, say, {{saveAsTable}} with the same table name runs later.
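One possible stopgap, sketched below, is to remove the leftover {{_temporary}}
directory under the table's warehouse path before retrying the write. The table
name and warehouse location are placeholders; adjust them to your environment.
{code}
// Run from spark-shell; sc is the shell's SparkContext.
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder table location; substitute the actual warehouse path for your table.
val tableDir = new Path("/user/hive/warehouse/test")
val stale = new Path(tableDir, "_temporary")

// Resolve the file system (local or HDFS, depending on fs.defaultFS).
val fs = stale.getFileSystem(sc.hadoopConfiguration)

// Recursively remove the leftover attempt directories, if any.
if (fs.exists(stale)) fs.delete(stale, true)
{code}
This is only a workaround sketch; the underlying issue is that the failed write
should clean up after itself.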
> Failure to clean up on exceptions causes persistent I/O problems later on
> ------------------------------------------------------------------------
>
> Key: SPARK-9345
> URL: https://issues.apache.org/jira/browse/SPARK-9345
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 1.3.1, 1.4.0, 1.4.1
> Environment: Ubuntu on AWS
> Reporter: Simeon Simeonov
> Priority: Minor
>
> When using spark-shell in local mode, I've observed the following behavior on
> a number of nodes:
> # Some operation generates an exception related to Spark SQL processing via
> {{HiveContext}}.
> # From that point on, nothing can be written to Hive with {{saveAsTable}}.
> # Another identically configured version of Spark on the same machine may not
> exhibit the problem initially, but with enough exceptions it starts exhibiting
> it as well.
> # A new, identically configured installation of the same version on the same
> machine also exhibits the problem.
> The error is always related to an inability to create a temporary folder on HDFS:
> {code}
> 15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_000001_0 (exists=false, cwd=file:/home/ubuntu)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
>   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
>   at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
>   at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
>   at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ...
> {code}
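> For reference, the write that hits this is an ordinary {{saveAsTable}}. A
> minimal, hypothetical example (Spark 1.4.x API; the table name matches the
> trace above):
> {code}
> // Run from spark-shell; sqlContext is a HiveContext in a Hive-enabled build.
> val df = sqlContext.range(0, 10)
> // Fails with the Mkdirs IOException above once a stale _temporary dir is left behind.
> df.write.format("parquet").saveAsTable("test")
> {code}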
> The behavior does not seem related to HDFS as it persists even if the HDFS
> volume is reformatted.
> The behavior is difficult to reproduce on demand, but it shows up consistently
> with sufficient Spark SQL experimentation (dozens of exceptions arising from
> Spark SQL processing).
> The likelihood of this happening goes up substantially if some Spark SQL
> operation runs out of memory, which suggests
> that the problem is related to cleanup.
> This gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) shows how,
> on the same machine, identically configured 1.3.1 and 1.4.1 installations
> sharing the same HDFS file system and Hive metastore behave differently:
> 1.3.1 can write to Hive, while 1.4.1 cannot. The behavior started on 1.4.1
> after an out-of-memory exception on a large job.