Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Hi,

I'm running Spark in standalone mode: 1 master, 15 slaves. I started the cluster 
with the EC2 script, and I'm currently breaking the job into many small parts 
(~2,000) to better examine progress and failures.

Pretty basic - I'm submitting a PySpark job (via spark-submit) to the cluster. The 
job consists of loading a file from S3, performing minor parsing, and storing the 
results in an RDD. The results are then written out to Hadoop with saveAsTextFile.
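
For context, the job is roughly the following shape - a simplified sketch with a 
placeholder bucket, paths, and parse logic, not the real code:

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-parse-job")

    def parse_line(line):
        # "minor parsing": split the raw line and rejoin the fields (placeholder)
        return "\t".join(line.strip().split(","))

    raw = sc.textFile("s3n://my-bucket/input/some-file")            # placeholder S3 path
    parsed = raw.map(parse_line)
    parsed.saveAsTextFile("hdfs:///user/michael/output/part-0001")  # placeholder output path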

Unfortunately, it keeps crashing. A small number of the jobs fail - timeout 
errors, I believe - and over half of the jobs that fail succeed when they are 
re-run. Still, a task failing shouldn't crash the entire job: it should just 
retry up to four times, and then give up.
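
(For reference, the retry behaviour I mean is the spark.task.maxFailures setting, 
which defaults to 4 - the number of times a single task may fail before the whole 
job is given up on. A sketch of raising it, with an arbitrary value, not something 
I've actually tried:)

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-parse-job")
            # failures of any one task before the job is aborted (default 4)
            .set("spark.task.maxFailures", "8"))
    sc = SparkContext(conf=conf)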

However, the entire job does crash. I was wondering why, and my guess is that 
when a task is assigned to SPARK_MASTER and fails multiple times, it throws a 
SparkException and brings down the Spark master. If it were a slave, it would be 
OK - it could either re-register and continue, or not, but either way the job 
would run to completion.

I've run the job a few times now, and the point at which it crashes depends on 
when one of the failing tasks gets assigned to the master.

The short-term solution would be to exclude the master from running tasks, but I 
don't see that option. Does it exist? Can I exclude the master from accepting 
tasks in Spark standalone mode?

The long-term solution, of course, is figuring out which part of the job (or 
which file in S3) is causing the error and fixing it. But right now I'd just like 
to get the first results back, knowing I'll be missing 0.25% of the data.

Thanks,
Michael


RE: Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Never mind - I don't know what I was thinking below. It's just maxTaskFailures 
causing the job to fail.

From: Griffiths, Michael (NYC-RPM) [mailto:michael.griffi...@reprisemedia.com]
Sent: Monday, November 10, 2014 4:48 PM
To: user@spark.apache.org
Subject: Spark Master crashes job on task failure



PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Griffiths, Michael (NYC-RPM)
Hi,

I'm running into an error on Windows (x64, 8.1) running Spark 1.1.0 (pre-built 
for Hadoop 2.4: spark-1.1.0-bin-hadoop2.4.tgz, from 
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz) with Java SE 
Version 8 Update 20 (build 1.8.0_20-b26); I'm just getting started with Spark.

When running sc.wholeTextFiles() on a directory, the call itself succeeds, but I 
can't do anything with the resulting RDD - specifically, I get a 
py4j.protocol.Py4JJavaError; the error message is unspecified, though the 
location is included. I've attached the traceback below.

In this situation, I'm trying to load all files from a folder on the local 
filesystem, located at D:\testdata. The folder contains one file, which can be 
loaded successfully with sc.textFile('d:/testdata/filename') - no problems at 
all - so I do not believe the file itself is causing the error.

Is there any advice on what I should look at further to isolate or fix the 
error? Am I doing something obviously wrong?
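
One lead I haven't verified (an assumption based only on the Shell.runCommand 
frames in the trace): Hadoop's local-filesystem permission check appears to shell 
out on Windows and is said to need winutils.exe under %HADOOP_HOME%\bin, which 
might explain why wholeTextFiles fails where textFile doesn't. If that's it, the 
sketch below is what I'd try before creating the SparkContext - the path is a 
placeholder for wherever winutils.exe actually lives:

    import os

    # assumed location of a Hadoop directory containing bin\winutils.exe
    os.environ["HADOOP_HOME"] = r"C:\hadoop"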

Thanks,
Michael


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile('d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884')
>>> file.count()
732
>>> file.first()
u'<!DOCTYPE html>'
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
    return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
    at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
    at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Unknown Source)

>>> data.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 847, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "D:\spark\python\pyspark\rdd.py", line 838, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "D:\spark\python\pyspark\rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File