Hi Ewan,

Sorry it took a while for us to reply. I don't know spark-perf that well, but I think it would be problematic if it only works with a specific version of Hadoop. Maybe we can take a different approach: just have a bunch of tasks use the HDFS client API to read the data directly, without relying on input formats.
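For concreteness, here is a rough sketch of what I mean, reading files with the HDFS `FileSystem` client inside plain Spark tasks instead of going through an InputFormat. This is just an illustration, not tested code; the path list, buffer size, and method name `readBytes` are made up, and it assumes Spark and the Hadoop 2.x client jars are on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Read each file with the raw HDFS client API inside a Spark task,
// bypassing InputFormats entirely; returns total bytes read.
def readBytes(sc: SparkContext, paths: Seq[String]): Long = {
  sc.parallelize(paths, paths.size).map { p =>
    val fs = FileSystem.get(new Configuration())  // task-side HDFS client
    val in = fs.open(new Path(p))
    val buf = new Array[Byte](64 * 1024)
    var total = 0L
    var n = in.read(buf)
    while (n >= 0) { total += n; n = in.read(buf) }
    in.close()
    total
  }.reduce(_ + _)
}
```

Since this only touches `FileSystem`, it should be much less sensitive to which MapReduce-side classes a given Hadoop version ships.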
On Fri, Mar 6, 2015 at 1:41 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
> Hi all,
> I never heard from anyone on this and have received emails in private that
> people would like to add terasort to their spark-perf installs so it
> becomes part of their cluster validation checks.
>
> Yours,
> Ewan
>
> -------- Forwarded Message --------
> Subject: Spark-perf terasort WIP branch
> Date: Wed, 14 Jan 2015 14:33:45 +0100
> From: Ewan Higgs <ewan.hi...@ugent.be>
> To: dev@spark.apache.org <dev@spark.apache.org>
>
> Hi all,
> I'm trying to build the Spark-perf WIP code but there are some errors to
> do with Hadoop APIs. I presume this is because a Hadoop version is set
> somewhere and the build is referring to that, but I can't seem to find it.
>
> The errors are as follows:
>
> [info] Compiling 15 Scala sources and 2 Java sources to
> /home/ehiggs/src/spark-perf/spark-tests/target/scala-2.10/classes...
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraInputFormat.scala:40:
> object task is not a member of package org.apache.hadoop.mapreduce
> [error] import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> [error]                                     ^
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraInputFormat.scala:132:
> not found: type TaskAttemptContextImpl
> [error]     val context = new TaskAttemptContextImpl(
> [error]                       ^
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraScheduler.scala:37:
> object TTConfig is not a member of package org.apache.hadoop.mapreduce.server.tasktracker
> [error] import org.apache.hadoop.mapreduce.server.tasktracker.TTConfig
> [error]                                                       ^
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraScheduler.scala:91:
> not found: value TTConfig
> [error]     var slotsPerHost : Int = conf.getInt(TTConfig.TT_MAP_SLOTS, 4)
> [error]                                          ^
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraSortAll.scala:7:
> value run is not a member of org.apache.spark.examples.terasort.TeraGen
> [error]     tg.run(Array[String]("10M", "/tmp/terasort_in"))
> [error]        ^
> [error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraSortAll.scala:9:
> value run is not a member of org.apache.spark.examples.terasort.TeraSort
> [error]     ts.run(Array[String]("/tmp/terasort_in", "/tmp/terasort_out"))
> [error]        ^
> [error] 6 errors found
> [error] (compile:compile) Compilation failed
> [error] Total time: 13 s, completed 05-Jan-2015 12:21:47
>
> I can build the same code if it's in the Spark tree using the following
> command:
>
> mvn -Dhadoop.version=2.5.0 -DskipTests=true install
>
> Is there a way I can convince spark-perf to build this code with the
> appropriate Hadoop library version? I tried to apply the following to
> spark-tests/project/SparkTestsBuild.scala, but it didn't seem to work as
> I expected:
>
> $ git diff project/SparkTestsBuild.scala
> diff --git a/spark-tests/project/SparkTestsBuild.scala b/spark-tests/project/SparkTestsBuild.scala
> index 4116326..4ed5f0c 100644
> --- a/spark-tests/project/SparkTestsBuild.scala
> +++ b/spark-tests/project/SparkTestsBuild.scala
> @@ -16,7 +16,9 @@ object SparkTestsBuild extends Build {
>          "org.scalatest" %% "scalatest" % "2.2.1" % "test",
>          "com.google.guava" % "guava" % "14.0.1",
>          "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
> -        "org.json4s" %% "json4s-native" % "3.2.9"
> +        "org.json4s" %% "json4s-native" % "3.2.9",
> +        "org.apache.hadoop" % "hadoop-common" % "2.5.0",
> +        "org.apache.hadoop" % "hadoop-mapreduce" % "2.5.0"
>        ),
>        test in assembly := {},
>        outputPath in assembly := file("target/spark-perf-tests-assembly.jar"),
> @@ -36,4 +38,4 @@ object SparkTestsBuild extends Build {
>        case _ => MergeStrategy.first
>      }
>    ))
> -}
> \ No newline at end of file
> +}
>
> Yours,
> Ewan
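One note on the quoted diff, in case it helps: the missing classes (`org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl`, `org.apache.hadoop.mapreduce.server.tasktracker.TTConfig`) only exist in the Hadoop 2.x MapReduce client jars, and `hadoop-mapreduce` is not a published Maven artifact, so that dependency line would not resolve. A sketch of what the dependency block might look like instead (the artifact names are the Hadoop 2.5.0 Maven coordinates; whether this alone fixes the spark-perf build is untested):

```scala
// Hypothetical SparkTestsBuild.scala fragment. hadoop-mapreduce-client-core
// (not "hadoop-mapreduce") is the artifact that ships TaskAttemptContextImpl
// and TTConfig in Hadoop 2.x.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % "2.5.0",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.0"
)
```

That would at least match what `mvn -Dhadoop.version=2.5.0` gives you inside the Spark tree, where the same code compiles.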