Spark S3 LZO input files

2014-07-03 Thread hassan
I'm trying to read input files from S3. The files are compressed using LZO.
i.e. from spark-shell:

sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?', i.e. the raw bytes of the LZO file header.

Spark does not uncompress the data from the file. I am using Cloudera
Manager 5 with CDH 5.0.2. I've already installed the 'GPLEXTRAS' parcel and
have included 'opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar'
and '/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/' in
SPARK_CLASS_PATH. What am I missing?
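
For reference, a rough sketch (not from this thread; just an approach commonly used with
hadoop-lzo) of reading the file through LzoTextInputFormat instead of plain textFile,
assuming the GPLEXTRAS jar and native libraries are visible to the executors:

// hypothetical spark-shell snippet; LzoTextInputFormat comes from the hadoop-lzo jar
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("s3n://path/xx.lzo")
  .map(_._2.toString)   // drop the byte-offset key, keep the decompressed line
println(lines.first())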





Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Thunder,

Please understand that both MLlib and breeze are under active
development. Before v1.0, we used jblas, but in the public APIs we only
exposed Array[Double]. In v1.0, we introduced Vector, which supports
both dense and sparse data, and switched the backend to
breeze/netlib-java (except ALS). We only used a few breeze methods in
our implementation and we benchmarked them one by one. It was hard to
foresee problems caused by including breeze at that time, for example,
https://issues.apache.org/jira/browse/SPARK-1520. Being conservative
in v1.0 was not a bad choice. We should benchmark breeze v0.8.1 for
v1.1 and consider making toBreeze a developer API. For now, if you are
migrating code from v0.9, you can use `Vector.toArray` to get the
value array. Sorry for the inconvenience!
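
As a reference, a minimal sketch (assuming Spark 1.0 MLlib and breeze on the classpath) of
the "pimp class" workaround mentioned below, which re-implements the private toBreeze
conversion so vector math can be done outside org.apache.spark.mllib:

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

object VectorOps {
  // mirrors what the private[mllib] toBreeze does for dense and sparse vectors
  implicit class RichVector(v: Vector) {
    def breeze: BV[Double] = v match {
      case d: DenseVector  => new BDV[Double](d.values)
      case s: SparseVector => new BSV[Double](s.indices, s.values, s.size)
    }
  }
}

// usage: import VectorOps._ ; val sum = v1.breeze + v2.breeze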

Best,
Xiangrui

On Wed, Jul 2, 2014 at 2:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 in my humble opinion Spark should've supported linalg a-la [1] before it
 even started dumping methodologies into mllib.

 [1] http://mahout.apache.org/users/sparkbindings/home.html


 On Wed, Jul 2, 2014 at 2:16 PM, Thunder Stumpges
 thunder.stump...@gmail.com wrote:

 Thanks. I always hate having to do stuff like this. It seems like they
 went a bit overboard with all the private[mllib] declarations... possibly
 all in the name of thou shalt not change your public API. If you don't
 make your public API usable, we end up having to work around it anyway...

 Oh well.

 Thunder



 On Wed, Jul 2, 2014 at 2:05 PM, Koert Kuipers ko...@tresata.com wrote:

 i did the second option: re-implemented .toBreeze as .breeze using pimp
 classes


 On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges
 thunder.stump...@gmail.com wrote:

 I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of
 code working with internals of MLLib. One of the big changes was the move
 from the old jblas.Matrix to the Vector/Matrix classes included in MLLib.

 However I don't see how we're supposed to use them for ANYTHING other
 than a container for passing data to the included APIs... how do we do any
 math on them? Looking at the internal code, there are quite a number of
 private[mllib] declarations including access to the Breeze representations
 of the classes.

 Was there a good reason this was not exposed? I could see maybe not
 wanting to expose the 'toBreeze' function which would tie it to the breeze
 implementation, however it would be nice to have the various mathematics
 wrapped at least.

 Right now I see no way to code any vector/matrix math without moving my
 code namespaces into org.apache.spark.mllib or duplicating the code in
 'toBreeze' in my own util functions. Not very appealing.

 What are others doing?
 thanks,
 Thunder






Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Dmitriy,

It is sweet to have the bindings, but it is very easy to downgrade the
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
linear algebra. I'm thinking about whether it is possible to make the
evaluation lazy in bindings. For example,

y += a * x

can be translated to an AXPY call instead of creating a temporary
vector for a*x. There was some work on this in C++, but none achieved good
performance. I'm not sure whether this is a good direction to explore.
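
For illustration, a small sketch using the netlib-java BLAS that MLlib depends on, showing
the temporary-free AXPY form of y += a * x next to the naive evaluation:

import com.github.fommil.netlib.BLAS.{getInstance => blas}

val a = 2.0
val x = Array(1.0, 2.0, 3.0)
val y1 = Array(0.5, 0.5, 0.5)
val y2 = y1.clone()

// naive y += a * x: materializes a temporary vector for a * x, then a second pass to add it
val tmp = x.map(_ * a)
for (i <- y1.indices) y1(i) += tmp(i)

// AXPY form: updates y2 in place with a single BLAS call, no temporary
blas.daxpy(x.length, a, x, 1, y2, 1)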

Best,
Xiangrui


Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-03 Thread Sunita Arvind
That's good to know. I will try it out.

Thanks Romain

On Friday, June 27, 2014, Romain Rigaux romain.rig...@gmail.com wrote:

 So far Spark Job Server does not work with Spark 1.0:
 https://github.com/ooyala/spark-jobserver

 So this works only with Spark 0.9 currently:

 http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/

 Romain





 On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind sunitarv...@gmail.com wrote:

 Hello Experts,

 I am attempting to integrate the Spark Editor with Hue on CDH 5.0.1. I have
 the Spark installation built manually from the sources for Spark 1.0.0. I am
 able to integrate this with Cloudera Manager.

 Background:
 ---
 We have a 3-node VM cluster with CDH 5.0.1.
 We required Spark 1.0.0 due to some features in it, so I did a

  yum remove spark-core spark-master spark-worker spark-python

  of the default spark0.9.0 and compiled spark1.0.0 from source:

 Downloaded the spark-trunk from

 git clone https://github.com/apache/spark.git
 cd spark
 SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt assembly

  The spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar was built and spark by
 itself seems to work well. I was even able to run a text file count.

 Current attempt:
 
 Referring to this article -
 http://gethue.com/a-new-spark-web-ui-spark-app/
 Now I am trying to add the Spark editor to Hue. AFAIK, this requires
 git clone https://github.com/ooyala/spark-jobserver.git
 cd spark-jobserver
 sbt
 re-start

 This was successful after a lot of struggle with the proxy settings.
 However, is this the job server itself? Does that mean the job server has
 to be started manually? I intend to have the Spark editor show up in the Hue
 web UI and I am nowhere close. Can someone please help?

 Note, the 3 VMs are Linux CentOS. Not sure if setting something like the
 following can be expected to work:

 [desktop]
 app_blacklist=


 Also, I have made the changes to
 ./job-server/src/main/resources/application.conf as recommended; however,
 I do not expect this to impact Hue in any way.

 Also, I intend to let the editor stay available, not spawn it every time
 it is required.


 Thanks in advance.

 regards





Re: Run spark unit test on Windows 7

2014-07-03 Thread Konstantin Kudryavtsev
It sounds really strange...

I guess it is a bug, a critical bug that must be fixed... at least some flag
should be added (e.g. to disable the Hadoop dependency).

I found the following workaround:
1) download a compiled winutils.exe from
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) put this file into d:\winutil\bin
3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")

After that, the test runs.

Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote:

 You don't actually need it per se - it's just that some of the Spark
 libraries are referencing Hadoop libraries even if they ultimately do not
 call them. When I was doing some early builds of Spark on Windows, I
 admittedly had Hadoop on Windows running as well and had not run into this
 particular issue.



 On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 No, I don't

 why do I need to have HDP installed? I don't use Hadoop at all and I'd
 like to read data from the local filesystem.

 On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:

 By any chance do you have HDP 2.1 installed? you may need to install the
 utils and update the env variables per
 http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows


 On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hi Andrew,

 it's Windows 7 and I didn't set up any env variables here

 The full stack trace:

 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in
 the hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in
 the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
 at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
 at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
 at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
 at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
 at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
 at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
  at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
  at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
  at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
  at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
  at
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
 at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
  at
 com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
 at
 com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
  at
 com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


 Thank you,
 Konstantin Kudryavtsev


 On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote:

 Hi Konstatin,

 We use hadoop as a library in a few places in Spark. I wonder why the
 path includes null though.

 Could you provide the full stack trace?

 Andrew


 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com:

 Hi all,

 I'm trying to run some transformation on *Spark*, it works 

Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
Do you have an sbt directory inside your spark directory?

Thanks
Best Regards


On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com wrote:

 Hi,
I'm trying to install spark 1 on my hadoop cluster running on EMR.  I
 didn't have any problem installing the previous versions, but on this
 version I couldn't find any 'sbt' folder.  However, the README still
 suggests using this to install Spark:
 ./sbt/sbt assembly

 which fails:
 ./sbt/sbt: No such file or directory

 any suggestions?

 thanks,
 imran




Re: Spark SQL - groupby

2014-07-03 Thread Subacini B
Hi,

Can someone provide me pointers for this issue.

Thanks
Subacini


On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote:

 Hi,

 The code below throws a compilation error, "not found: value Sum". Can
 someone help me with this? Do I need to add any jars or imports? Even for
 Count, the same error is thrown.

 val queryResult = sql("select * from Table")
 queryResult.groupBy('colA)('colA, Sum('colB) as 'totB).aggregate(
   Sum('totB)).collect().foreach(println)

 Thanks
 subacini



Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
If you have downloaded the pre-compiled binary, it will not have sbt
directory inside it.

Thanks
Best Regards


On Thu, Jul 3, 2014 at 12:35 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Do you have an sbt directory inside your spark directory?

 Thanks
 Best Regards


 On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com
 wrote:

 Hi,
I'm trying to install spark 1 on my hadoop cluster running on EMR.  I
 didn't have any problem installing the previous versions, but on this
 version I couldn't find any 'sbt' folder.  However, the README still
 suggests using this to install Spark:
 ./sbt/sbt assembly

 which fails:
 ./sbt/sbt: No such file or directory

 any suggestions?

 thanks,
 imran





Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
Hi Sameer,

If you set those two IDs to be a Tuple2 in the key of the RDD, then you can
join on that tuple.

Example:

val rdd1: RDD[Tuple3[Int, Int, String]] = ...
val rdd2: RDD[Tuple3[Int, Int, String]] = ...

val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(
  rdd2.map(k => ((k._1, k._2), k._3)))


Note that when using .join though, that is an inner join so you only get
results from (id1, id2) pairs that have BOTH a score1 and a score2.
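
If you need to keep (id1, id2) pairs that have only one of the two scores, a rough sketch
(hypothetical element types) using cogroup instead of join:

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.0
import org.apache.spark.rdd.RDD

def combineScores(rdd1: RDD[(Int, Int, Double)],
                  rdd2: RDD[(Int, Int, Double)]): RDD[((Int, Int), (Option[Double], Option[Double]))] = {
  val left  = rdd1.map { case (id1, id2, s1) => ((id1, id2), s1) }
  val right = rdd2.map { case (id1, id2, s2) => ((id1, id2), s2) }
  left.cogroup(right).mapValues { case (s1s, s2s) =>
    (s1s.headOption, s2s.headOption)   // None when that side has no score for the key
  }
}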

Andrew


On Wed, Jul 2, 2014 at 5:12 PM, Sameer Tilak ssti...@live.com wrote:

 Hi everyone,

 Is it possible to join RDDs using composite keys? I would like to join
 these two RDDs with RDD1.id1 = RDD2.id1 and RDD1.id2 = RDD2.id2

 RDD1 (id1, id2, scoretype1)
 RDD2 (id1, id2, scoretype2)

 I want the result to be ResultRDD = (id1, id2, (score1, score2))

 Would really appreciate if you can point me in the right direction.




Re: Spark SQL - groupby

2014-07-03 Thread Takuya UESHIN
Hi,

You need to import Sum and Count like:

import org.apache.spark.sql.catalyst.expressions.{Sum,Count} // or
with wildcard _

or if you use current master branch build, you can use sum('colB)
instead of Sum('colB).

Thanks.
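
Putting it together, a minimal sketch of the query from the original post with the missing
import added (assuming the SQLContext implicits, e.g. import sqlContext._, are already in
scope and a table named "Table" is registered):

import org.apache.spark.sql.catalyst.expressions.Sum

val queryResult = sql("select * from Table")
queryResult
  .groupBy('colA)('colA, Sum('colB) as 'totB)
  .aggregate(Sum('totB))
  .collect()
  .foreach(println)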



2014-07-03 16:09 GMT+09:00 Subacini B subac...@gmail.com:
 Hi,

 Can someone provide me pointers for this issue.

 Thanks
 Subacini


 On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote:

 Hi,

 Below code throws  compilation error , not found: value Sum . Can
 someone help me on this. Do i need to add any jars or imports ? even for
 Count , same error is thrown

 val queryResult = sql(select * from Table)
  queryResult.groupBy('colA)('colA,Sum('colB) as
 'totB).aggregate(Sum('totB)).collect().foreach(println)

 Thanks
 subacini





-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Shark Vs Spark SQL

2014-07-03 Thread 田毅
Add MASTER=yarn-client, then the JDBC / Thrift server will run on YARN.



2014-07-02 16:57 GMT-07:00 田毅 tia...@asiainfo.com:

 hi, Matei


 Do you know how to run the JDBC / Thrift server on Yarn?


 I did not find any suggestion in docs.


 2014-07-02 16:06 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

 Spark SQL in Spark 1.1 will include all the functionality in Shark; take a
 look at
 http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html.
 We decided to do this because at the end of the day, the only code left in
 Shark was the JDBC / Thrift server, which is a very small amount of code.
 There’s also a branch of Spark 1.0 that includes this server if you want to
 replace Shark on Spark 1.0:
 https://github.com/apache/spark/tree/branch-1.0-jdbc. The server runs in
 a very similar way to how Shark did.

 Matei

 On Jul 2, 2014, at 3:57 PM, Shrikar archak shrika...@gmail.com wrote:

 As of Spark Summit 2014, they mentioned that there will be no active
 development on Shark.

 Thanks,
 Shrikar


  On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote:

 Hi,


 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E

 This talks about the Shark backend being replaced with the Spark SQL engine
 in the future.
 Does that mean Spark will continue to support Shark + Spark SQL for the long
 term? Or will Shark be decommissioned after some period?

 Thanks
 Subacini







Case class in java

2014-07-03 Thread Kevin Jung
Hi,
I'm trying to convert a Scala Spark job into Java.
In Scala, I typically use a 'case class' to apply a schema to an RDD.
It can be converted into a POJO class in Java, but what I really want to do is
dynamically create POJO classes the way the Scala REPL does.
For this reason, I use javassist to create the POJO class easily at runtime.
But the problem is that worker nodes can't find this class.
The error message is:
 host workernode2.com: java.lang.ClassNotFoundException: GeneratedClass_no1
 java.net.URLClassLoader$1.run(URLClassLoader.java:366)
java.net.URLClassLoader$1.run(URLClassLoader.java:355)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:354)
java.lang.ClassLoader.loadClass(ClassLoader.java:423)
java.lang.ClassLoader.loadClass(ClassLoader.java:356)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:266)
The generated class's classloader is
'Thread.currentThread().getContextClassLoader()'.
I expected it to be visible to the driver node, but the worker nodes' executors
cannot see it.
Would changing the classloader used to load the generated class, or broadcasting
the generated class via the Spark context, be effective?





Re: write event logs with YARN

2014-07-03 Thread Christophe Préaud
Hi Andrew,

This does not work (the application failed); I get the following error when I 
put 3 slashes in the hdfs scheme:
(...)
Caused by: java.lang.IllegalArgumentException: Pathname 
/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 from 
hdfs:/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 is not a valid DFS filename.
(...)

Besides, I do not think that there is an issue with the hdfs path name since 
only the empty APPLICATION_COMPLETE file is missing (with 
spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events), 
all other files are correctly created, e.g.:
hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470
Found 3 items
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
-rwxrwx---   1 kookel supergroup 137948 2014-07-03 08:32 
spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0

Your help is appreciated though, do not hesitate if you have any other idea on 
how to fix this.

Thanks,
Christophe.

On 03/07/2014 01:49, Andrew Lee wrote:
Hi Christophe,

Make sure you have 3 slashes in the hdfs scheme.

e.g.

hdfs:///server_name:9000/user/user_name/spark-events

and in the spark-defaults.conf as well.
spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events


 Date: Thu, 19 Jun 2014 11:18:51 +0200
 From: christophe.pre...@kelkoo.com
 To: user@spark.apache.org
 Subject: write event logs with YARN

 Hi,

 I am trying to use the new Spark history server in 1.0.0 to view finished 
 applications (launched on YARN), without success so far.

 Here are the relevant configuration properties in my spark-defaults.conf:

 spark.yarn.historyServer.address=server_name:18080
 spark.ui.killEnabled=false
 spark.eventLog.enabled=true
 spark.eventLog.compress=true
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events

 And the history server has been launched with the command below:

 /opt/spark/sbin/start-history-server.sh 
 hdfs://server_name:9000/user/user_name/spark-events


 However, the finished application do not appear in the history server UI 
 (though the UI itself works correctly).
 Apparently, the problem is that the APPLICATION_COMPLETE file is not created:

 hdfs dfs -stat %n spark-events/application_name-1403166516102/*
 COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
 EVENT_LOG_2
 SPARK_VERSION_1.0.0

 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
 directory, the application can now be viewed normally in the history server.

 Finally, here is the relevant part of the YARN application log, which seems 
 to imply that
 the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
 created:

 (...)
 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with 
 SUCCEEDED
 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1397477394591_0798
 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
 shutdown hook
 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
 http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all 
 executors
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to 
 shut down
 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
 stopped!
 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted!
 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
 14/06/19 08:29:30 INFO BlockManager: BlockManager stopped
 14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
 14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped
 Exception in thread Thread-44 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380)
 at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
 

Re: java options for spark-1.0.0

2014-07-03 Thread Wanda Hawk
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line 
export _JAVA_OPTIONS=...)

/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms512m-Xmx512morg.apache.spark.deploy.SparkSubmit--classSparkKMeans--verbose--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


This is the cmdline from /proc/#pid with spark-0.8.0, launching KMeans 
with scala -J-Xms16g -J-Xmx16g. The export line from bashrc is ignored 
here as well (if I launch without specifying the java options after the scala 
command, the heap will have the default value) - the results below are from 
launching it with the java options specified after the scala command:

/usr/java/jdk1.7.0_51/bin/java-Xmx256M-Xms32M-Xms16g-Xmx16g-Xbootclasspath/a:/home/spark2013/scala-2.9.3/lib/jline.jar:/home/spark2013/scala-2.9.3/lib/scalacheck.jar:/home/spark2013/scala-2.9.3/lib/scala-compiler.jar:/home/spark2013/scala-2.9.3/lib/scala-dbc.jar:/home/spark2013/scala-2.9.3/lib/scala-library.jar:/home/spark2013/scala-2.9.3/lib/scala-partest.jar:/home/spark2013/scala-2.9.3/lib/scalap.jar:/home/spark2013/scala-2.9.3/lib/scala-swing.jar-Dscala.usejavacp=true-Dscala.home=/home/spark2013/scala-2.9.3-Denv.emacs=scala.tools.nsc.MainGenericRunner-J-Xms16g-J-Xmx16g-cp/home/spark2013/Runs/KMeans/GC/classesSparkKMeanslocal[24]/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


Launching spark-1.0.0 with spark-submit and --driver-memory 10g gets picked up, 
but the results of the execution are the same: a lot of allocation failures.
/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


Adding --executor-memory 11g will not change the outcome:
cat /proc/13286/cmdline
/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--executor-memory11g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001

So the Xmx and Xms can be altered, but execution performance is far worse 
compared to spark 0.8.0. How can I improve it?


Thanks
On Wednesday, July 2, 2014 9:34 PM, Matei Zaharia matei.zaha...@gmail.com 
wrote:
 


Try looking at the running processes with “ps” to see their full command line 
and see whether any options are different. It seems like in both cases, your 
young generation is quite large (11 GB), which doesn't make a lot of sense with a 
heap of 15 GB. But maybe I'm misreading something.

Matei

On Jul 2, 2014, at 4:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with 
spark-0.8.0 with this line in bash.rc: export _JAVA_OPTIONS="-Xmx15g -Xms15g 
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails". It finished in a 
decent time, ~50 seconds, and I had only a few Full GC messages from 
Java (a max of 4-5).


Now, using the same export in bash.rc but with spark-1.0.0  (and running it 
with spark-submit) the first loop never finishes and  I get a lot of:
18.537: [GC (Allocation Failure) --[PSYoungGen: 
11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311 
secs] [Times: user=5.81 sys=2.12, real=2.85 secs]

or 


 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)] 
[ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace: 
37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, 
real=2.31 secs]
 
I tried passing different parameters for the JVM through spark-submit, but the 
results are the same
This happens with java 1.7 and also with java 1.8.
I do not know what the Ergonomics stands for ...


How can I get a decent performance from spark-1.0.0 considering that 

Which version of Hive support Spark Shark

2014-07-03 Thread Ravi Prasad
Hi ,
  Can anyone please help me understand which version of Hive supports
Spark and Shark?

-- 
--
Regards,
RAVI PRASAD. T


Re: Case class in java

2014-07-03 Thread Kevin Jung
I found a web page with a hint:
http://ardoris.wordpress.com/2014/03/30/how-spark-does-class-loading/
I learned that SparkIMain has an internal HTTP server to publish class objects, but
I can't figure out how to use it from Java.
Any ideas?

Thanks,
Kevin





hdfs short circuit

2014-07-03 Thread Jahagirdar, Madhu
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve 
performance and read natively on local nodes instead of going through the HDFS API?
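
For reference, a rough sketch of how such a client-side HDFS property could be passed through
(assuming the DataNode-side prerequisites for short-circuit reads, i.e. libhadoop and a
configured dfs.domain.socket.path, are already in place; the socket path below is hypothetical):

// Spark reads HDFS through the normal Hadoop client, so the property can be set on the
// Hadoop configuration carried by the SparkContext (or in hdfs-site.xml on the classpath).
sc.hadoopConfiguration.setBoolean("dfs.client.read.shortcircuit", true)
sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/run/hadoop-hdfs/dn_socket")

val rdd = sc.textFile("hdfs:///some/path")   // subsequent HDFS reads may use short-circuit local reads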




Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-03 Thread Surendranauth Hiraman
I've had some odd behavior with jobs showing up in the history server in
1.0.0. Failed jobs do show up but it seems they can show up minutes or
hours later. I see in the history server logs messages about bad task ids.
But then eventually the jobs show up.

This might be your situation.

Anecdotally, if you click on the job in the Spark Master GUI after it is
done, this may help it show up in the history server faster. Haven't
reliably tested this though. May just be a coincidence of timing.

-Suren



On Wed, Jul 2, 2014 at 8:01 PM, Andrew Lee alee...@hotmail.com wrote:

 Hi All,

 I have HistoryServer up and running, and it is great.

 Is it possible to also enable the HistoryServer to parse failed jobs' events by
 default as well?

 I get "No Completed Applications Found" if a job fails.

 =
 Event Log Location: hdfs:///user/test01/spark/logs/

 No Completed Applications Found
 =

 The reason is that while it is good to run the HistoryServer to keep track of
 performance and resource usage for each completed job,
 I found it even more useful when a job fails. I can identify which stage
 failed, etc., instead of sifting through the logs
 from the Resource Manager. The same event log is only available while the
 Application Master is still active; once the job fails,
 the Application Master is killed and I lose the GUI access. Even though I
 have the event log in JSON format, I can't open it with
 the HistoryServer.

 This is very helpful especially for long running jobs that last for 2-18
 hours that generates Gigabytes of logs.

 So I have 2 questions:

 1. Any reason why we only render completed jobs? Why can't we bring in all
 jobs and choose from the GUI? Like a time machine to restore the status
 from the Application Master?

 ./core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala

 val logInfos = logDirs
   .sortBy { dir => getModificationTime(dir) }
   .map { dir => (dir, EventLoggingListener.parseLoggingInfo(dir.getPath, fileSystem)) }
   .filter { case (dir, info) => info.applicationComplete }



 2. If I force to touch a file APPLICATION_COMPLETE in the failed job
 event log folder, will this cause any problem?








-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io


Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-03 Thread Wanda Hawk
I have given this a try in a spark-shell and I still get many Allocation 
Failures


On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is just example code showing a barebones
implementation of k-means. To run k-means on big datasets, please use
the KMeans implementation in MLlib directly:
http://spark.apache.org/docs/latest/mllib-clustering.html
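
For example, a minimal sketch (hypothetical input path and parameters) of calling the MLlib
KMeans directly:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 1024, 20)   // k = 1024 clusters, 20 iterations
println("Within-set sum of squared errors: " + model.computeCost(data))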

-Xiangrui


On Wed, Jul 2, 2014 at 9:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 I can run it now with the suggested method. However, I have encountered a
 new problem that I have not faced before (sent another email with that one
 but here it goes again ...)

 I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with
 spark-0.8.0 with this line in bash.rc  export _JAVA_OPTIONS=-Xmx15g
 -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It
 finished in a decent time, ~50 seconds, and I had only a few Full GC
 messages from Java. (a max of 4-5)

 Now, using the same export in bash.rc but with spark-1.0.0  (and running it
 with spark-submit) the first loop never finishes and  I get a lot of:
 18.537: [GC (Allocation Failure) --[PSYoungGen:
 11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311
 secs] [Times: user=5.81 sys=2.12, real=2.85 secs]
 
 or

  31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)]
 [ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace:
 37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11,
 real=2.31 secs]

 I tried passing different parameters for the JVM through spark-submit, but
 the results are the same.
 This happens with java 1.7 and also with java 1.8.
 I do not know what "Ergonomics" stands for ...

 How can I get decent performance from spark-1.0.0, considering that
 spark-0.8.0 did not need any fine tuning of the garbage collection method
 (the default worked well)?

 Thank you


 On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:


 The scripts that Xiangrui mentions set up the classpath...Can you run
 ./run-example for the provided example sucessfully?

 What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call
 run-example -- that will show you the exact java command used to run
 the example at the start of execution. Assuming you can run examples
 succesfully, you should be able to just copy that and add your jar to
 the front of the classpath. If that works you can start removing extra
 jars (run-examples put all the example jars in the cp, which you won't
 need)

 As you said the error you see is indicative of the class not being
 available/seen at runtime but it's hard to tell why.

 On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 I want to make some minor modifications in the SparkMeans.scala so running
 the basic example won't do.
 I have also packed my code under a jar file with sbt. It completes
 successfully but when I try to run it : java -jar myjar.jar I get the
 same
 error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 

 If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does
 it succeed in compiling and not give the same error?
 The error itself NoClassDefFoundError means that the files are available
 at compile time, but for some reason I cannot figure out they are not
 available at run time. Does anyone know why ?

 Thank you


 On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:


 You can use either bin/run-example or bin/spark-summit to run example
 code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark
 classpath. There are examples in the official doc:
 http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
 -Xiangrui

 On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added


 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples
 created
 a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
        at 

Reading text file vs streaming text files

2014-07-03 Thread M Singh
Hi:

I am working on a project where a few thousand text files (~20M in size) will 
be dropped in an hdfs directory every 15 minutes.  Data from the files will be used 
to update counters in cassandra (a non-idempotent operation).  I was wondering 
what is the best way to deal with this:
* Use text streaming and process the files as they are added to the 
directory
* Use non-streaming text input and launch a spark driver every 15 
minutes to process files from a specified directory (a new directory for every 15 
minutes).
* Use a message queue to ingest data from the files and then read data 
from the queue.
Also, is there a way to find which text file is being processed and when a 
file has been processed, for both the streaming and non-streaming RDDs?  I 
believe the filename is available in WholeTextFileInputFormat, but is it 
available in standard or streaming text RDDs?

Thanks

Mans 

matchError:null in ALS.train

2014-07-03 Thread Honey Joshi
Hi All,

We are using ALS.train to generate a model for predictions. We are using
DStream[] to collect the predicted output and then trying to dump it to a
text file using these two approaches: dstream.saveAsTextFiles() and
dstream.foreachRDD(rdd => rdd.saveAsTextFile(...)). But both approaches are
giving us the following error:


Exception in thread main org.apache.spark.SparkException: Job aborted
due to stage failure: Task 1.0:0 failed 1 times, most recent failure:
Exception failure in TID 0 on host localhost: scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43)
MyOperator$$anonfun$7.apply(MyOperator.scala:213)
MyOperator$$anonfun$7.apply(MyOperator.scala:180)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

We tried it with both Spark 0.9.1 and 1.0.0 (Scala 2.10.3). Can
anybody help me with this issue?

Thank You
Regards

Honey Joshi
Ideata-Analytics


Re: Reading text file vs streaming text files

2014-07-03 Thread Akhil Das
Hi Singh!

For this use case it's better to have a StreamingContext listening to the
hdfs directory where the files are being dropped. You can set the
streaming interval to 15 minutes and let the driver program run
continuously, so as soon as new files arrive they are picked up for
processing in the next 15-minute batch. In this way, you don't have to worry about
old files unless you are about to restart the driver program. Another
option would be, after processing each batch, to simply move
the processed files to another directory.
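
A rough sketch of this approach (hypothetical paths and app name; Spark 1.0 streaming API):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("FileCounterUpdater")
val ssc = new StreamingContext(conf, Minutes(15))

// only files added to the directory during each 15-minute batch are processed
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open one Cassandra connection per partition here and update the counters;
    // the non-idempotent updates still need their own at-most-once handling
    records.foreach(line => ())
  }
}

ssc.start()
ssc.awaitTermination()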

Thanks
Best Regards


On Thu, Jul 3, 2014 at 6:34 PM, M Singh mans6si...@yahoo.com wrote:

 Hi:

 I am working on a project where a few thousand text files (~20M in size)
 will be dropped in an hdfs directory every 15 minutes.  Data from the file
 will used to update counters in cassandra (non-idempotent operation).  I
 was wondering what is the best to deal with this:

- Use text streaming and process the files as they are added to the
directory
- Use non-streaming text input and launch a spark driver every 15
minutes to process files from a specified directory (new directory for
every 15 minutes).
- Use message queue to ingest data from the files and then read data
from the queue.

 Also, is there a way to to find which text file is being processed and
 when a file has been processed for both the streaming and non-streaming
 RDDs.  I believe filename is available in the WholeTextFileInputFormat but
 is it available in standard or streaming text RDDs.

 Thanks

 Mans



Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-03 Thread Honey Joshi
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote:
 Two job contexts cannot share data. Are you collecting the data to the
 master and then sending it to the other context?

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




 On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi 
 honeyjo...@ideata-analytics.com wrote:

 On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote:

 Ideally you should be converting the RDD to a SchemaRDD?
 Are you creating a UnionRDD to join across DStream RDDs?




 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi





 On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi
 honeyjo...@ideata-analytics.com


 wrote:



 Hi,
 I am trying to run a project which takes data as a DStream and dumps
 the data in the Shark table after various operations. I am getting
 the following error :

 Exception in thread "main" org.apache.spark.SparkException: Job aborted:
 Task 0.0:0 failed 1 times (most recent failure: Exception failure:
 java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition
 cannot be cast to org.apache.spark.rdd.HadoopPartition)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 Can someone please explain the cause of this error, I am also using
 a Spark Context with the existing Streaming Context.





 I am using Spark 0.9.0-incubating, so it doesn't have anything to do
 with SchemaRDD. This error is probably coming up when I am trying to use one
 spark context and one shark context in the same job. Is there any way to
 incorporate two contexts in one job? Regards


 Honey Joshi
 Ideata-Analytics




Both of these contexts are executing independently but they were still
giving us issues, mostly because of the lazy evaluation in Scala. This
error was probably coming up because I was trying to use one spark context and one
shark context in the same job. I got it resolved by stopping the existing
spark context before creating the shark context. Thanks for your help
Mayur.



Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-03 Thread Eustache DIEMERT
Printing the model shows the intercept is always 0 :(

Should I open a bug for that ?


2014-07-02 16:11 GMT+02:00 Eustache DIEMERT eusta...@diemert.fr:

 Hi list,

 I'm benchmarking MLlib for a regression task [1] and get strange results.

 Namely, using RidgeRegressionWithSGD it seems the predicted points miss
 the intercept:

 {code}
 val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
 ...
 valuesAndPreds.take(10).map(t = println(t))
 {code}

 output:

 (2007.0,-3.784588726958493E75)
 (2003.0,-1.9562390324037716E75)
 (2005.0,-4.147413202985629E75)
 (2003.0,-1.524938024096847E75)
 ...

 If I change the parameters (step size, regularization and iterations) I
 get NaNs more often than not:
 (2007.0,NaN)
 (2003.0,NaN)
 (2005.0,NaN)
 ...

 On the other hand DecisionTree model give sensible results.

 I see there is a `setIntercept()` method in abstract class
 GeneralizedLinearAlgorithm that seems to trigger the use of the intercept
 but I'm unable to use it from the public interface :(

 Any help appreciated :)

 Eustache

 [1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD



Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread jackxucs
Hello,

I am running the BroadcastTest example in a standalone cluster using
spark-submit. I have 8 host machines and made Host1 the master. Host2 to
Host8 act as 7 workers connecting to the master. The connection was fine as
I could see all 7 hosts on the master web UI. The BroadcastTest example with
"Http" broadcast also works fine, I think, as there was no error msg and all
workers EXITED at the end. But when I changed the third argument from
"Http" to "Torrent" to use Torrent broadcast, all workers got a KILLED
status once they reached sc.stop().
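
For reference, a minimal sketch (hypothetical app name) of what that third argument selects:
the example chooses the broadcast implementation through the spark.broadcast.factory setting,
which can also be set directly:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("BroadcastFactoryDemo")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")

val sc = new SparkContext(conf)
val bc = sc.broadcast((0 until 1000000).toArray)
println(sc.parallelize(1 to 10, 10).map(_ => bc.value.length).collect().mkString(","))
sc.stop()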

Below is the stderr on one of the workers when running Torrent broadcast (I
masked the IP addresses):
==
14/07/02 18:20:03 INFO SecurityManager: Changing view acls to: root
14/07/02 18:20:03 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root)
14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
14/07/02 18:20:04 INFO Remoting: Starting remoting
14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771]
14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771]
14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root
14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root)
14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
14/07/02 18:20:04 INFO Remoting: Starting remoting
14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661]
14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661]
14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/CoarseGrainedScheduler
14/07/02 18:20:04 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker
14/07/02 18:20:04 INFO Remoting: Remoting shut down
14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.
14/07/02 18:20:04 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker
14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Successfully registered
with driver
14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root
14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root)
14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
14/07/02 18:20:04 INFO Remoting: Starting remoting
14/07/02 18:20:05 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@dyn-xxx-xx-xx-xx:57883]
14/07/02 18:20:05 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@dyn-xxx-xx-xx-xx:57883]
14/07/02 18:20:05 INFO SparkEnv: Connecting to MapOutputTracker:
akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/MapOutputTracker
14/07/02 18:20:05 INFO SparkEnv: Connecting to BlockManagerMaster:
akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/BlockManagerMaster
14/07/02 18:20:05 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140702182005-30bd
14/07/02 18:20:05 INFO ConnectionManager: Bound socket to port 60368 with id
= ConnectionManagerId(dyn-xxx-xx-xx-xx,60368)
14/07/02 18:20:05 INFO MemoryStore: MemoryStore started with capacity 294.6
MB
14/07/02 18:20:05 INFO BlockManagerMaster: Trying to register BlockManager
14/07/02 18:20:05 INFO BlockManagerMaster: Registered BlockManager
14/07/02 18:20:05 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-35f65442-e0e8-4122-9359-ca8232ca97a6
14/07/02 18:20:05 INFO HttpServer: Starting HTTP Server
14/07/02 18:20:06 INFO CoarseGrainedExecutorBackend: Got assigned task 9
14/07/02 18:20:06 INFO Executor: Running task ID 9
14/07/02 18:20:06 INFO Executor: Fetching
http://xxx.xx.xx.xx:54292/jars/broadcast-test_2.10-1.0.jar with timestamp
1404339601903
14/07/02 18:20:06 INFO Utils: Fetching
http://xxx.xx.xx.xx:54292/jars/broadcast-test_2.10-1.0.jar to
/tmp/fetchFileTemp5382215579021312284.tmp
14/07/02 18:20:06 INFO BlockManager: Removing broadcast 0
14/07/02 18:20:07 INFO Executor: Adding
file:/home/lrl/Desktop/spark-master/work/app-20140702182002-0006/3/./broadcast-test_2.10-1.0.jar
to class loader
14/07/02 18:20:07 INFO TorrentBroadcast: Started reading broadcast variable
1
14/07/02 18:20:07 INFO SendingConnection: Initiating connection to
[dyn-xxx-xx-xx-xx:60179]
14/07/02 18:20:07 INFO SendingConnection: Connected to
[dyn-xxx-xx-xx-xx:60179], 1 

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Hi Konstantin,

Could you please create a jira item at: 
https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?

Thanks,
Denny


On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev 
(kudryavtsev.konstan...@gmail.com) wrote:

It sounds really strange...

I guess it is a bug, a critical bug that must be fixed... at least some flag 
should be added (e.g. to disable the Hadoop dependency).

I found the following workaround:
1) download a compiled winutils.exe from 
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) put this file into d:\winutil\bin
3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")

After that, the test runs.

Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote:
You don't actually need it per se - it's just that some of the Spark libraries 
are referencing Hadoop libraries even if they ultimately do not call them. When 
I was doing some early builds of Spark on Windows, I admittedly had Hadoop on 
Windows running as well and had not run into this particular issue.



On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
kudryavtsev.konstan...@gmail.com wrote:
No, I don’t

why do I need to have HDP installed? I don’t use Hadoop at all and I’d like to 
read data from local filesystem

On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:

By any chance do you have HDP 2.1 installed? you may need to install the utils 
and update the env variables per 
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows


On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
kudryavtsev.konstan...@gmail.com wrote:

Hi Andrew,

it's Windows 7 and I didn't set up any env variables here 

The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote:
Hi Konstatin,

We use hadoop as a library in a few places in Spark. I wonder why the path 
includes null though.

Could you provide the full stack trace?


Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
That’s an obvious workaround, yes, thank you Tobias.
However, I’m prototyping a substitution for a real batch process, where I’d have to 
create six streams (and possibly more).  It could get a bit messy.
On the other hand, under the hood the KafkaInputDStream which is created with this 
KafkaUtils call calls ConsumerConnector.createMessageStream, which returns a 
Map[String, List[KafkaStream]] keyed by topic. It is, however, not exposed.
So Kafka does provide the capability of creating multiple streams based on topic, 
but Spark doesn’t use it, which is unfortunate.

Sergey

From: Tobias Pfeiffer t...@preferred.jp
Reply-To: user@spark.apache.org
Date: Wednesday, July 2, 2014 at 9:54 PM
To: user@spark.apache.org
Subject: Re: Kafka - streaming from multiple topics

Sergey,

you might actually consider using two streams, like
  val stream1 = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("retarget" -> 2))
  val stream2 = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("datapair" -> 2))
to achieve what you want. This has the additional advantage that there are 
actually two connections to Kafka and data is possibly received on different 
cluster nodes, already increasing parallelism in an early stage of processing.
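
A sketch of what topic-specific pipelines on those two streams might then look
like (the transformations are placeholders, not part of the suggestion above):

  // placeholder pipelines -- swap the map/filter steps for the real per-topic logic
  val retargetLines = stream1.map(_._2)          // drop the (mostly null) Kafka key
  val datapairLines = stream2.map(_._2)
  retargetLines.count().print()
  datapairLines.filter(_.nonEmpty).count().print()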

Tobias



On Thu, Jul 3, 2014 at 6:47 AM, Sergey Malov 
sma...@collective.commailto:sma...@collective.com wrote:
HI,
I would like to set up streaming from Kafka cluster, reading multiple topics 
and then processing each of the differently.
So, I’d create a stream

  val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("retarget" -> 2, "datapair" -> 2))

And then based on whether it’s “retarget” topic or “datapair”, set up different 
filter function, map function, reduce function, etc. Is it possible?  I’d 
assume it should be, since ConsumerConnector can return a map of KafkaStreams 
keyed on topic, but I can’t see where that map would be visible to Spark.

Thank you,

Sergey Malov




Re: reduceByKey Not Being Called by Spark Streaming

2014-07-03 Thread Dan H.
Hi All,

I was able to resolve this matter with a simple fix.  It seems that in order
to process the reduceByKey and flatMap operations at the same time, the
only way to resolve it was to increase the number of threads to more than 1.

Since I'm developing on my personal machine for speed, I simply updated the
sparkURL argument to:
   private static String sparkURL = "local[2]";  // instead of "local"

,which is then used by the JavaStreamingContext method as a parameter.

After I made this change, I was able to see the reduceByKey values properly
aggregated and counted.
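
For reference, the same fix as a minimal Scala sketch (app name and batch
interval are arbitrary):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // "local[2]" leaves one thread for the receiver and at least one for processing
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))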

Best Regards,

D



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-Not-Being-Called-by-Spark-Streaming-tp8684p8739.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: write event logs with YARN

2014-07-03 Thread Andrew Or
Hi Christophe, another Andrew speaking.

Your configuration looks fine to me. From the stack trace it seems that we
are in fact closing the file system prematurely elsewhere in the system,
such that when it tries to write the APPLICATION_COMPLETE file it throws
the exception you see. This does look like a potential bug in Spark.
Tracing the source of this may take a little, but we will start looking
into it.

I'm assuming if you manually create your own APPLICATION_COMPLETE file then
the entries should show up. Unfortunately I don't see another workaround
for this, but we'll fix this as soon as possible.

Andrew


2014-07-03 1:44 GMT-07:00 Christophe Préaud christophe.pre...@kelkoo.com:

  Hi Andrew,

 This does not work (the application failed), I have the following error
 when I put 3 slashes in the hdfs scheme:
 (...)
 Caused by: java.lang.IllegalArgumentException: Pathname /
 dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 from hdfs:/
 dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 is not a valid DFS filename.
 (...)

 Besides, I do not think that there is an issue with the hdfs path name
 since only the empty APPLICATION_COMPLETE file is missing (with
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events),
 all other files are correctly created, e.g.:
 hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470
 Found 3 items
 -rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29
 spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
 -rwxrwx---   1 kookel supergroup 137948 2014-07-03 08:32
 spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2
 -rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29
 spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0

 Your help is appreciated though; do not hesitate if you have any other ideas
 on how to fix this.

 Thanks,
 Christophe.


 On 03/07/2014 01:49, Andrew Lee wrote:

 Hi Christophe,

  Make sure you have 3 slashes in the hdfs scheme.

  e.g.

  hdfs:///server_name:9000/user/user_name/spark-events

  and in the spark-defaults.conf as well:
 spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events


  Date: Thu, 19 Jun 2014 11:18:51 +0200
  From: christophe.pre...@kelkoo.com
  To: user@spark.apache.org
  Subject: write event logs with YARN
 
  Hi,
 
  I am trying to use the new Spark history server in 1.0.0 to view
 finished applications (launched on YARN), without success so far.
 
  Here are the relevant configuration properties in my spark-defaults.conf:
 
  spark.yarn.historyServer.address=server_name:18080
  spark.ui.killEnabled=false
  spark.eventLog.enabled=true
  spark.eventLog.compress=true
 
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events
 
  And the history server has been launched with the command below:
 
  /opt/spark/sbin/start-history-server.sh
 hdfs://server_name:9000/user/user_name/spark-events
 
 
  However, the finished application do not appear in the history server UI
 (though the UI itself works correctly).
  Apparently, the problem is that the APPLICATION_COMPLETE file is not
 created:
 
  hdfs dfs -stat %n spark-events/application_name-1403166516102/*
  COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
  EVENT_LOG_2
  SPARK_VERSION_1.0.0
 
  Indeed, if I manually create an empty APPLICATION_COMPLETE file in the
 above directory, the application can now be viewed normally in the history
 server.
 
  Finally, here is the relevant part of the YARN application log, which
 seems to imply that
  the DFS Filesystem is already closed when the APPLICATION_COMPLETE file
 is created:
 
  (...)
  14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with
 SUCCEEDED
  14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be
 successfully unregistered.
  14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
  14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory
 .sparkStaging/application_1397477394591_0798
  14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from
 shutdown hook
  14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at
 http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
  14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
  14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all
 executors
  14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each
 executor to shut down
  14/06/19 08:29:30 INFO MapOutputTrackerMasterActor:
 MapOutputTrackerActor stopped!
  14/06/19 08:29:30 INFO ConnectionManager: Selector thread was
 interrupted!
  14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
  14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
  14/06/19 08:29:30 INFO BlockManager: 

spark text processing

2014-07-03 Thread M Singh
Hi:

Is there a way to find out when spark has finished processing a text file (both 
for streaming and non-streaming cases) ?

Also, after processing, can spark copy the file to another directory ?


Thanks


Re: issue with running example code

2014-07-03 Thread Gurvinder Singh
Just to provide more information on this issue. It seems that SPARK_HOME
environment variable is causing the issue. If I unset the variable in
spark-class script and run in the local mode my code runs fine without
the exception. But if I run with SPARK_HOME, I get the exception
mentioned below. I could run without setting SPARK_HOME, but that is not
possible in a cluster setting, as this variable tells the workers where Spark
is located. E.g. we are using Mesos as the cluster manager, so when the master
is set to Mesos we get the exception because SPARK_HOME is not set.

Just to mention again, pyspark works fine, as does spark-shell; only when we
run the compiled jar does SPARK_HOME seem to cause the Java runtime issue that
produces the class cast exception.

Thanks,
Gurvinder
On 07/01/2014 09:28 AM, Gurvinder Singh wrote:
 Hi,
 
 I am having issue in running scala example code. I have tested and able
 to run successfully python example code, but when I run the scala code I
 get this error
 
 java.lang.ClassCastException: cannot assign instance of
 org.apache.spark.examples.SparkPi$$anonfun$1 to field
 org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
 org.apache.spark.rdd.MappedRDD
 
 I have compiled spark from the github directly and running with the
 command as
 
 spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
 --class org.apache.spark.examples.SparkPi 5 --jars
 /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
 
 Any suggestions will be helpful.
 
 Thanks,
 Gurvinder
 



Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Dmitriy Lyubimov
On Wed, Jul 2, 2014 at 11:40 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Dmitriy,

 It is sweet to have the bindings, but it is very easy to downgrade the
 performance with them. The BLAS/LAPACK APIs have been there for more
 than 20 years and they are still the top choice for high-performance
 linear algebra.


There's no such limitation there. In fact, LAPACK/jblas is very low-hanging
fruit to have there.

The algebraic optimizer is not so much about in-core block-on-block techniques.
It is about optimization/simplification of algebraic expressions, especially on
the distributed-plan side of things. Another side of the story is a consistent
matrix representation for block-to-block in-memory computations and passing
stuff in and out. R-like look & feel.

It is true that in-core-only computations are currently not deferred-optimized,
nor do they have an LAPACK back end, but this is low-hanging fruit there. The
main idea is consistency of the algebraic API/DSL, be it distributed or in-core,
along with an algebraic optimizer and pluggable back ends (both in-core back
ends and distributed engine back ends).

It so happened that the only in-memory back end right now is Mahout's Colt
derivation, but there's fairly little reason not to plug in Breeze/LAPACK, or
say GPU-backed representations. In fact, that was my first attempt a year ago
(Breeze), but it unfortunately was not where it needed to be (not sure about
now).

As for LAPACK, yes it is easy to integrate. But the only reason I
(personally) haven't integrated it yet is because my problems tend to be
sparse, not dense, and also fairly invasive in terms of custom matrix
traversals (probabilistic fitting, for the most part). So most specifically
tweaked methodologies are thus really more quasi-algebraic than purely
algebraic, unfortunately. Having LAPACK blockwise operators on dense
matrices would not help me terribly there.

But the architectural problem in terms of foundation and, more specifically,
customization of processes IMO does exist here (in MLlib). I read this thread
(and there was another one just like it a few threads below this one) as a
manifestation of that lack of algebraic foundation APIs/optimizers.


Re: Run spark unit test on Windows 7

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi Denny,

just created https://issues.apache.org/jira/browse/SPARK-2356

On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote:

 Hi Konstantin,
 
 Could you please create a jira item at: 
 https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?
 
 Thanks,
 Denny
 
 
 On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev 
 (kudryavtsev.konstan...@gmail.com) wrote:
 
 It sounds really strange...
 
 I guess it is a bug, a critical bug, and must be fixed... at least some flag 
 must be added (unable.hadoop)
 
 I found the next workaround :
 1) download compiled winutils.exe from 
 http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
 2) put this file into d:\winutil\bin
 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")
 
 after that test runs
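
 A minimal Scala test sketch applying the workaround above (the input path and
 app name are made up):

   import org.apache.spark.SparkContext

   // assumes winutils.exe was unpacked into d:\winutil\bin
   System.setProperty("hadoop.home.dir", "d:\\winutil\\")
   val sc = new SparkContext("local", "WindowsLocalTest")
   val lines = sc.textFile("file:///d:/data/sample.txt")   // hypothetical input file
   println(lines.count())
   sc.stop()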
 
 Thank you,
 Konstantin Kudryavtsev
 
 
 On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote:
 You don't actually need it per se - it's just that some of the Spark 
 libraries are referencing Hadoop libraries even if they ultimately do not 
 call them. When I was doing some early builds of Spark on Windows, I 
 admittedly had Hadoop on Windows running as well and had not run into this 
 particular issue.
 
 
 
 On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 No, I don’t
 
 why do I need to have HDP installed? I don’t use Hadoop at all and I’d like 
 to read data from local filesystem
 
 On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:
 
 By any chance do you have HDP 2.1 installed? you may need to install the 
 utils and update the env variables per 
 http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
 
 
 On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Hi Andrew,
 
 it's Windows 7 and I didn't set up any env variables here 
 
 The full stack trace:
 
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in 
 the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
 at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.init(Groups.java:77)
 at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
 at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
 at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 at org.apache.spark.SparkContext.init(SparkContext.scala:228)
 at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
 at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
 at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
 at 
 com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
 at 
 com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
 at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
 
 
 

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Thanks!  will take a look at this later today. HTH!



 On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Hi Denny,
 
 just created https://issues.apache.org/jira/browse/SPARK-2356
 
 On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote:
 
 Hi Konstantin,
 
 Could you please create a jira item at: 
 https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?
 
 Thanks,
 Denny
 
 
 On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev 
 (kudryavtsev.konstan...@gmail.com) wrote:
 
 It sounds really strange...
 
 I guess it is a bug, a critical bug, and must be fixed... at least some flag 
 must be added (unable.hadoop)
 
 I found the next workaround :
 1) download compiled winutils.exe from 
 http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
 2) put this file into d:\winutil\bin
 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")
 
 after that test runs
 
 Thank you,
 Konstantin Kudryavtsev
 
 
 On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote:
 You don't actually need it per se - it's just that some of the Spark 
 libraries are referencing Hadoop libraries even if they ultimately do not 
 call them. When I was doing some early builds of Spark on Windows, I 
 admittedly had Hadoop on Windows running as well and had not run into this 
 particular issue.
 
 
 
 On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 No, I don’t
 
 why do I need to have HDP installed? I don’t use Hadoop at all and I’d 
 like to read data from local filesystem
 
 On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:
 
 By any chance do you have HDP 2.1 installed? you may need to install the 
 utils and update the env variables per 
 http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
 
 
 On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Hi Andrew,
 
 it's Windows 7 and I didn't set up any env variables here 
 
 The full stack trace:
 
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in 
 the hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe 
 in the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
 at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.init(Groups.java:77)
 at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
 at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 at org.apache.spark.SparkContext.init(SparkContext.scala:228)
 at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
 at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
 at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
 at 
 com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
 at 
 com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
 at 
 com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 

Anaconda Spark AMI

2014-07-03 Thread Benjamin Zaitlen
Hi All,

I'm a dev at Continuum and we are developing a fair amount of tooling around
Spark.  A few days ago someone expressed interest in numpy+pyspark and
Anaconda came up as a reasonable solution.

I spent a number of hours yesterday trying to rework the base Spark AMI on
EC2 but sadly was defeated by a number of errors.

Aggregations seemed to choke -- whereas small takes executed as expected
(errors are linked to the gist):

>>> sc.appName
u'PySparkShell'
>>> sc._conf.getAll()
[(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
(u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
(u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
u'/root/ephemeral-hdfs/conf'), (u'spark.master',
u'spark://.compute-1.amazonaws.com:7077')]
>>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
>>> file.take(2)
[u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
u'']

>>> lines = file.filter(lambda x: len(x) > 0)
>>> lines.count()
VARIOUS ERRORS DISCUSSED BELOW

My first thought was that I could simply get away with including anaconda
on the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
 Doing so resulted in some strange py4j errors like the following:

Py4JError: An error occurred while calling o17.partitions. Trace:
py4j.Py4JException: Method partitions([]) does not exist

At some point I also saw:
SystemError: Objects/cellobject.c:24: bad argument to internal function

which is really strange, possibly the result of a version mismatch?

I had another thought of building spark from master on the AMI, leaving the
spark directory in place, and removing the spark call from the modules list
in spark-ec2 launch script. Unfortunately, this resulted in the following
errors:

https://gist.github.com/quasiben/da0f4778fbc87d02c088

If a spark dev was willing to make some time in the near future, I'm sure
she/he and I could sort out these issues and give the Spark community a
python distro ready to go for numerical computing.  For instance, I'm not
sure how pyspark calls out to launching a python session on a slave?  Is
this done as root or as the hadoop user? (I believe I changed /etc/bashrc
to point to my anaconda bin directory so it shouldn't really matter.)  Is
there something special about the py4j zip included in the spark dir compared
with the py4j in pypi?

Thoughts?

--Ben


Re: Spark Streaming Error Help - ERROR actor.OneForOneStrategy: key not found:

2014-07-03 Thread jschindler
I think I have found my answers but if anyone has thoughts please share.

After testing for a while I think the error doesn't have any effect on the
process.

I think it is the case that there must be elements left in the window from
last run otherwise my system is completely whack.

Please let me know if any of this looks incorrect, thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Error-Help-ERROR-actor-OneForOneStrategy-key-not-found-tp8746p8750.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark logging strategy on YARN

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi all,

Could you please share your best practices for writing logs in Spark? I’m 
running it on YARN, so when I check the logs I’m a bit confused… 
Currently, I’m using System.err.println to put a message in the log and access it 
via the YARN history server. But I don’t like this way… I’d like to use 
log4j/slf4j and write the logs to a more concrete place… any practices?
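
For what it's worth, a minimal slf4j sketch of the kind of logging described
above (the object name and message are made up; where the output lands still
depends on the log4j.properties shipped to the YARN containers):

  import org.slf4j.LoggerFactory

  object MyYarnJob {
    // goes through slf4j/log4j instead of System.err.println
    private val log = LoggerFactory.getLogger(getClass)

    def main(args: Array[String]): Unit = {
      log.info("job started")
    }
  }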

Thank you in advance

Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?

FYI, there's a /root/copy-dirs script that can be handy when updating
files on an already-running cluster. You'll want to restart the spark
cluster for the changes to take effect, as described at
https://spark.apache.org/docs/latest/ec2-scripts.html

Hope that helps,
-Jey

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com wrote:
 Hi All,

 I'm a dev at Continuum and we are developing a fair amount of tooling around
 Spark.  A few days ago someone expressed interest in numpy+pyspark and
 Anaconda came up as a reasonable solution.

 I spent a number of hours yesterday trying to rework the base Spark AMI on
 EC2 but sadly was defeated by a number of errors.

 Aggregations seemed to choke -- whereas small takes executed as expected
 (errors are linked to the gist):

 >>> sc.appName
 u'PySparkShell'
 >>> sc._conf.getAll()
 [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
 (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
 (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
 u'/root/ephemeral-hdfs/conf'), (u'spark.master',
 u'spark://.compute-1.amazonaws.com:7077')]
 >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
 >>> file.take(2)
 [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
 u'']

 >>> lines = file.filter(lambda x: len(x) > 0)
 >>> lines.count()
 VARIOUS ERRORS DISCUSSED BELOW

 My first thought was that I could simply get away with including anaconda on
 the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
 Doing so resulted in some strange py4j errors like the following:

 Py4JError: An error occurred while calling o17.partitions. Trace:
 py4j.Py4JException: Method partitions([]) does not exist

 At some point I also saw:
 SystemError: Objects/cellobject.c:24: bad argument to internal function

 which is really strange, possibly the result of a version mismatch?

 I had another thought of building spark from master on the AMI, leaving the
 spark directory in place, and removing the spark call from the modules list
 in spark-ec2 launch script. Unfortunately, this resulted in the following
 errors:

 https://gist.github.com/quasiben/da0f4778fbc87d02c088

 If a spark dev was willing to make some time in the near future, I'm sure
 she/he and I could sort out these issues and give the Spark community a
 python distro ready to go for numerical computing.  For instance, I'm not
 sure how pyspark calls out to launching a python session on a slave?  Is
 this done as root or as the hadoop user? (i believe i changed /etc/bashrc to
 point to my anaconda bin directory so it shouldn't really matter.  Is there
 something special about the py4j zip include in spark dir compared with the
 py4j in pypi?

 Thoughts?

 --Ben




Re: LIMIT with offset in SQL queries

2014-07-03 Thread Michael Armbrust
Doing an offset is actually pretty expensive in a distributed query engine,
so in many cases it probably makes sense to just collect and then perform
the offset as you are doing now.  This is unless the offset is very large.

Another limitation here is that HiveQL does not support OFFSET.  That said
if you want to open a JIRA we would consider implementing it.
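
For reference, a driver-side sketch of the collect-then-offset approach (the
"people" table and the query are hypothetical):

  // emulate "LIMIT 5, 10": fetch a bounded result, then slice it on the driver
  val rows = sqlContext.sql("SELECT name FROM people ORDER BY name LIMIT 15").collect()
  val page = rows.slice(5, 15)   // rows 5..14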


On Wed, Jul 2, 2014 at 1:37 PM, durin m...@simon-schaefer.net wrote:

 Hi,

 in many SQL-DBMS like MySQL, you can set an offset for the LIMIT clause,
 s.t. LIMIT 5, 10 will return 10 rows, starting from row 5.

 As far as I can see, this is not possible in Spark-SQL.
 The best solution I have to imitate that (using Scala) is converting the
 RDD
 into an Array via collect() and then using a for-loop to return certain
 elements from that Array.




 Is there a better solution regarding performance and are there plans to
 implement an offset for LIMIT?


 Kind regards,
 Simon



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/LIMIT-with-offset-in-SQL-queries-tp8673.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Which version of Hive support Spark Shark

2014-07-03 Thread Michael Armbrust
Spark SQL is based on Hive 0.12.0.


On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote:

 Hi ,
   Can anyone please help me understand which version of Hive supports
 Spark and Shark?

 --
 --
 Regards,
 RAVI PRASAD. T



Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread Mosharaf Chowdhury
Hi Jack,

   1. Several previous instances of key not valid? error had been
   attributed to memory issues, either memory allocated per executor or per
   task, depending on the context. You can google it to see some examples.
   2. I think your case is similar, even though it's happening due to
   broadcast. I suspect specifically, this line 14/07/02 18:20:09 INFO
   BlockManagerMaster: Updated info of block broadcast_2_piece0 after the
   driver commanded a shutdown. It's happening only for TorrentBroadcast,
   because HttpBroadcast does not store intermediate chunks of a broadcast in
   memory.
   3. You might want to allocate more memory to Spark executors to take
   advantage of in-memory processing. ~300MB caching space per machine is
   likely to be too small for most jobs.
   4. Another common cause of disconnection is the
spark.akka.frameSize parameter.
   You can try playing with it. While you don't have enough memory to crank it up,
   you can try moving it up and down within reason.
   5. There is one more curious line in your trace: 14/07/02 18:20:06 INFO
   BlockManager: Removing broadcast 0 -- nothing should've been there to
   remove in the first place.
   6. Finally, we found in our benchmarks that using TorrentBroadcast in
   smaller clusters (<10 nodes) and with small data sizes (<10MB) has no benefit
   over HttpBroadcast, and is often worse. I'd suggest sticking to HttpBroadcast
   unless you have gigantic broadcasts (>=1GB) or very many nodes (many 10s or
   100s); see the configuration sketch just after this list.
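
A configuration sketch covering the knobs mentioned in points 3, 4 and 6 above
(the values are illustrative only, not recommendations):

  import org.apache.spark.{SparkConf, SparkContext}

  // illustrative values -- tune for your own cluster and job
  val conf = new SparkConf()
    .setAppName("BroadcastTest")
    .set("spark.executor.memory", "2g")      // more room than ~300MB of caching space
    .set("spark.akka.frameSize", "20")       // in MB
    .set("spark.broadcast.factory",
         "org.apache.spark.broadcast.HttpBroadcastFactory")
  val sc = new SparkContext(conf)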

Hope it helps,
Mosharaf

--
Mosharaf Chowdhury
http://www.mosharaf.com/


On Thu, Jul 3, 2014 at 7:48 AM, jackxucs jackx...@gmail.com wrote:

 Hello,

 I am running the BroadcastTest example in a standalone cluster using
 spark-submit. I have 8 host machines and made Host1 the master. Host2 to
 Host8 act as 7 workers to connect to the master. The connection was fine as
 I could see all 7 hosts on the master web ui. The BroadcastTest example
 with
 Http broadcast also works fine, I think, as there was no error msg and all
 workers EXITED at the end. But when I changed the third argument from
 Http to Torrent to use Torrent broadcast, all workers got a KILLED
 status once they reached sc.stop().

 Below is the stderr on one of the workers when running Torrent broadcast (I
 masked the IP addresses):

 ==
 14/07/02 18:20:03 INFO SecurityManager: Changing view acls to: root
 14/07/02 18:20:03 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(root)
 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
 14/07/02 18:20:04 INFO Remoting: Starting remoting
 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771]
 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771]
 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root
 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(root)
 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
 down remote daemon.
 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote
 daemon shut down; proceeding with flushing remote transports.
 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
 14/07/02 18:20:04 INFO Remoting: Starting remoting
 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661]
 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661]
 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Connecting to driver:
 akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/CoarseGrainedScheduler
 14/07/02 18:20:04 INFO WorkerWatcher: Connecting to worker
 akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker
 14/07/02 18:20:04 INFO Remoting: Remoting shut down
 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
 shut down.
 14/07/02 18:20:04 INFO WorkerWatcher: Successfully connected to
 akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker
 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Successfully
 registered
 with driver
 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root
 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(root)
 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started
 14/07/02 18:20:04 INFO Remoting: Starting remoting
 14/07/02 18:20:05 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://spark@dyn-xxx-xx-xx-xx:57883]
 14/07/02 18:20:05 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@dyn-xxx-xx-xx-xx:57883]
 14/07/02 18:20:05 INFO SparkEnv: Connecting to MapOutputTracker:
 

Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Hello!

I want to play around with several different cluster settings and measure
performance for MLlib and GraphX, and was wondering if anybody here could
hit me up with datasets for these applications from 5GB onwards?

I'm mostly interested in SVM and Triangle Count, but would be glad for any
help.

Best regards,
Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions




For svm there are a couple of ad click prediction datasets of pretty large size.




For graph stuff the SNAP has large network data: https://snap.stanford.edu/data/



—
Sent from Mailbox

On Thu, Jul 3, 2014 at 3:25 PM, AlexanderRiggers
alexander.rigg...@gmail.com wrote:

 Hello!
 I want to play around with several different cluster settings and measure
 performances for MLlib and GraphX  and was wondering if anybody here could
 hit me up with datasets for these applications from 5GB onwards? 
 I mostly interested in SVM and Triangle Count, but would be glad for any
 help.
 Best regards,
 Alex
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Nick Pentreath wrote
 Take a look at Kaggle competition datasets
 - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything of a bigger
size on Kaggle. Most competitions I've seen need data processing and feature
generation, but maybe I have to take a second look.


Nick Pentreath wrote
 For graph stuff the SNAP has large network
 data: https://snap.stanford.edu/data/

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
The Kaggle data is not in libsvm format so you'd have to do some transformation.


The Criteo and KDD cup datasets are if I recall fairly large. Criteo ad 
prediction data is around 2-3GB compressed I think.




To my knowledge these are the largest binary classification datasets I've come 
across which are easily publicly available (very happy to be proved wrong about 
this though :)
—
Sent from Mailbox

On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers
alexander.rigg...@gmail.com wrote:

 Nick Pentreath wrote
 Take a look at Kaggle competition datasets
 - https://www.kaggle.com/competitions
 I was looking for files in LIBSVM format and never found something on Kaggle
 in bigger size. Most competitions I ve seen need data processing and feature
 generating, but maybe I ve to take a second look.
 Nick Pentreath wrote
 For graph stuff the SNAP has large network
 data: https://snap.stanford.edu/data/
 Thanks
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Case class in java

2014-07-03 Thread Kevin Jung
This will load the listed jars when the SparkContext is created.
In the case of the REPL, we define and import classes after the SparkContext is
created. According to the above-mentioned site, the Executor installs a class
loader in the 'addReplClassLoaderIfNeeded' method using the spark.repl.class.uri
configuration.
I will then try to make a class server that distributes dynamically created
classes from my driver application to the Executors, the way the Spark REPL does.

Thanks,
Kevin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Case-class-in-java-tp8720p8765.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Kafka - streaming from multiple topics

2014-07-03 Thread Tobias Pfeiffer
Sergey,


On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote:

 On the other hand, under the hood the KafkaInputDStream which is created with
 this KafkaUtils call calls ConsumerConnector.createMessageStream, which
 returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not
 exposed.


I wonder if this is a bug. After all, KafkaUtils.createStream() returns a
DStream[(String, String)], which pretty much looks like it should be a
(topic -> message) mapping. However, for me, the key is always null. Maybe
you could consider filing a bug/wishlist report?

Tobias


[no subject]

2014-07-03 Thread Steven Cox
Folks, I have a program derived from the Kafka streaming wordcount example 
which works fine standalone.


Running on Mesos is not working so well. For starters, I get the error below 
No FileSystem for scheme: hdfs.


I've looked at lots of promising comments on this issue so now I have -

* Every jar under hadoop in my classpath

* Hadoop HDFS and Client in my pom.xml


I find it odd that the app writes checkpoint files to HDFS successfully for a 
couple of cycles then throws this exception. This would suggest the problem is 
not with the syntax of the hdfs URL, for example.


Any thoughts on what I'm missing?


Thanks,


Steve


Mesos : 0.18.2

Spark : 0.9.1



14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0)

14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times; aborting 
job

14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job 
140443646 ms.0

org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10 times 
(most recent failure: Exception failure: java.io.IOException: No FileSystem for 
scheme: hdfs)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at scala.Option.foreach(Option.scala:236)

at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.actor.ActorCell.invoke(ActorCell.scala:456)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)




No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
...and a real subject line.

From: Steven Cox [s...@renci.org]
Sent: Thursday, July 03, 2014 9:21 PM
To: user@spark.apache.org
Subject:


Folks, I have a program derived from the Kafka streaming wordcount example 
which works fine standalone.


Running on Mesos is not working so well. For starters, I get the error below 
No FileSystem for scheme: hdfs.


I've looked at lots of promising comments on this issue so now I have -

* Every jar under hadoop in my classpath

* Hadoop HDFS and Client in my pom.xml


I find it odd that the app writes checkpoint files to HDFS successfully for a 
couple of cycles then throws this exception. This would suggest the problem is 
not with the syntax of the hdfs URL, for example.


Any thoughts on what I'm missing?


Thanks,


Steve


Mesos : 0.18.2

Spark : 0.9.1



14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0)

14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times; aborting 
job

14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job 
140443646 ms.0

org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10 times 
(most recent failure: Exception failure: java.io.IOException: No FileSystem for 
scheme: hdfs)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at scala.Option.foreach(Option.scala:236)

at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.actor.ActorCell.invoke(ActorCell.scala:456)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)




Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Soren Macbeth
Are the hadoop configuration files on the classpath for your mesos
executors?


On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox s...@renci.org wrote:

  ...and a real subject line.
  --
 *From:* Steven Cox [s...@renci.org]
 *Sent:* Thursday, July 03, 2014 9:21 PM
 *To:* user@spark.apache.org
 *Subject:*

   Folks, I have a program derived from the Kafka streaming wordcount
 example which works fine standalone.


  Running on Mesos is not working so well. For starters, I get the error
 below No FileSystem for scheme: hdfs.


  I've looked at lots of promising comments on this issue so now I have -

 * Every jar under hadoop in my classpath

 * Hadoop HDFS and Client in my pom.xml


  I find it odd that the app writes checkpoint files to HDFS successfully
 for a couple of cycles then throws this exception. This would suggest the
 problem is not with the syntax of the hdfs URL, for example.


  Any thoughts on what I'm missing?


  Thanks,


  Steve


  Mesos : 0.18.2

 Spark : 0.9.1



  14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0)

 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1)

 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0)

 14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times;
 aborting job

 14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job
 140443646 ms.0

 org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10
 times (most recent failure: Exception failure: java.io.IOException: No
 FileSystem for scheme: hdfs)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

 at akka.actor.ActorCell.invoke(ActorCell.scala:456)

 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)





RE: No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
They weren't. They are now and the logs look a bit better - like perhaps some 
serialization is completing that wasn't before.

But I still get the same error periodically. Other thoughts?


From: Soren Macbeth [so...@yieldbot.com]
Sent: Thursday, July 03, 2014 9:54 PM
To: user@spark.apache.org
Subject: Re: No FileSystem for scheme: hdfs

Are the hadoop configuration files on the classpath for your mesos executors?


On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox s...@renci.org wrote:
...and a real subject line.

From: Steven Cox [s...@renci.org]
Sent: Thursday, July 03, 2014 9:21 PM
To: user@spark.apache.org
Subject:


Folks, I have a program derived from the Kafka streaming wordcount example 
which works fine standalone.


Running on Mesos is not working so well. For starters, I get the error below 
No FileSystem for scheme: hdfs.


I've looked at lots of promising comments on this issue so now I have -

* Every jar under hadoop in my classpath

* Hadoop HDFS and Client in my pom.xml


I find it odd that the app writes checkpoint files to HDFS successfully for a 
couple of cycles then throws this exception. This would suggest the problem is 
not with the syntax of the hdfs URL, for example.


Any thoughts on what I'm missing?


Thanks,


Steve


Mesos : 0.18.2

Spark : 0.9.1



14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1)

14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0)

14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times; aborting 
job

14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job 
140443646 ms.0

org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10 times 
(most recent failure: Exception failure: java.io.IOException: No FileSystem for 
scheme: hdfs)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

at scala.Option.foreach(Option.scala:236)

at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.actor.ActorCell.invoke(ActorCell.scala:456)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)





Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the Joining ... is slow message is that you're
joining VertexRDDs without having cached them first. This will cause Spark
to recompute unnecessarily, and as a side effect, the same index will get
created twice and GraphX won't be able to do an efficient zip join.

For example, the following code will counterintuitively produce the
Joining ... is slow message:

val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x))))
a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) }

The remedy is to call a.cache() before a.leftJoin(a).
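
Putting the two together, a minimal sketch of the fixed version (the Option
comes from leftJoin possibly finding no match):

  // cache the VertexRDD before self-joining so the index is built once and reused
  val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x)))).cache()
  a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) }   // zip join on the shared index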

Ankur http://www.ankurdave.com/


SparkSQL with Streaming RDD

2014-07-03 Thread Chang Lim
Would appreciate help on:
1. How to convert streaming RDD into JavaSchemaRDD
2. How to structure the driver program to do interactive SparkSQL

Using Spark 1.0 with Java.

I have streaming code that does updateStateByKey, resulting in a JavaPairDStream. 
I am using JavaDStream::compute(time) to get a JavaRDD.  However I am getting
the runtime exception:
   ERROR at runtime: org.apache.spark.streaming.dstream.StateDStream@18dc1b2
has not been initialized 

I know the code is executed before the stream is initialized.  Does anyone
have suggestions on how to design the code to accommodate async processing?  

Code Fragment:
//Spark SQL for the N seconds interval
SparkConf sparkConf = new SparkConf().setMaster(SPARK_MASTER).setAppName("SQLStream");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
final JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);

//convert JavaPairDStream to JavaDStream
JavaDStream<Tuple2<String, TestConnection.DiscoveryRecord>> javaDStream =
    statefullStream.toJavaDStream();

//Convert Tuple2<K,U> to U
JavaDStream<DiscoveryRecord> javaRDD = javaDStream.map(
    new Function<Tuple2<String, TestConnection.DiscoveryRecord>, DiscoveryRecord>() {
        public DiscoveryRecord call(Tuple2<String, TestConnection.DiscoveryRecord> eachT) {
            return eachT._2;
        }
    }
);

//Convert JavaDStream to JavaRDD
//ERROR next line at runtime:
//org.apache.spark.streaming.dstream.StateDStream@18dc1b2 has not been initialized
JavaRDD<DiscoveryRecord> computedJavaRDD = javaRDD.compute(new Time(10));

JavaSchemaRDD schemaStatefull = sqlCtx.applySchema(computedJavaRDD, DiscoveryRecord.class);
schemaStatefull.registerAsTable("statefull");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-with-Streaming-RDD-tp8774.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Akhil Das
Most likely you are missing the hadoop configuration files (present in
conf/*.xml).
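
If putting them on the classpath is awkward, the driver can also load them
explicitly; a minimal sketch, assuming the cluster's config lives under
/etc/hadoop/conf (the executors still need access to the same settings):

  import org.apache.hadoop.fs.Path

  // point the driver's Hadoop Configuration at the cluster's config files
  val hadoopConf = sc.hadoopConfiguration
  hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
  hadoopConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))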

Thanks
Best Regards


On Fri, Jul 4, 2014 at 7:38 AM, Steven Cox s...@renci.org wrote:

  They weren't. They are now and the logs look a bit better - like perhaps
 some serialization is completing that wasn't before.

  But I still get the same error periodically. Other thoughts?

  --
 *From:* Soren Macbeth [so...@yieldbot.com]
 *Sent:* Thursday, July 03, 2014 9:54 PM
 *To:* user@spark.apache.org
 *Subject:* Re: No FileSystem for scheme: hdfs

   Are the hadoop configuration files on the classpath for your mesos
 executors?


 On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox s...@renci.org wrote:

  ...and a real subject line.
  --
 *From:* Steven Cox [s...@renci.org]
 *Sent:* Thursday, July 03, 2014 9:21 PM
 *To:* user@spark.apache.org
 *Subject:*

   Folks, I have a program derived from the Kafka streaming wordcount
 example which works fine standalone.


  Running on Mesos is not working so well. For starters, I get the error
 below No FileSystem for scheme: hdfs.


  I've looked at lots of promising comments on this issue so now I have -

 * Every jar under hadoop in my classpath

 * Hadoop HDFS and Client in my pom.xml


  I find it odd that the app writes checkpoint files to HDFS successfully
 for a couple of cycles then throws this exception. This would suggest the
 problem is not with the syntax of the hdfs URL, for example.


  Any thoughts on what I'm missing?


  Thanks,


  Steve


  Mesos : 0.18.2

 Spark : 0.9.1



  14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0)

 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1)

 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0)

 14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times;
 aborting job

 14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job
 140443646 ms.0

 org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10
 times (most recent failure: Exception failure: java.io.IOException: No
 FileSystem for scheme: hdfs)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

 at akka.actor.ActorCell.invoke(ActorCell.scala:456)

 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)