Re: Setting queue for spark job on yarn

2014-05-20 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron Sent from my iPad > On May 20, 2014, at 6:27 PM, Ron Gonzalez wrote: > > Hi Sandy, > Is there a programmatic way? We're building a platform as a service and > need to assign it to different queues that
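
A hedged sketch of the programmatic route in later releases, worth checking against your version: pass the queue through SparkConf before creating the context. The "spark.yarn.queue" property name and the queue name below are assumptions; for 0.9, the SPARK_YARN_QUEUE environment variable mentioned later in this thread still applies.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hedged sketch: "spark.yarn.queue" is the property used by later Spark
    // releases; verify against your version. Queue name is hypothetical.
    val conf = new SparkConf()
      .setAppName("queued-job")
      .setMaster("yarn-client")
      .set("spark.yarn.queue", "research")
    val sc = new SparkContext(conf)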

Re: Python, Spark and HBase

2014-05-20 Thread Nick Pentreath
Yes actually if you could possibly test the patch out and see how easy it is to load HBase RDDs, that would be great. That way I could make any amendments required to make HBase / Cassandra etc easier — Sent from Mailbox On Wed, May 21, 2014 at 4:41 AM, Matei Zaharia wrote: > Unfortunately

any way to control memory usage when the streaming input arrives faster than spark streaming can handle it?

2014-05-20 Thread Francis . Hu
sparkers, Is there a better way to control memory usage when the streaming input arrives faster than spark streaming can handle it? Thanks, Francis.Hu

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
Aaron: I see this in the Master's logs: 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038 There was

Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it. Matei On May 20, 2014, at 4:21 PM, twizansk wrote: > Hello, > > This seems like a basic question but I have been un

IllegalStateException when creating Job from shell

2014-05-20 Thread Alex Holmes
Hi, I'm trying to work with Spark from the shell and create a Hadoop Job instance. I get the exception you see below because the Job.toString doesn't like to be called until it has been submitted. I tried using the :silent command but that didn't seem to have any impact. scala> import org.apache

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Yes, it does work with fewer GZipped files. I am reading the files in using sc.textFile() and a pattern string. For example: a = sc.textFile('s3n://bucket/2014-??-??/*.gz') a.count() Nick ​ On Tue, May 20, 2014 at 10:09 PM, Madhu wrote: > I have read gzip files from S3 successfully. > > It s

Re: How to Unsubscribe from the Spark user list

2014-05-20 Thread Nicholas Chammas
Ah, here's that address again (looks like the mailing list stripped it out for privacy's sake): user-unsubscribe [at] spark.apache.org Nick On Tue, May 20, 2014 at 10:11 PM, Nick Chammas wrote: > Send an email to this address to unsubscribe from the Spark user list: > > [hidden email]

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in

How to Unsubscribe from the Spark user list

2014-05-20 Thread Nick Chammas
Send an email to this address to unsubscribe from the Spark user list: user-unsubscribe [at] spark.apache.org Sending an email to the Spark user list itself (i.e. this list) *does not do anything*, even if you put "unsubscribe" as the subject. We will all just see your email. Nick -- View this me

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Madhu
I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.100

Unsubscribe

2014-05-20 Thread A.Khanolkar

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Any tips on how to troubleshoot this? On Thu, May 15, 2014 at 4:15 PM, Nick Chammas wrote: > I’m trying to do a simple count() on a large number of GZipped files in > S3. My job is failing with the following message: > > 14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to > java.io
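
One hedged way to troubleshoot, sketched in Scala: count each file separately so the corrupt member identifies itself. The bucket and file names are hypothetical.

    // Assuming sc is an existing SparkContext (e.g. the spark-shell's).
    val paths = Seq(
      "s3n://bucket/2014-05-01/part-0.gz",   // hypothetical file list
      "s3n://bucket/2014-05-02/part-0.gz")
    for (p <- paths) {
      try {
        println(p + " => " + sc.textFile(p).count() + " lines")
      } catch {
        // A corrupt member typically surfaces as a SparkException wrapping the IOException.
        case e: Exception => println(p + " FAILED: " + e.getMessage)
      }
    }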

Using Spark to analyze complex JSON

2014-05-20 Thread Nick Chammas
The Apache Drill home page has an interesting heading: "Liberate Nested Data". Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick -- View this message in context: http://apache-spark-user-

Re: Setting queue for spark job on yarn

2014-05-20 Thread Ron Gonzalez
Hi Sandy, Is there a programmatic way? We're building a platform as a service and need to assign it to different queues that can provide different scheduler approaches. Thanks, Ron Sent from my iPhone > On May 20, 2014, at 1:30 PM, Sandy Ryza wrote: > > Hi Ron, > > What version are you us

Python, Spark and HBase

2014-05-20 Thread twizansk
Hello, This seems like a basic question but I have been unable to find an answer in the archives or other online sources. I would like to know if there is any way to load an RDD from HBase in Python. In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a TableInputFormat class. Is
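
For reference, a minimal sketch of the Java/Scala route the post describes, assuming the HBase client jars are on the classpath and sc is an existing SparkContext; the table name is hypothetical.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table
    val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(hbaseRdd.count())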

Re: facebook data mining with Spark

2014-05-20 Thread Michael Cutler
Hello Joe, The first step is acquiring some data, either through the Facebook API or a third-party service like Datasift (paid). Once you've acquired some data and put it somewhere Spark can access (like HDFS), you can then load and man

Spark Performace Comparison Spark on YARN vs Spark Standalone

2014-05-20 Thread anishs...@yahoo.co.in
Hi All, I need to analyse the performance of Spark on YARN vs Spark Standalone. Please suggest if there are any pre-published comparison statistics available. TIA -- Anish Sneh http://in.linkedin.com/in/anishsneh

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Aaron Davidson
So the current stalling is simply sitting there with no log output? Have you jstack'd an Executor to see where it may be hanging? Are you observing memory or disk pressure ("df" and "df -i")? On Tue, May 20, 2014 at 2:03 PM, jonathan.keebler wrote: > Thanks for the suggestion, Andrew. We have a

Re: Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-20 Thread anishs...@yahoo.co.in
Thanks Mayur, it is working :) -- Anish Sneh http://in.linkedin.com/in/anishsneh

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread jonathan.keebler
Thanks for the suggestion, Andrew. We have also implemented our solution using reduceByKey, but observe the same behavior. For example, if we do the following: map1 -> groupByKey -> map2 -> saveAsTextFile, then the stalling will occur during the map1 + groupByKey execution. If we do map1 -> reduceByKey -> map2

Re: Setting queue for spark job on yarn

2014-05-20 Thread Sandy Ryza
Hi Ron, What version are you using? For 0.9, you need to set it outside your code with the SPARK_YARN_QUEUE environment variable. -Sandy On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez wrote: > Hi, > How does one submit a spark job to yarn and specify a queue? > The code that successfully

Imports that need to be specified in a Spark application jar?

2014-05-20 Thread Shivani Rao
Hello All, I am learning that there are certain imports the Spark REPL performs automatically when invoking and running code in a spark shell, which I would have to import explicitly if I need the same functionality in a spark jar run from the command line. I am getting into a repeated serialization error of an RDD
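
For what it's worth, the 0.9-era shell pulls in at least these two imports automatically, and a standalone jar must declare them itself; SparkContext._ carries the implicit conversions, e.g. the pair-RDD functions.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversions, e.g. rddToPairRDDFunctions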

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible. On May 20, 2014 12:53 PM, "Jon Keebler" wrote: > So we upped the spark.akka.frameSize value to 128 MB and
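
A minimal sketch of the substitution, assuming the per-key result is an aggregate (a sum here) and sc is the shell's SparkContext; pairs is a hypothetical RDD[(String, Int)].

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))
    // groupByKey ships every value for a key to one node before you aggregate:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)
    // reduceByKey combines map-side first, so skewed keys move far less data:
    val viaReduce = pairs.reduceByKey(_ + _)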

Re: Evaluating Spark just for Cluster Computing

2014-05-20 Thread Sean Owen
My $0.02: If you are simply reading input records, running a model, and outputting the result, then it's a simple "map-only" problem and you're mostly looking for a process to baby-sit these operations. Lots of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd choose whatever is simp

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

2014-05-20 Thread pcutil
This is the first time I'm trying Spark. I just downloaded it and am trying the SimpleApp Java program using Maven. I added 2 Maven dependencies -- spark-core and scala-library. Even though my program is in Java, I was forced to add the Scala dependency. Is that really required? Now, I'm able to

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Sean Owen
This isn't helpful of me to say, but, I see the same sorts of problem and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Jon Keebler
So we upped the spark.akka.frameSize value to 128 MB and still observed the same behavior. It's happening not necessarily when data is being sent back to the driver, but when there is an inter-cluster shuffle, for example during a groupByKey. Is it possible we should focus on tuning these paramet
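
For reference, the setting the thread refers to, sketched against the 0.9-era SparkConf; the value is interpreted in MB, and 128 mirrors the thread.

    import org.apache.spark.SparkConf

    // Must be set before the SparkContext is created.
    val conf = new SparkConf().set("spark.akka.frameSize", "128")  // MB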

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Thanks, that sounds perfect On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng wrote: > You can search for XMLInputFormat on Google. There are some > implementations that allow you to specify the tag to split on, e.g.: > > https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collecti

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
Yes, we are on Spark 0.9.0 so that explains the first piece, thanks! Also, yes, I meant SPARK_WORKER_MEMORY. Thanks for the hierarchy. Similarly is there some best practice on setting SPARK_WORKER_INSTANCES and spark.default.parallelism? Thanks, Arun On Tue, May 20, 2014 at 3:04 PM, Andrew Or

Spark Streaming using Flume body size limitation

2014-05-20 Thread lemieud
Hi, I am trying to send events to Spark Streaming via Flume. It's working fine up to a certain point. I have problems when the size of the body is over 1020 characters. Basically, up to 1020 it works. From 1021 through 1024, the event will be accepted and there is no exception, but the channel seems t

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
So, for example, I have two disassociated worker machines at the moment. The last messages in the spark logs are akka association error messages, like the following: 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://sparke

Re: life if an executor

2014-05-20 Thread Koert Kuipers
interesting, so it sounds to me like spark is forced to choose between the ability to add jars during its lifetime and the ability to run tasks with the user classpath first (which is important for the ability to run jobs on spark clusters not under your control, and so for the viability of 3rd party spark apps)

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
We're using spark 0.9.0, and we're using it "out of the box" -- not using Cloudera Manager or anything similar. There are warnings from the master that there continue to be heartbeats from the unregistered workers. I will see if there are particular telltale errors on the worker side. We've had

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that. Matei On May 20, 2014, at 11:44 AM, Aaron Davidson wrote: > I'd just like to point out that, along with Matei, I have not seen workers > drop even under the most exotic job failures. We're ru

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Andrew Or
I'm assuming you're running Spark 0.9.x, because in the latest version of Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path manually. I tested this out on my own YARN cluster and was able to confirm that. In Spark 1.0, SPARK_MEM is deprecated and should not be used. Instea

Re: facebook data mining with Spark

2014-05-20 Thread Mayur Rustagi
Are you looking to connect it as a streaming source? You should be able to integrate it like the twitter API. Regards Mayur On May 20, 2014 9:38 AM, "Joe L" wrote: > Is there any way to get facebook data into Spark and filter the content of > it? > > > > -- > View this message in context: > http://apache-

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark. On Tue, May 20, 2014 at 11:36 AM, Arun

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Arun Ahuja
Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C we will often see in the Spark Master page the number of workers and cores decrease. They are still alive and well in the Cloudera M

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Andrew Or
Hi Gaurav and Arun, Your settings seem reasonable; as long as YARN_CONF_DIR or HADOOP_CONF_DIR is properly set, the application should be able to find the correct RM port. Have you tried running the examples in yarn-client mode, and your custom application in yarn-standalone (now yarn-cluster) mod

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
I was actually able to get this to work. I was NOT setting the classpath properly originally. Simply running java -cp /etc/hadoop/conf/: com.domain.JobClass and setting yarn-client as the spark master worked for me. Originally I had not put the configuration on the classpath. Also, I used $SPAR

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-20 Thread Jacob Eisinger
Howdy Gerard, Yeah, the docker link feature seems to work well for client-server interaction. But, peer-to-peer architectures need more for service discovery. As for your addressing requirements, I don't completely understand what you are asking for... you may also want to check out xip.io. Th

Re: Spark and Hadoop

2014-05-20 Thread Andras Barjak
You can download any of them, I would go with the latest versions, or just download the source and build it yourself. For experimenting with basic things you can just launch the REPL and start right away in spark local mode not using any hadoop stuff. 2014-05-20 19:43 GMT+02:00 pcutil : > I'm a

Re: Spark and Hadoop

2014-05-20 Thread Andrew Ash
Hi Puneet, If you're not going to read/write data in HDFS from your Spark cluster, then it doesn't matter which one you download. Just go with "Hadoop 2" as that's more likely to connect to an HDFS cluster in the future if you ever do decide to use HDFS because it's the newer APIs. Cheers, Andre

Spark and Hadoop

2014-05-20 Thread pcutil
I'm a first time user and need to try just a hello-world kind of program in Spark. Now on the downloads page, I see the following 3 options for pre-built packages that I can download: - Hadoop 1 (HDP1, CDH3) - CDH4 - Hadoop 2 (HDP2, CDH5) I'm confused about which one I need to download. I need to try jus

Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
You can search for XMLInputFormat on Google. There are some implementations that allow you to specify the tag to split on, e.g.: https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld wrote: > Unfortuna
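
A hedged sketch of wiring such an input format into Spark. The class name, config keys, and Hadoop API generation differ between implementations (Cloud9, Mahout, etc.), so everything below the imports is an assumption to check against whichever one you pick; splitting GraphML on <node> is illustrative, and sc is assumed to be an existing SparkContext.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}

    val xmlConf = new Configuration()
    xmlConf.set("xmlinput.start", "<node")   // assumed key name; check your chosen class
    xmlConf.set("xmlinput.end", "</node>")   // assumed key name
    // XmlInputFormat below is a placeholder for whichever implementation you pick.
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/graph.graphml",          // hypothetical path
      classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], xmlConf)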

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Unfortunately, I don't have a bunch of moderately big xml files; I have one, really big file - big enough that reading it into memory as a single string is not feasible. On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng wrote: > Try sc.wholeTextFiles(). It reads the entire file into a string > rec

Evaluating Spark just for Cluster Computing

2014-05-20 Thread pcutil
Hi - We have a use case for batch processing for which we are trying to figure out if Apache Spark would be a good fit or not. We have a universe of identifiers sitting in an RDBMS, for which we need to fetch input data from the RDBMS and then pass that input to analytical models that generate some outp

Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
Try sc.wholeTextFiles(). It reads the entire file into a string record. -Xiangrui On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld wrote: > We are trying to read some large GraphML files to use in spark. > > Is there an easy way to read XML-based files like this that accounts for > partition bo
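
A minimal sketch of that call, assuming each file fits in memory as a single record (the constraint raised elsewhere in this thread) and that sc is the shell's SparkContext; the directory and the <node> count are illustrative.

    // Each element is (path, entire file contents).
    val files = sc.wholeTextFiles("hdfs:///data/graphml-dir")  // hypothetical directory
    val nodeCounts = files.mapValues(xml => "<node".r.findAllMatchIn(xml).length)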

Re: life if an executor

2014-05-20 Thread Aaron Davidson
One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct. On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers wrote: > just for my clarification: off heap cannot be java objects,
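
For reference, the runtime capability Aaron describes is SparkContext's addJar call; a minimal sketch, with a hypothetical path:

    // Assuming sc is an existing SparkContext. The jar is fetched by executors
    // (current and future ones) before they run tasks that need it.
    sc.addJar("/path/to/extra-lib.jar")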

Re: filling missing values in a sequence

2014-05-20 Thread Mohit Jaggi
Xiangrui, Thanks for the pointer. I think it should work... for now I cooked up my own, which is similar but built on top of the spark core APIs. I would suggest moving the sliding-window RDD to the core spark library. It seems quite general to me, and a cursory look at the code indicates nothing specific to

Re: issue with Scala, Spark and Akka

2014-05-20 Thread Gerard Maas
This error message says "I can't find the config for the akka subsystem". That is typically included in the Spark assembly. First, you need to compile your spark distro, by running sbt/sbt assembly on the SPARK_HOME dir. Then, use the SPARK_HOME (through env or configuration) to point to your SPARK

reading large XML files

2014-05-20 Thread Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley St

issue with Scala, Spark and Akka

2014-05-20 Thread Greg
Hi, I have the following Scala code: ===--- import org.apache.spark.SparkContext class test { def main(){ val sc = new SparkContext("local", "Scala Word Count") } } ===--- and the following build.sbt file ===--- name := "test" version := "1.0" scalaVersion := "2.10.4" libraryDependencie
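
A hedged correction of the snippet, independent of the akka issue: main needs to live on an object and take the usual args array, or the JVM never finds an entry point.

    import org.apache.spark.SparkContext

    object Test {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "Scala Word Count")
        // ... job logic ...
        sc.stop()
      }
    }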

Re: Advanced log processing

2014-05-20 Thread Laurent T
Thanks for the advice. I think you're right. I'm not sure we're going to use HBase, but starting by partitioning data into multiple buckets will be a first step. I'll see how it performs on large datasets. My original question though was more like: is there a spark trick I don't know about? Curren

Ignoring S3 0 files exception

2014-05-20 Thread Laurent T
Hi, I'm trying to get data from S3 using sc.textFile("s3n://"+filenamePattern) It seems that if a pattern gives out no result i get an exception like so: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://bucket/20140512/* matches 0 files at org.apache.hadoop.mapred.FileI
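
One hedged workaround, sketched with Hadoop's FileSystem API: glob first and only hand Spark a pattern that matched something. The pattern is illustrative and sc is assumed to be an existing SparkContext.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val pattern = new Path("s3n://bucket/20140512/*")  // hypothetical pattern
    val fs = pattern.getFileSystem(new Configuration())
    // globStatus returns null (or an empty array) when nothing matches.
    val matched = Option(fs.globStatus(pattern)).map(_.toSeq).getOrElse(Seq.empty)
    val rdd = if (matched.nonEmpty) sc.textFile(pattern.toString)
              else sc.parallelize(Seq.empty[String])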

Re: life if an executor

2014-05-20 Thread Koert Kuipers
if they are tied to the spark context, then why can the subprocess not be started up with the extra jars (sc.addJars) already on class path? this way a switch like user-jars-first would be a simple rearranging of the class path for the subprocess, and the messing with classloaders that is currently

Re: life if an executor

2014-05-20 Thread Koert Kuipers
just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, "Tathagata Das" wrote: > That's one the main motivation in using Tachyon ;) > http://tachyon-project.org/ > > It gives off heap in-memory cachi

Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Still the same. I increased the memory of the node holding resource manager to 5 Gig. I also spotted an HDFS alert of replication factor 3 that I now dropped to the number of data nodes. I also shut all down all services not in use. Still the issue remains. I have noticed following two events t

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread gaurav.dasgupta
Few more details I would like to provide (sorry, I should have provided these with the previous post): - Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2) - Hadoop Version = 2.4.0 (Hortonworks) - I am trying to execute a Spark Streaming program Because I am using Hortonworks Hadoo

Re: question about the license of akka and Spark

2014-05-20 Thread Sean Owen
The page says "Akka is Open Source and available under the Apache 2 License." It may also be available under another license, but that does not change the fact that it may be used by adhering to the terms of the AL2. The section is referring to commercial support that Typesafe sells. I am not eve

Re: question about the license of akka and Spark

2014-05-20 Thread YouPeng Yang
Hi Well, maybe I got the wrong reference: http://doc.akka.io/docs/akka/2.3.2/intro/what-is-akka.html On that page, the last bold tag, Commercial Support, indicates that akka is under a commercial license; by the way, the version is 2.3.2 2014-05-20 17:30 GMT+08:00 Tathagata Das : > Akka

Re: question about the license of akka and Spark

2014-05-20 Thread Tathagata Das
Akka is under Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang wrote: > Hi > Just know akka is under a commercial license,however Spark is under the > apache > license. > Is there any problem? > > > Regards >

question about the license of akka and Spark

2014-05-20 Thread YouPeng Yang
Hi I just learned that akka is under a commercial license; however, Spark is under the apache license. Is there any problem? Regards

Re: Worker re-spawn and dynamic node joining

2014-05-20 Thread Han JU
Thank you guys for the detailed answer. Akhil, yes, I would like to try your tool. Is it open-sourced? 2014-05-17 17:55 GMT+02:00 Mayur Rustagi : > A better way would be to use Mesos (and quite possibly Yarn in 1.0.0). > That will allow you to add nodes on the fly & leverage it for Spark.

Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Hi, each node has 4Gig of memory. After total reboot and re-run of SparkPi resource manager shows no running containers and 1 pending container. -jan On 20 May 2014, at 10:24, wrote: > Hi Jan, > > How much memory capacity is configured for each node? > > If you go to the ResourceManager

Problem with loading files: Loss was due to java.io.EOFException java.io.EOFException

2014-05-20 Thread hakanilter
Hi everyone, I'm having problems with loading files. Whether with Java code or the spark-shell, I get the same errors when I try to load a text file. I added hadoop-client and hadoop-common 2.0.0-cdh4.6.0 as dependencies and maven-shade-plugin is configured. I have CDH 4.6.0, spark-0.9.1-bin-cdh4 and JD

rdd.map() can't pass parameters

2014-05-20 Thread zzzzzqf12345
I do some image matting on spark streaming, and I put the background images in a broadcast var, RDD[String,Qimage] => a sorted Array[Qimage] val qingbg = broadcastbg.value.collect.sortWith((a,b) => a._1.toInt < b._1.toInt).map(data => data._2) When an image comes, I want to get its background imag
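
A hedged sketch of the usual shape of this pattern: broadcast the already-collected, sorted array, then read it inside the map closure, so no extra map() parameter is needed. Qimage is reduced to a stand-in type and the indexing is illustrative; sc is assumed to be an existing SparkContext.

    case class Qimage(id: Int)  // stand-in for the post's image type

    val sortedBgs = Array(Qimage(0), Qimage(1), Qimage(2))  // pre-sorted backgrounds
    val bg = sc.broadcast(sortedBgs)
    val frames = sc.parallelize(Seq(2, 0, 1))
    // The closure reads the broadcast value on each executor.
    val matted = frames.map(i => bg.value(i))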

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread gaurav.dasgupta
Hi, Even I am encountering the same problem and exactly the same console logs while running custom Spark programs using YARN. I have checked all the information provided elsewhere and confirmed the same in my system: - Set HADOOP_CONF_DIR=/etc/hadoop/conf - Set YARN_CONF_DIR=/etc/hadoop/conf

Re: combinebykey throw classcastexception

2014-05-20 Thread xiemeilong
It turns out this issue was caused by a version mismatch between the driver (0.9.1) and the server (0.9.0-cdh5.0.1). Other functions worked fine, but combineByKey did not. Thank you very much for your reply. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/combinebyke

Re: Status stays at ACCEPTED

2014-05-20 Thread sandy . ryza
Hi Jan, How much memory capacity is configured for each node? If you go to the ResourceManager web UI, does it indicate any containers are running? -Sandy > On May 19, 2014, at 11:43 PM, Jan Holmberg wrote: > > Hi, > I’m new to Spark and trying to test first Spark prog. I’m running SparkPi

Re: combinebykey throw classcastexception

2014-05-20 Thread Sean Owen
You asked off-list, and provided a more detailed example there: val random = new Random() val testdata = (1 to 1).map(_=>(random.nextInt(),random.nextInt())) sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]]( (instant:Int)=>{new ArrayBuffer[Int]()}, (bucket:ArrayB
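
For completeness, a hedged, runnable reconstruction of that example: (1 to 10000) stands in for the original range, which the archive truncated, and the createCombiner below keeps its first value (the snippet above discards it). Assumes the spark-shell's sc.

    import org.apache.spark.SparkContext._  // pair-RDD functions (auto-imported in the shell)
    import scala.collection.mutable.ArrayBuffer
    import scala.util.Random

    val random = new Random()
    val testdata = (1 to 10000).map(_ => (random.nextInt(10), random.nextInt()))
    val combined = sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
      (v: Int) => ArrayBuffer(v),                                  // createCombiner
      (buf: ArrayBuffer[Int], v: Int) => buf += v,                 // mergeValue
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2)   // mergeCombiners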