Re: Update on Pig on Spark initiative

2014-08-28 Thread Russell Jurney
This is really exciting! Thanks so much for this work, I think you've guaranteed Pig's continued vitality. On Wednesday, August 27, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM,

Re: Submitting multiple files pyspark

2014-08-28 Thread Andrew Or
Hi Cheng, You specify extra python files through --py-files. For example: bin/spark-submit [your other options] --py-files helper.py main_app.py -Andrew 2014-08-27 22:58 GMT-07:00 Chengi Liu chengi.liu...@gmail.com: Hi, I have two files.. main_app.py and helper.py main_app.py calls

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-08-28 Thread durin
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically I'm creating a dictionary from text, i.e. giving each word that occurs more than n times in all texts a unique identifier. The essential part of the code looks like this: var texts = ctx.sql(SELECT text FROM
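
For reference, a minimal spark-shell sketch of one way to build such a dictionary without collecting it to the driver; the input path, the threshold n, and the whitespace tokenizer are illustrative stand-ins for the poster's SQL-backed setup:

  val texts = sc.textFile("hdfs:///path/to/texts")   // stand-in for the SQL-backed RDD of documents
  val n = 10                                          // keep words occurring more than n times

  // Count word occurrences, keep the frequent ones, and give each a unique id.
  val wordIds = texts
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count > n }
    .keys
    .zipWithUniqueId()                                // RDD[(String, Long)]

  wordIds.take(10).foreach(println)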

Using unshaded akka in Spark driver

2014-08-28 Thread Aniket Bhatnagar
I am building (yet another) job server for Spark using Play! framework and it seems like Play's akka dependency conflicts with Spark's shaded akka dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded) but I haven't been able to figure out how to exclude com.typesafe.akka
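
One possible direction, sketched as a build.sbt fragment under the assumption that Spark's shaded akka is published under the org.spark-project.akka organization (version numbers are illustrative, not a recommendation):

  libraryDependencies ++= Seq(
    // Pull in Spark but drop its shaded akka artifacts, so Play's own akka 2.2.x is the only one on the classpath.
    ("org.apache.spark" %% "spark-core" % "1.0.2")
      .excludeAll(ExclusionRule(organization = "org.spark-project.akka")),
    "com.typesafe.akka" %% "akka-actor" % "2.2.3"
  )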

Re: Visualizing stage task dependency graph

2014-08-28 Thread Phuoc Do
I'm working on this patch to visualize stages: https://github.com/apache/spark/pull/2077 Phuoc Do On Mon, Aug 4, 2014 at 10:12 PM, Zongheng Yang zonghen...@gmail.com wrote: I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see talk at Spark

Key-Value Operations

2014-08-28 Thread Deep Pradhan
Hi, I have an RDD of key-value pairs. Now I want to find the key for which the value has the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in the value is the largest. Thank You

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
If you just want an arbitrary unique id attached to each record in a dstream (no ordering etc), then why not generate and attach a UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see an issue here. If rdd.id is 1000 then rdd.id *
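
A minimal sketch of that suggestion, assuming a DStream[String] named lines (the name is illustrative):

  import java.util.UUID

  // Attach a random UUID to every record; unique, but carries no ordering.
  val withIds = lines.map(record => (UUID.randomUUID().toString, record))
  withIds.print()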

sbt package assembly run spark examples

2014-08-28 Thread filipus
hi guys, can someone explain or give a stupid user like me a link where i can get the right usage of sbt and spark in order to run the examples as a stand-alone app? I got to the point of running the app by sbt run path-to-the-data but still get some errors, because i probably didn't tell the app

Re: sbt package assembly run spark examples

2014-08-28 Thread filipus
got it when I read the class reference https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkConf conf.setMaster("local[2]") sets the master to local with 2 threads, but I still get some warnings and the result (see below) is also not right i think ps: by the way ... first

Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
Hi all, What is the expression that I should use with the spark sql DSL if I need to retrieve data with a field in a given set? For example: I have the following schema case class Person(name: String, age: Int) And I need to do something like: personTable.where('name in Seq("foo", "bar")) ?
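
One hedged workaround, falling back to a plain SQL string rather than the Scala DSL, and assuming the Spark 1.0-era registerAsTable API plus a SQL parser that accepts IN (a HiveContext's hql() certainly does):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD

  val people = sc.parallelize(Seq(Person("foo", 30), Person("bar", 25), Person("baz", 40)))
  people.registerAsTable("people")

  // Membership test expressed as a SQL IN clause.
  sqlContext.sql("SELECT name, age FROM people WHERE name IN ('foo', 'bar')").collect().foreach(println)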

Re: Trying to run SparkSQL over Spark Streaming

2014-08-28 Thread praveshjain1991
Thanks for the reply. Sorry I could not ask more earlier. Trying to use a parquet file is not working at all. case class Rec(name:String,pv:Int) val sqlContext=new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD val

Re: How to join two PairRDD together?

2014-08-28 Thread Yanbo Liang
Maybe you can refer to the sliding method of RDD, but right now it's a private method in MLlib. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt Attached the dep.txt for your information. [WARNING] [WARNING] Some problems were encountered while building the effective settings [WARNING] Unrecognised tag: 'mirrors' (position: START_TAG

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I see 0.98.5 in dep.txt. You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt Attached the dep.txt for your

Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named For Hadoop 2 (HDP2, CDH5). I don't know if it is a dependency problem but I don't want to have

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread GADV
Not sure if this makes sense, but maybe it would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, and

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find

how to filter value in spark

2014-08-28 Thread marylucy
fileA=1 2 3 4, one number per line, saved in /sparktest/1/ fileB=3 4 5 6, one number per line, saved in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1)) a.filter(param={b.lookup(param._1).length>0}).map(_._1).foreach(println)
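
A lookup inside filter won't work here, because one RDD cannot be used inside another RDD's closure. A minimal sketch of a common alternative, joining the two keyed RDDs and keeping the keys (assumes the shell's sc):

  val a = sc.textFile("/sparktest/1/").map(x => (x, 1))
  val b = sc.textFile("/sparktest/2/").map(x => (x, 1))

  // Keys present in both inputs survive the join.
  a.join(b).map(_._1).foreach(println)   // expected: 3 and 4

  // Equivalent, without keying at all:
  // sc.textFile("/sparktest/1/").intersection(sc.textFile("/sparktest/2/")).foreach(println)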

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using

Re: how to filter value in spark

2014-08-28 Thread Matthew Farrellee
On 08/28/2014 07:20 AM, marylucy wrote: fileA=1 2 3 4, one number per line, saved in /sparktest/1/ fileB=3 4 5 6, one number per line, saved in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1))

how to specify columns in groupby

2014-08-28 Thread MEETHU MATHEW
Hi all, I have an RDD  which has values in the  format id,date,cost. I want to group the elements based on the id and date columns and get the sum of the cost  for each group. Can somebody tell me how to do this?   Thanks Regards, Meethu M

Re: How to join two PairRDD together?

2014-08-28 Thread Sean Owen
It sounds like you are adding the same key to every element, and joining, in order to accomplish a full cartesian join? I can imagine doing it that way would blow up somewhere. There is a cartesian() method to do this maybe more efficiently. However if your data set is large, this sort of
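
For illustration, a tiny cartesian() sketch (data and names are made up):

  val left  = sc.parallelize(Seq(1, 2, 3))
  val right = sc.parallelize(Seq("a", "b"))

  // Full cross product without first keying every element to the same key.
  val pairs = left.cartesian(right)   // RDD[(Int, String)] with 6 elements
  pairs.collect().foreach(println)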

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Sean Owen
0.98.2 is not an HBase version, but 0.98.2-hadoop2 is: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22 On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I need to use Spark with HBase 0.98 and tried

Re: how to specify columns in groupby

2014-08-28 Thread Yanbo Liang
For your reference: val d1 = textFile.map(line => { val fileds = line.split(",") ((fileds(0),fileds(1)), fileds(2).toDouble) }) val d2 = d1.reduceByKey(_+_) d2.foreach(println) 2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have an RDD

Re: Key-Value Operations

2014-08-28 Thread Sean Owen
If you mean your values are all a Seq or similar already, then you just take the top 1 ordered by the size of the value: rdd.top(1)(Ordering.by(_._2.size)) On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have a RDD of key-value pairs. Now I want to find
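
A small usage sketch of that one-liner, with illustrative data and assuming the values are already collections:

  val rdd = sc.parallelize(Seq(
    ("a", Seq(1, 2)),
    ("b", Seq(1, 2, 3, 4)),
    ("c", Seq(7))
  ))

  // Key whose value holds the most elements.
  val (bestKey, bestValues) = rdd.top(1)(Ordering.by(_._2.size)).head
  println(s"$bestKey has ${bestValues.size} elements")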

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
How can I set HADOOP_HOME if I am running Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows; I added the Java path in the PATH variable. Vineet From: Sean Owen

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You

SPARK on YARN, containers fails

2014-08-28 Thread Control
Hi there, I'm trying to run JavaSparkPi example on YARN with master = yarn-client but I have a problem. It runs smoothly with submitting application, first container for Application Master works too. When job is starting and there are some tasks to do I'm getting this warning on console (I'm

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread Yana Kadiyska
Can you clarify the scenario: val ssc = new StreamingContext(sparkConf, Seconds(10)) ssc.checkpoint(checkpointDirectory) val stream = KafkaUtils.createStream(...) val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L)) val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I didn't see that problem. Did you run this command ? mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Here is what I got: TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def count_partitions(id, iterator): c = sum(1 for _ in iterator)

Re: Graphx: undirected graph support

2014-08-28 Thread FokkoDriesprong
A bit in analogy with a linked list and a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges in place of the undirected edge. -- View this message in context:
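
A minimal GraphX sketch of that idea, mirroring every edge so traversals work in both directions (vertex ids and attributes are placeholders):

  import org.apache.spark.graphx.{Edge, Graph}

  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

  // Add a reversed copy of every edge; memory roughly doubles, as noted above.
  val symmetric = edges.union(edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
  val graph = Graph.fromEdges(symmetric, 0)

  println(graph.edges.count())   // 4 directed edges standing in for 2 undirected ones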

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Thanks Sean. Looks like there is a workaround as per the JIRA https://issues.apache.org/jira/browse/SPARK-2356 . http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote: Yes, but I

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file, below where the other ‘import …’ lines are written? System.setProperty("hadoop.home.dir", "d:\\winutil\\") Thank you Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent:

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file below
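
Concretely, a sketch of what "as early as possible" can look like; the winutil path is just the poster's own example:

  object SimpleApp {
    def main(args: Array[String]) {
      // Set before any Spark or Hadoop classes get a chance to read it.
      System.setProperty("hadoop.home.dir", "d:\\winutil\\")

      val conf = new org.apache.spark.SparkConf()
        .setAppName("Simple Project")
        .setMaster("local[2]")
      val sc = new org.apache.spark.SparkContext(conf)
      // ... rest of the job ...
      sc.stop()
    }
  }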

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The following error is given when I try to add this line: [info] Set current project to Simple Project (in build file:/C:/Users/D062844/Desktop/Hand sOnSpark/Install/spark-1.0.2-bin-hadoop2/) [info] Compiling 1 Scala source to C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0

Change delimiter when collecting SchemaRDD

2014-08-28 Thread yadid ayzenberg
Hi All, Is there any way to change the delimiter from being a comma ? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid

Print to spark log

2014-08-28 Thread jamborta
Hi all, Just wondering if there is a way to use logging to print to spark logs some additional info (similar to debug in scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User
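
One low-tech option, sketched with plain log4j (which Spark already ships with); class and message names are illustrative:

  import org.apache.log4j.Logger

  object MyJob {
    val log = Logger.getLogger(getClass.getName)

    def run(sc: org.apache.spark.SparkContext): Unit = {
      log.info("about to count")                 // routed wherever log4j.properties sends it
      val n = sc.parallelize(1 to 100).count()
      log.info(s"count = $n")
    }
  }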

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda bindings in scala. Do you prefer using java or scala? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close

SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
I attached patch v5 which corresponds to the pull request. Please try again. On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |---

New SparkR mailing list, JIRA

2014-08-28 Thread Shivaram Venkataraman
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list, issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all

Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Verma, Rishi (398J)
Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform
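
A minimal sketch of the usual pattern (foreachRDD plus jsonRDD), assuming a SQLContext and a DStream[String] of JSON records already built from Kafka; the table and function names are illustrative, and empty batches may need to be skipped in practice:

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.streaming.dstream.DStream

  def registerAndQuery(sqlContext: SQLContext, jsonLines: DStream[String]): Unit = {
    jsonLines.foreachRDD { rdd =>
      // Infer a schema from this batch's JSON strings and expose it as a table.
      val events = sqlContext.jsonRDD(rdd)
      events.registerAsTable("events")
      sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
    }
  }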

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
Hi Yana, The fact is that the DB writing is happening on the node level and not on Spark level. One of the benefits of distributed computing nature of Spark is enabling IO distribution as well. For example, is much faster to have the nodes to write to Cassandra instead of having them all

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-28 Thread DNoteboom
Sorry for the extremely late reply. It turns out that the same error occurred when running on yarn. However, I recently updated my project to depend on cdh5 and the issue I was having disappeared and I am no longer setting the userClassPathFirst to true. -- View this message in context:

Re: Low Level Kafka Consumer for Spark

2014-08-28 Thread Chris Fregly
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there are actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project

org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got Class org.apache.hadoop.io.compress.SnappyCodec not found Can you please advise how to enable snappy in Spark? Regards Arthur scala

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Wei Tan
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation on the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T.

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded initialized native-zlib library Native library

org.apache.spark.examples.xxx

2014-08-28 Thread filipus
hey guys, i'm still trying to get used to compiling and running the example code. why does the run_example script submit the class with an org.apache.spark.examples in front of the class itself? probably a stupid question but i would be glad if someone explained it. by the way.. how was the

Re: Print to spark log

2014-08-28 Thread Control
I'm not sure if this is the case, but basic monitoring is described here: https://spark.apache.org/docs/latest/monitoring.html If it comes to something more sophisticated I was for example able to save some messages into local logs and view them in YARN UI via http by editing spark source code

What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Steve Lewis
In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - The reducer might also take a single key and emit multiple values. I don't think that functions like flatMap and reduceByKey will work, or are there tricks I am not aware of?

Re: Spark webUI - application details page

2014-08-28 Thread SK
I was able to recently solve this problem for standalone mode. For this mode, I did not use a history server. Instead, I set spark.eventLog.dir (in conf/spark-defaults.conf) to a directory in hdfs (basically this directory should be in a place that is writable by the master and accessible globally

Re: OutofMemoryError when generating output

2014-08-28 Thread SK
Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). val x = sc.textFile(baseFile).map { line => val

Re: Print to spark log

2014-08-28 Thread jamborta
thanks for the reply. I was looking for something for the case when it's running outside of the spark framework. if I declare a sparkcontext or an rdd, could that print some messages in the log? The problem I have is that if I print something from the scala object that runs the spark app, it does

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Soumitra Kumar
Yes, that is an option. I started with a function of batch time and index to generate an id as a long. This may be faster than generating a UUID, with the added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar
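
A sketch of that batch-time-plus-index idea, assuming a DStream[String] named stream and that no batch ever holds more records than the multiplier allows (names and the constant are illustrative):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.Time

  // Combine the batch time with each element's index in the batch into a single Long id.
  val withIds = stream.transform { (rdd: RDD[String], time: Time) =>
    rdd.zipWithIndex().map { case (record, idx) =>
      (time.milliseconds * 1000000L + idx, record)
    }
  }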

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
But then if you want to generate ids that are unique across ALL the records that you are going to see in a stream (which can be potentially infinite), then you definitely need a number space larger than long :) TD On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:

Q on downloading spark for standalone cluster

2014-08-28 Thread Sanjeev Sagar
Hello there, I have a basic question on the download: which option do I need to download for a standalone cluster? I have a private cluster of three machines on CentOS. When I click on download it shows me the following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014

Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Tathagata Das
Try using local[n] with n > 1, instead of local. Since receivers take up 1 slot, and local is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) rishi.ve...@jpl.nasa.gov wrote: Hi Folks, I’d

Re: repartitioning an RDD yielding imbalance

2014-08-28 Thread Davies Liu
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar rokros...@gmail.com wrote: I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart

Re: What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Sean Owen
To emulate a Mapper, flatMap() is exactly what you want. Since it flattens, it means you return an Iterable of values instead of 1 value. That can be a Collection containing many values, or 1, or 0. For a reducer, to really reproduce what a Reducer does in Java, I think you will need groupByKey()
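
A small Scala sketch of both halves, hedged as one way to mimic the Hadoop shapes (data is illustrative):

  val lines = sc.parallelize(Seq("a b", "c"))

  // "Mapper": each input may emit zero, one, or many key-value pairs.
  val mapped = lines.flatMap(line => line.split(" ").map(word => (word, 1)))

  // "Reducer" that can emit several values per key: group first, then flatMap over the groups.
  val reduced = mapped.groupByKey().flatMap { case (key, values) =>
    values.map(v => (key, v * 10))   // e.g. one output per input value
  }

  reduced.collect().foreach(println)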

RE: Q on downloading spark for standalone cluster

2014-08-28 Thread Sagar, Sanjeev
Hello Daniel, If you’re not using Hadoop then why do you want to grab the Hadoop package? CDH5 will download all the Hadoop packages and cloudera manager too. Just curious, what happens if you start spark on an EC2 cluster, what does it choose for the data store as default? -Sanjeev From: Daniel Siegmann

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tathagata Das
If you are repartitioning to 8 partitions, and your nodes happen to have at least 4 cores each, it's possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want

Re: Kinesis receiver spark streaming partition

2014-08-28 Thread Chris Fregly
great question, wei. this is very important to understand from a performance perspective. and this extends beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, If I change my etc/hadoop/core-site.xml from <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec,

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
TD - Apologies, didn't realize I was replying to you instead of the list. What does numPartitions refer to when calling createStream? I read an earlier thread that seemed to suggest that numPartitions translates to partitions created on the Spark side?

Re: Change delimiter when collecting SchemaRDD

2014-08-28 Thread Michael Armbrust
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql("SELECT * FROM src").map(_.mkString("|")).collect() On Thu, Aug

Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I fixed the issue by copying libsnappy.so to the Java JRE. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, If I change my etc/hadoop/core-site.xml from <property> <name>io.compression.codecs</name> <value>

The concurrent model of spark job/stage/task

2014-08-28 Thread 35597...@qq.com
hi, guys I am trying to understand how spark works with its concurrency model. I read the following from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from

Problem using accessing HiveContext

2014-08-28 Thread Zitser, Igor
Hi, While using HiveContext: if the hive table is created as test_datatypes(testbigint bigint, ss bigint), the select below works fine. For create table test_datatypes(testbigint bigint, testdec decimal(5,2)) scala> val dataTypes=hiveContext.hql("select * from test_datatypes") 14/08/28 21:18:44 INFO

FW: Reference Accounts Large Node Deployments

2014-08-28 Thread Steve Nunez
Anyone? No customers using streaming at scale? From: Steve Nunez snu...@hortonworks.com Date: Wednesday, August 27, 2014 at 9:08 To: user@spark.apache.org user@spark.apache.org Subject: Reference Accounts Large Node Deployments All, Does anyone have specific references to customers,

Odd saveAsSequenceFile bug

2014-08-28 Thread Shay Seng
Hey Sparkies... I have an odd bug. I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on

RE: Sorting Reduced/Groupd Values without Explicit Sorting

2014-08-28 Thread fluke777
Hi list, Any change on this one? I think I have seen a lot of work being done on this lately but I am unable to forge a working solution from jira tickets. Any example would be highly appreciated. Tomas -- View this message in context:

Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: SK

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Each executor reserves some memory for storing RDDs in memory, and some for executor operations like shuffling. The number you see is memory reserved for storing RDDs, and defaults to about 0.6 of the total (spark.storage.memoryFraction). On Fri, Aug 29, 2014 at 2:32 AM, SK skrishna...@gmail.com
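
If that split needs tuning, the fraction is configurable; a hedged SparkConf sketch (the 0.5 value is only an example):

  import org.apache.spark.SparkConf

  // Reserve half of the executor heap for cached RDDs instead of the ~0.6 default.
  val conf = new SparkConf()
    .setAppName("tuned-app")
    .set("spark.storage.memoryFraction", "0.5")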

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: SK skrishna...@gmail.com To:

Spark / Thrift / ODBC connectivity

2014-08-28 Thread Denny Lee
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running. The quick questions were whether I should be able to use the Thrift service to connect to SparkSQL generated tables and/or Hive tables? As well, by any chance do we have any documents that

How to debug this error?

2014-08-28 Thread Gary Zhao
Hello, I'm new to Spark and playing around, but saw the following error. Could anyone help with it? Thanks Gary scala> c res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at console:23 scala> group res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] =

Spark Hive max key length is 767 bytes

2014-08-28 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1. I have already set the hive mysql database to latin1; mysql: alter database hive character set latin1; Spark: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveContext.hql("create table

Re: Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
It did. It failed and was respawned 4 times. In this case, the too many open files error is a sign that you need to increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote: Appeared after

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Click the Storage tab. You have some (tiny) RDD persisted in memory. On Fri, Aug 29, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote: Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to

RE: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread linkpatrickliu
Hi, You can set the settings in conf/spark-env.sh like this: export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/ SPARK_JAVA_OPTS+=-Djava.library.path=$SPARK_LIBRARY_PATH SPARK_JAVA_OPTS+=-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec