Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Verma, Rishi (398J)
Hi Tathagata, Thanks very much for this tip and for the quick reply. It absolutely did the trick! One small note for others: make sure your command-line invocation of the spark-submit script has the argument “--master” also setting “local[n]” with n > 1. Thanks again, Rishi On Aug 28, 2014, a

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
I set partitions to 64: // kInMsg.repartition(64) val outdata = kInMsg.map(x=>normalizeLog(x._2,configMap)) // Still see all activity only on the two nodes that seem to be receiving from Kafka. On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith wrote: > TD - Apologies, didn't realize I was replying t

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
I upped the ulimit to 128k files on all nodes. Job crashed again with "DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275". Couldn't get the logs because I killed the job and it looks like YARN wiped the container logs (not sure why it wipes the logs under /var/log/hadoop-yarn/container).

Re: Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
Ok, but what if I have a long list? Do I need to hard-code every element of my list like this, or is there a function that translates a list into a tuple? On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust wrote: > You don't need the Seq, as in is a variadic function. > > personTable.where('name i
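Since "in" is variadic, Scala's ": _*" syntax expands a Seq into the arguments of a variadic call, so the list does not have to be hard-coded. A minimal sketch of just that language mechanism, using a hypothetical helper rather than the Spark SQL DSL itself:

    // any Scala variadic function accepts a Seq via ": _*"
    def inSet(values: String*): Set[String] = values.toSet
    val names = Seq("foo", "bar", "baz")
    inSet(names: _*)   // equivalent to inSet("foo", "bar", "baz")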

Re: How to get prerelease thriftserver working?

2014-08-28 Thread Matt Chu
Hey Michael, Cheng, Thanks for the replies. Sadly I can't remember the specific error so I'm going to chalk it up to user error, especially since others on the list have not had a problem. @michael By the way, was at the Spark 1.1 meetup yesterday. Great event, very informative, cheers and keep d

Re: RE: The concurrent model of spark job/stage/task

2014-08-28 Thread 35597...@qq.com
hi, dear Thanks for the response. Some comments below. and yes, I am using spark on yarn. 1. The release doc of spark says multi jobs can be submitted in one application if the jobs (actions) are submitted by different threads. I wrote some java thread code in the driver, one action in each thread, a

u'' notation with pyspark output data

2014-08-28 Thread Oleg Ruchovets
Hi , I am working with pyspark and doing simple aggregation def doSplit(x): y = x.split(',') if(len(y)==3): return y[0],(y[1],y[2]) counts = lines.map(doSplit).groupByKey() output = counts.collect() Iterating over output I got such format of the data u'1385501280'

Re: Merging two Spark SQL tables?

2014-08-28 Thread Evan Chan
Filed SPARK-3295. On Mon, Aug 25, 2014 at 12:49 PM, Michael Armbrust wrote: >> So I tried the above (why doesn't union or ++ have the same behavior >> btw?) > > > I don't think there is a good reason for this. I'd open a JIRA. > >> >> and it works, but is slow because the original Rdds are not >

RE: problem connection to hdfs on localhost from spark-shell

2014-08-28 Thread linkpatrickliu
Change your conf/spark-env.sh: export HADOOP_CONF_DIR="/etc/hadoop/conf" export YARN_CONF_DIR="/etc/hadoop/conf" Date: Thu, 28 Aug 2014 16:19:05 -0700 From: ml-node+s1001560n13074...@n3.nabble.com To: linkpatrick...@live.com Subject: problem connection to hdfs on localhost from spark-shell

RE: The concurrent model of spark job/stage/task

2014-08-28 Thread linkpatrickliu
Hi, Please see the answers following each question. If there's any mistake, please let me know. Thanks! I am not sure which mode you are running. So I will assume you are using spark-submit script to submit spark applications to spark cluster (spark-standalone or Yarn) 1. how to start 2 or more

RE: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread linkpatrickliu
Hi, You can set the settings in conf/spark-env.sh like this: export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/ SPARK_JAVA_OPTS+="-Djava.library.path=$SPARK_LIBRARY_PATH " SPARK_JAVA_OPTS+="-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec " SPARK_JAVA_OPTS+="-Dio.compress

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Click the Storage tab. You have some (tiny) RDD persisted in memory. On Fri, Aug 29, 2014 at 5:58 AM, SK wrote: > Hi, > Thanks for the responses. I understand that the second values in the Memory > Used column for the executors add up to 95.5 GB and the first values add up > to 17.3 KB. If 95.5

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
It did. It failed and was respawned 4 times. In this case, the "too many open files" error is a sign that you need to increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith wrote: > Appeared after running for a whi

Re: Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For non

Spark Hive max key length is 767 bytes

2014-08-28 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1 I have already set the hive mysql database to latin1; mysql: alter database hive character set latin1; Spark: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveContext.hql("create table tes

How to debug this error?

2014-08-28 Thread Gary Zhao
Hello I'm new to Spark and playing around, but saw the following error. Could anyone help with it? Thanks Gary scala> c res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at :23 scala> group res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] = MappedValuesRDD[5] a

Spark / Thrift / ODBC connectivity

2014-08-28 Thread Denny Lee
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running.  The quick questions were whether I should be able to use the Thrift service to connect to SparkSQL generated tables and/or Hive tables?   As well, by any chance do we have any documents that point

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: "SK" To: u...@spark.incubator.ap

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Each executor reserves some memory for storing RDDs in memory, and some for executor operations like shuffling. The number you see is memory reserved for storing RDDs, and defaults to about 0.6 of the total (spark.storage.memoryFraction). On Fri, Aug 29, 2014 at 2:32 AM, SK wrote: > Hi, > > I am
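In concrete numbers, 16 GB x 0.6 (spark.storage.memoryFraction) x the ~0.9 safety fraction applied on top of it comes to roughly 8.6 GB per executor, which matches the UI. A minimal sketch of adjusting the fraction through SparkConf; the values are illustrative and the master is assumed to come from spark-submit:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("MemoryTuningDemo")                  // illustrative name
      .set("spark.executor.memory", "16g")
      .set("spark.storage.memoryFraction", "0.6")      // share of the executor heap reserved for cached RDDs
    val sc = new org.apache.spark.SparkContext(conf)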

Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: "SK"

RE: Sorting Reduced/Groupd Values without Explicit Sorting

2014-08-28 Thread fluke777
Hi list, Any change on this one? I think I have seen a lot of work being done on this lately but I am unable to forge a working solution from jira tickets. Any example would be highly appreciated. Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sorti

Odd saveAsSequenceFile bug

2014-08-28 Thread Shay Seng
Hey Sparkies... I have an odd "bug". I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on

FW: Reference Accounts & Large Node Deployments

2014-08-28 Thread Steve Nunez
Anyone? No customers using streaming at scale? From: Steve Nunez Date: Wednesday, August 27, 2014 at 9:08 To: "user@spark.apache.org" Subject: Reference Accounts & Large Node Deployments > All, > > Does anyone have specific references to customers, use cases and large-scale > deployments

Problem using accessing HiveContext

2014-08-28 Thread Zitser, Igor
Hi, While using HiveContext. If hive table created as test_datatypes(testbigint bigint, ss bigint ) select below works fine. For "create table test_datatypes(testbigint bigint, testdec decimal(5,2) )" scala> val dataTypes=hiveContext.hql("select * from test_datatypes") 14/08/28 21:18:44 INFO p

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I fixed the issue by copying libsnappy.so to the Java JRE. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com wrote: > Hi, > > If I change my etc/hadoop/core-site.xml > > from > > io.compression.codecs > > org.apache.hadoop.io.compress.SnappyCodec, >

The concurrent model of spark job/stage/task

2014-08-28 Thread 35597...@qq.com
hi, guys I am trying to understand how spark work on the concurrent model. I read below from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote " Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from sep

Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for differe

Re: Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Michael Armbrust
You don't need the Seq, as in is a variadic function. personTable.where('name in ("foo", "bar")) On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa wrote: > Hi all, > > What is the expression that I should use with spark sql DSL if I need to > retreive > data with a field in a given set. > Fo

Re: Change delimiter when collecting SchemaRDD

2014-08-28 Thread Michael Armbrust
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql("SELECT * FROM src").map(_.mkString("|")).collect() On Thu, A

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
TD - Apologies, didn't realize I was replying to you instead of the list. What does "numPartitions" refer to when calling createStream? I read an earlier thread that seemed to suggest that numPartitions translates to partitions created on the Spark side? http://mail-archives.apache.org/mod_mbox/in

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a diff

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, If I change my etc/hadoop/core-site.xml from io.compression.codecs org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec to

Re: Kinesis receiver & spark streaming partition

2014-08-28 Thread Chris Fregly
great question, wei. this is very important to understand from a performance perspective. and this extends beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tathagata Das
If you are repartitioning to 8 partitions, and your nodes happen to have at least 4 cores each, it's possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want t

problem connection to hdfs on localhost from spark-shell

2014-08-28 Thread Bharath Bhushan
I have HDFS servers running locally and commands like "hadoop dfs -ls /" all run fine. From spark-shell I do this: val lines = sc.textFile("hdfs:///test") and I get this error message. java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "localhost.locald

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
Do you see this error right in the beginning or after running for some time? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause

Re: transforming a Map object to RDD

2014-08-28 Thread Sean Owen
val map = Map("foo" -> 1, "bar" -> 2, "baz" -> 3) val rdd = sc.parallelize(map.toSeq) rdd is a an RDD[(String,Int)] and you can do what you like from there. On Thu, Aug 28, 2014 at 11:56 PM, SK wrote: > Hi, > > How do I convert a Map object to an RDD so that I can use the > saveAsTextFile() oper

transforming a Map object to RDD

2014-08-28 Thread SK
Hi, How do I convert a Map object to an RDD so that I can use the saveAsTextFile() operation to output the Map object? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/transforming-a-Map-object-to-RDD-tp13071.html Sent from the Apache Spark User Li

Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread "Thread-59" 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGSch

DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling "createStream" or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use mo

RE: Q on downloading spark for standalone cluster

2014-08-28 Thread Sagar, Sanjeev
Hello Daniel, If you’re not using Hadoop then why do you want to grab the Hadoop package? CDH5 will download all the Hadoop packages and Cloudera Manager too. Just curious: what happens if you start Spark on an EC2 cluster, what does it choose as the default data store? -Sanjeev From: Daniel Siegmann [

Re: What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Sean Owen
To emulate a Mapper, flatMap() is exactly what you want. Since it flattens, it means you return an Iterable of values instead of 1 value. That can be a Collection containing many values, or 1, or 0. For a reducer, to really reproduce what a Reducer does in Java, I think you will need groupByKey()
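A minimal word-count-style sketch of both points, assuming lines is an RDD[String]:

    // "mapper": may emit zero, one, or many pairs per input element
    val pairs = lines.flatMap { line =>
      line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1))
    }
    // "reducer": reduceByKey when the combination is associative,
    // groupByKey when you really need all values for a key at once
    val counts  = pairs.reduceByKey(_ + _)
    val grouped = pairs.groupByKey().mapValues(_.toList)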

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart yo
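A minimal sketch of the read-back-and-broadcast idea; the HDFS path and record layout are assumptions:

    val aggregates = sc.textFile("hdfs:///tmp/daily-aggregates")   // hypothetical location of the batch output
      .map { line => val Array(k, v) = line.split(","); (k, v.toDouble) }
      .collectAsMap()                                              // assumes the result is small enough for the driver
    val aggBc = sc.broadcast(aggregates)
    // inside the streaming computation, read values with aggBc.value.get(key)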

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev

Re: repartitioning an RDD yielding imbalance

2014-08-28 Thread Davies Liu
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar wrote: > I've got an RDD where each element is a long string (a whole document). I'm > using pyspark so some of the handy partition-handling functions aren't > available, and I count the number of elements in each partition with: > > def count_partitio

Where to save intermediate results?

2014-08-28 Thread huylv
Hi, I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark. Where are the pre-calculated results supposed to be saved? I mean, after fini

Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Tathagata Das
Try using "local[n]" with n > 1, instead of local. Since receivers take up 1 slot, and "local" is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) < rishi.ve...@jpl.nasa.gov> wrote: > Hi Folks,

Q on downloading spark for standalone cluster

2014-08-28 Thread Sanjeev Sagar
Hello there, I've a basic question on the download: which option do I need to download for a standalone cluster? I've a private cluster of three machines on CentOS. When I click on download it shows me the following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (re

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
But then if you want to generate ids that are unique across ALL the records that you are going to see in a stream (which can be potentially infinite), then you definitely need a number space larger than long :) TD On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar wrote: > Yes, that is an option

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Soumitra Kumar
Yes, that is an option. I started with a function of batch time and index to generate the id as a long. This may be faster than generating a UUID, with the added benefit of sorting based on time. - Original Message - From: "Tathagata Das" To: "Soumitra Kumar" Cc: "Xiangrui Meng" , user@spark.apac
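A sketch of that kind of id function; the 20-bit split is an arbitrary illustrative choice and assumes fewer than ~1M records per batch:

    // pack the batch time (ms) and the per-batch index into one Long;
    // ids remain sortable by time because the high bits carry the timestamp
    def makeId(batchTimeMs: Long, index: Long): Long =
      (batchTimeMs << 20) | (index & 0xFFFFFL)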

Re: Print to spark log

2014-08-28 Thread jamborta
thanks for the reply. I was looking for something for the case when it's running outside of the spark framework. if I declare a sparkcontext or an rdd, could that print some messages in the log? The problem I have is that if I print something from the scala object that runs the spark app, it does
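A minimal sketch of writing to Spark's log4j logs from driver code; the object and messages are illustrative. Note that log lines emitted inside RDD closures end up in each executor's stderr rather than the driver log:

    import org.apache.log4j.Logger

    object MyApp {                                   // hypothetical driver object
      lazy val log = Logger.getLogger(getClass.getName)
      def main(args: Array[String]): Unit = {
        log.info("starting aggregation")
        // ... create the SparkContext and run actions here ...
        log.info("finished")
      }
    }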

Re: OutofMemoryError when generating output

2014-08-28 Thread SK
Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). val x = sc.textFile(baseFile)).map { line => val field

Re: Spark webUI - application details page

2014-08-28 Thread SK
I was able to recently solve this problem for standalone mode. For this mode, I did not use a history server. Instead, I set spark.eventLog.dir (in conf/spark-defaults.conf) to a directory in hdfs (basically this directory should be in a place that is writable by the master and accessible globally
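For reference, the same event-log settings expressed through SparkConf; the HDFS directory is an assumption and must be writable by the application and readable wherever the history is rendered:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///user/spark/eventlogs")   // hypothetical directory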

What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Steve Lewis
In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - the reducer might also take a single key and emit multiple values. I don't think that functions like flatMap and reduceByKey will work, or are there tricks I am not aware of

Re: Print to spark log

2014-08-28 Thread Control
I'm not sure if this is the case, but basic monitoring is described here: https://spark.apache.org/docs/latest/monitoring.html If it comes to something more sophisticated I was for example able to save some messages into local logs and view them in YARN UI via http by editing spark source code (use

org.apache.spark.examples.xxx

2014-08-28 Thread filipus
hey guys, i'm still trying to get used to compiling and running the example code. why does the run_example script submit the class with "org.apache.spark.examples" in front of the class itself? probably a stupid question but i would be glad if someone of you explains. by the way.. how was the "spark...example..

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library Native library checki

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Wei Tan
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation on the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J.

org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got "Class org.apache.hadoop.io.compress.SnappyCodec not found" Can you please advise how to enable snappy in Spark? Regards Arthur scala> i

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project

Re: Low Level Kafka Consumer for Spark

2014-08-28 Thread Chris Fregly
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there are actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by subm
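A sketch of one common way to scale the receiving side with the Kafka receiver in this Spark version: create several receivers, union them, and repartition before the heavy processing. The ssc, zkQuorum, groupId and topicMap names are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }
    val unified = ssc.union(kafkaStreams).repartition(16)   // spread work beyond the receiver nodes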

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-28 Thread DNoteboom
Sorry for the extremely late reply. It turns out that the same error occurred when running on yarn. However, I recently updated my project to depend on cdh5 and the issue I was having disappeared and I am no longer setting the userClassPathFirst to true. -- View this message in context: http:/

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
Hi Yana, The fact is that the DB writing is happening on the node level and not on the Spark level. One of the benefits of the distributed computing nature of Spark is enabling IO distribution as well. For example, it is much faster to have the nodes write to Cassandra instead of having them all collected
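A generic sketch of that node-level write pattern; the connection helpers are hypothetical placeholders rather than a specific Cassandra API:

    rdd.foreachPartition { records =>
      val session = openDbSession()                       // hypothetical: one connection per partition, opened on the worker
      try records.foreach(r => writeRecord(session, r))   // hypothetical write helper
      finally session.close()
    }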

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for appli

Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Verma, Rishi (398J)
Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform
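For completeness, a minimal sketch of the per-batch approach with the 1.0.x API; the stream, payload extraction and table name are illustrative:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    kafkaStream.map(_._2).foreachRDD { rdd =>             // _._2 is the JSON payload of each Kafka message
      val schemaRdd = sqlContext.jsonRDD(rdd)              // infer a schema from the JSON strings
      schemaRdd.registerAsTable("events")                  // "events" is an assumed table name
      sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
    }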

New SparkR mailing list, JIRA

2014-08-28 Thread Shivaram Venkataraman
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list, issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all ex

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- docs/bui

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
I attached patch v5 which corresponds to the pull request. Please try again. On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I have just tried to apply the patch of SPARK-1297: > https://issues.apache.org/jira/browse/SPARK-1297 > > There ar

SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got "Hunk #1 FAILED at 45" Can you please advise how to fix it in order

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda binding in scala. do you prefer using java or scala? On Aug 27, 2014 2:05 PM, "Frank van Lankvelt" wrote: > you could try looking at ScalaCL[1], it's targeting OpenCL rather than > CUDA, but that might be close enough? > > cheers, Frank >

Print to spark log

2014-08-28 Thread jamborta
Hi all, Just wondering if there is a way to use logging to print to spark logs some additional info (similar to debug in scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User List

Change delimiter when collecting SchemaRDD

2014-08-28 Thread yadid ayzenberg
Hi All, Is there any way to change the delimiter from being a comma ? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The following error is given when I try to add this line: [info] Set current project to Simple Project (in build file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/) [info] Compiling 1 Scala source to C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0.2-bin-hadoop

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet wrote: > Thank you Sean and Guru for giving the information. Btw I have to put this > line, but where should I add it? In my scala file below other ‘import …’ > lin

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file, below where the other ‘import …’ lines are written? System.setProperty("hadoop.home.dir", "d:\\winutil\\") Thank you Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donner

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Thanks Sean. Looks like there is a workaround as per the JIRA https://issues.apache.org/jira/browse/SPARK-2356 . http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen wrote: > Yes, but I think at the moment t

Re: Graphx: undirected graph support

2014-08-28 Thread FokkoDriesprong
A bit like the analogy between a linked list and a doubly linked list. It might introduce overhead in terms of memory usage, but you could use two directed edges to substitute for the undirected edge. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-undirected-graph-
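A sketch of that idea with the GraphX API, assuming vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], and a symmetric edge attribute:

    import org.apache.spark.graphx._

    // emulate an undirected edge by also materialising its reverse
    val bothDirections = edges.union(edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
    val graph = Graph(vertices, bothDirections)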

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def count_partitions(id, iterator): c = sum(1 for _ in iterator) yiel

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I didn't see that problem. Did you run this command ? mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Here is what I got: TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /Users/tyu/spark-1.0.2/sbi

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread Yana Kadiyska
Can you clarify the scenario: val ssc = new StreamingContext(sparkConf, Seconds(10)) ssc.checkpoint(checkpointDirectory) val stream = KafkaUtils.createStream(...) val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L)) val wordDstream= wordCounts.updateStateByKey[Int](updateFunc) wo

SPARK on YARN, containers fails

2014-08-28 Thread Control
Hi there, I'm trying to run JavaSparkPi example on YARN with master = yarn-client but I have a problem. It runs smoothly with submitting application, first container for Application Master works too. When job is starting and there are some tasks to do I'm getting this warning on console (I'm us

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani wrote: > Can you copy the exact spark-submit command that you are running? > > You should be able to r

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
How can I set HADOOP_HOME if I am running the Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows, I added Java path in the PATH variable. Vineet From: Sean Owen [mailt

Re: Key-Value Operations

2014-08-28 Thread Sean Owen
If you mean your values are all a Seq or similar already, then you just take the top 1 ordered by the size of the value: rdd.top(1)(Ordering.by(_._2.size)) On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan wrote: > Hi, > I have a RDD of key-value pairs. Now I want to find the "key" for which > the

Re: how to specify columns in groupby

2014-08-28 Thread Yanbo Liang
For your reference: val d1 = textFile.map(line => { val fileds = line.split(",") ((fileds(0),fileds(1)), fileds(2).toDouble) }) val d2 = d1.reduceByKey(_+_) d2.foreach(println) 2014-08-28 20:04 GMT+08:00 MEETHU MATHEW : > Hi all, > > I have an RDD which has values in t

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Sean Owen
"0.98.2" is not an HBase version, but "0.98.2-hadoop2" is: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22 On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I need to use Spark with HBase 0.98 an

Re: How to join two PairRDD together?

2014-08-28 Thread Sean Owen
It sounds like you are adding the same key to every element, and joining, in order to accomplish a full cartesian join? I can imagine doing it that way would blow up somewhere. There is a cartesian() method to do this maybe more efficiently. However if your data set is large, this sort of algorith
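For reference, a one-line sketch; note the result has |A| x |B| elements, so it only makes sense for modest inputs:

    val crossed = rddA.cartesian(rddB)   // RDD[(A, B)]: every element of rddA paired with every element of rddB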

how to specify columns in groupby

2014-08-28 Thread MEETHU MATHEW
Hi all, I have an RDD  which has values in the  format "id,date,cost". I want to group the elements based on the id and date columns and get the sum of the cost  for each group. Can somebody tell me how to do this?   Thanks & Regards, Meethu M

Re: how to filter value in spark

2014-08-28 Thread Matthew Farrellee
On 08/28/2014 07:20 AM, marylucy wrote: fileA=1 2 3 4 one number a line,save in /sparktest/1/ fileB=3 4 5 6 one number a line,save in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1)) a.filter(param=>{b.lookup(p
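One way around the underlying problem: an RDD (b) cannot be referenced inside a transformation of another RDD (a); the common values can instead be computed with a join on the elements themselves. A sketch using the paths from the question:

    val a = sc.textFile("/sparktest/1/").map(x => (x, ()))
    val b = sc.textFile("/sparktest/2/").map(x => (x, ()))
    val common = a.join(b).keys          // the numbers present in both files, here 3 and 4
    common.collect().foreach(println)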

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet < vineet.hingor...@sap.com> wrote: > The file is compiling properly but when I try to run the jar file using

how to filter value in spark

2014-08-28 Thread marylucy
fileA=1 2 3 4 one number a line,save in /sparktest/1/ fileB=3 4 5 6 one number a line,save in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1)) a.filter(param=>{b.lookup(param._1).length>0}).map(_._1).foreach(prin

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spa

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread GADV
Not sure if this makes sense, but maybe it would be nice to have a kind of "flag" available within the code that tells me if I'm running in a "normal" situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka streaming, a

Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I don't know if it is a dependency problem but I don't want to have Hado

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > tried > mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests > dependency:tree > dep.txt > > Attached the dep. txt for your

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt Attached the dep.txt for your information. [WARNING] [WARNING] Some problems were encountered while building the effective settings [WARNING] Unrecognised tag: 'mirrors' (position: START_TAG s

Re: How to join two PairRDD together?

2014-08-28 Thread Yanbo Liang
Maybe you can refer to the sliding method of RDD, but right now it's an mllib private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha : > Can you paste the code? It's unclear to me how/when the out of memory is > occurring without seeing the code. > > > > > On

Re: Trying to run SparkSQL over Spark Streaming

2014-08-28 Thread praveshjain1991
Thanks for the reply. Sorry I could not ask more earlier. Trying to use a parquet file is not working at all. case class Rec(name:String,pv:Int) val sqlContext=new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD val d1=sc.parallelize(Array(("a",10),("b",3))).map(e=>Rec(e._1

  1   2   >