Problem running a job on more than one worker

2014-08-18 Thread Rasika Pohankar
Hello, I am trying to run a job on two workers. I have a cluster of 3 computers where one is the master and the other two are workers. I am able to successfully register the separate physical machines as workers in the cluster. When I run a job with a single worker connected, it runs successf

Re: spark - reading hdfs files every 5 minutes

2014-08-18 Thread Akhil Das
Spark Streaming is the best fit for this use case. Basically you create a streaming context pointing to that directory, and you can also set the streaming interval (in your case it's 5 minutes). Spark Streaming will only process the
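
A minimal sketch of the setup described above, assuming the files land in a single HDFS directory; the path and application name are hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    object HdfsDirStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hdfs-dir-stream")
        // 5-minute batch interval: each batch picks up only files newly added to the directory
        val ssc = new StreamingContext(conf, Minutes(5))
        // hypothetical directory where the source drops its save-<timestamp>.csv files
        val lines = ssc.textFileStream("hdfs:///data/incoming")
        lines.foreachRDD(rdd => println("records in this batch: " + rdd.count()))
        ssc.start()
        ssc.awaitTermination()
      }
    }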

Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hmm I thought as much. I am using Cassandra with the Spark connector. What I really need is a RDD created from a query against Cassandra of the form "where partition_key = :id" where :id is taken from a list. Some grouping of the ids would be a way to partition this. On Mon, Aug 18, 2014 at 3:42

Re: spark streaming updataStateByKey clear old data

2014-08-18 Thread darwen
Yeah, that's what I do -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-updataStateByKey-clear-old-data-tp12340p12357.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark - Identifying and skipping processed data in hdfs

2014-08-18 Thread Jörn Franke
Hi, This sounds more like a task for complex event processing using Spark Streaming or Storm. There you can basically define time windows, such as 5 minutes, and do some analysis on the data within this time window. Can you give an example of what you do during processing? Best regards, Le 19 ao

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
BTW, we heard that it is not so easy to set up/admin Kafka on AWS; if any of you have had good or bad experiences, do you mind sharing them with us? Thanks. I know this is off topic for the Spark user group, so I wouldn't mind if you just reply to my email address. Thanks in advance. Wei On Mon, Aug 18,

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
Thank you all for responding to my question. I am pleasantly surprised by how many prompt responses I got. It shows the strength of the Spark community. Kafka is still an option for us, I will check out the link provided by Dibyendu. Meanwhile if someone out there already figured this out with K

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-18 Thread Steve Lewis
OK I tried your build - First you need to put sbt in C:\sbt Then you get Microsoft Windows [Version 6.2.9200] (c) 2012 Microsoft Corporation. All rights reserved. e:\>which java /cygdrive/c/Program Files/Java/jdk1.6.0_25/bin/java e:\>java -version java version "1.6.0_25" Java(TM) SE Runtime Envir

Cannot run program "Rscript" using SparkR

2014-08-18 Thread Stuti Awasthi
Hi All, I am using R 3.1 and Spark 0.9 and installed SparkR successfully. Now when I execute the "pi.R" example using spark master as local, the script executes fine. But when I try to execute the same example using the spark cluster master, it throws an Rscript error. Error : java.io.IOEx

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Dibyendu Bhattacharya
Dear All, Recently I have written a Spark Kafka Consumer to solve this problem. We have also seen issues with KafkaUtils, which uses the high-level Kafka consumer, where the consumer code has no handle on offset management. The code below solves this problem, and it is being tested in our Spark Clus

RE: Data loss - Spark streaming and network receiver

2014-08-18 Thread Shao, Saisai
I think Spark Streaming currently lacks a data acknowledging mechanism for when data is stored and replicated in the BlockManager, so data pulled from Kafka can potentially be lost, say if data is stored just in the BlockGenerator and not yet in the BM while in the meantime Kafka itself commits the consumer offset, a

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
Yeah, thanks a lot. I know that for people who understand lazy execution this seems straightforward. But for those who don't, it may become a liability. I've only tested its stability on a small example (which seems stable); hopefully that's not just luck. Can a committer confirm this? Yours Peng -

spark - Identifying and skipping processed data in hdfs

2014-08-18 Thread salemi
Hi, My data source stores the incoming data to HDFS every 10 seconds. The naming convention is save-<timestamp>.csv (see below) drwxr-xr-x ali supergroup 0 B 0 0 B save-1408396065000.csv drwxr-xr-x ali supergroup 0 B 0 0 B save-140839607.csv drwxr-xr-x ali

Re: Segmented fold count

2014-08-18 Thread Davies Liu
On Mon, Aug 18, 2014 at 7:41 PM, fil wrote: > fil wrote >> - Python functions like groupCount; these get reflected from their Python >> AST and converted into a Spark DAG? Presumably if I try and do something >> non-convertible this transformation process will throw an error? In other >> words thi

sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-18 Thread Fengyun RAO
I'm using CDH 5.1 with spark 1.0. When I try to run Spark SQL following the Programming Guide val parquetFile = sqlContext.parquetFile(path) If the "path" is a file, it throws an exception: Exception in thread "main" java.lang.IllegalArgumentException: Expected hdfs://*/file.parquet for be a d

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Tobias Pfeiffer
Hi Wei, On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wrote: > > Since our application cannot tolerate losing customer data, I am wondering > what is the best way for us to address this issue. > 1) We are thinking writing application specific logic to address the data > loss. To us, the problem seems

Re: Segmented fold count

2014-08-18 Thread fil
fil wrote > - Python functions like groupCount; these get reflected from their Python > AST and converted into a Spark DAG? Presumably if I try and do something > non-convertible this transformation process will throw an error? In other > words this runs in the JVM. Further to this - it seems that

Re: Does HiveContext support Parquet?

2014-08-18 Thread Silvio Fiorito
First the JAR needs to be deployed using the --jars argument. Then in your HQL code you need to use the DeprecatedParquetInputFormat and DeprecatedParquetOutputFormat as described here https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12 This is because SparkSQL is based
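
A hedged sketch of what such a table definition might look like, with the SerDe and Deprecated* class names taken from the parquet-hive bundle as described on that wiki page; the table and column names are made up, and an existing SparkContext sc is assumed:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // The parquet-hive bundle JAR supplying these classes must be passed via --jars.
    hiveContext.hql("""
      CREATE TABLE events_parquet (id INT, name STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    """)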

Re: How to use Spark Streaming from an HTTP api?

2014-08-18 Thread Silvio Fiorito
You need to create a custom receiver that submits the HTTP requests then deserializes the data and pushes it into the Streaming context. See here for an example: http://spark.apache.org/docs/latest/streaming-custom-receivers.html On 8/18/14, 6:20 PM, "bumble123" wrote: >I want to send an HTTP
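
A rough sketch of such a receiver, following the custom-receiver pattern from that guide; the polling interval and the way the HTTP response is fetched and split are illustrative assumptions, not OpenTSDB specifics:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.io.Source

    class HttpPollingReceiver(url: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Poll the endpoint on a background thread and push each line into the stream.
        new Thread("http-polling-receiver") {
          override def run(): Unit = {
            while (!isStopped) {
              val body = Source.fromURL(url).mkString
              body.split("\n").foreach(line => store(line))
              Thread.sleep(5000)
            }
          }
        }.start()
      }

      def onStop(): Unit = {}  // the polling loop exits once isStopped becomes true
    }

    // usage: val data = ssc.receiverStream(new HttpPollingReceiver("http://opentsdb-host:4242/api/query"))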

Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
We are prototyping an application with Spark Streaming and Kinesis. We use Kinesis to accept incoming txn data, and then process it using Spark Streaming. So far we have really liked both technologies, and we have seen both technologies maturing rapidly. We are almost settled on using these two tech

Processing multiple files in parallel

2014-08-18 Thread SK
Hi, I have a piece of code that reads all the (csv) files in a folder. For each file, it parses each line, extracts the first 2 elements from each row, groups the tuples by key, and finally outputs the number of unique values for each key. val conf = new SparkConf().setAppNam
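
A hedged sketch of one way to express this (not the poster's actual code, which is truncated above); the folder path and app name are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val conf = new SparkConf().setAppName("unique-values-per-key")
    val sc = new SparkContext(conf)

    // Read every CSV in the folder, keep the first two fields of each row,
    // and count the distinct second-field values per first-field key.
    val counts = sc.textFile("hdfs:///data/folder/*.csv")
      .map(_.split(","))
      .filter(_.length >= 2)
      .map(fields => (fields(0), fields(1)))
      .distinct()
      .groupByKey()
      .mapValues(_.size)

    counts.collect().foreach(println)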

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Oh sorry, just to be more clear - writing from the driver program is only safe if the amount of data you are trying to write is small enough to fit in memory on the driver. I looked at your code, and since you are just writing a few things each time interval, this seems safe. -Vida On

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Hi John, It seems like the original problem you had was that you were initializing the RabbitMQ connection on the driver, but then calling the code to write to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't see your code). That's definitely a problem because the connection can
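
A hedged sketch of the pattern being described: create the connection on the worker, inside each partition, so nothing non-serializable is shipped from the driver. The host, queue name, and dstream (the DStream being written out) are illustrative assumptions:

    import com.rabbitmq.client.ConnectionFactory

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Connection is created (and closed) on the worker, once per partition.
        val factory = new ConnectionFactory()
        factory.setHost("rabbitmq-host")
        val connection = factory.newConnection()
        val channel = connection.createChannel()
        records.foreach { record =>
          channel.basicPublish("", "myQueue", null, record.toString.getBytes("UTF-8"))
        }
        channel.close()
        connection.close()
      }
    }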

spark-submit with HA YARN

2014-08-18 Thread Matt Narrell
Hello, I have an HA-enabled YARN cluster with two resource managers. When submitting jobs via "spark-submit --master yarn-cluster", it appears that the driver is looking explicitly for the "yarn.resourcemanager.address" property rather than round-robining through the resource managers via the

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread cesararevalo
Thanks, Zhan, for the follow up. But do you know how I am supposed to set that table name on the jobConf? I don't have access to that object from my client driver. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-connecting-from-Spa

How to use Spark Streaming from an HTTP api?

2014-08-18 Thread bumble123
I want to send an HTTP request (specifically to OpenTSDB) to get data. I've been looking at the StreamingContext api and don't seem to see any methods that can connect to this. Has anyone tried connecting Spark Streaming to a server via HTTP requests before? How have you done it? -- View this me

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Zhan Zhang
I think the behavior is by design. Because if b is not persisted, then on each call to b.collect, broadcasted points to the new broadcast variable, which is serialized by the driver and fetched by the executors. If you do persist, you don't expect the RDD to get changed due to a new broadcast variable. Thanks.

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote: > I am testing our application(similar to "personalised page rank" using > Pregel, and note that each vertex property will need pretty much more space > to store after new iteration) [...] But when we ran it on larger graph(e.g. LiveJouranl), it

setCallSite for API backtraces not showing up in logs?

2014-08-18 Thread John Salvatier
What's the correct way to use setCallSite to get the change to show up in the spark logs? I have something like class RichRDD (rdd : RDD[MyThing]) { def mySpecialOperation() { rdd.context.setCallSite("bubbles and candy!") rdd.map() val result = rdd.groupBy()
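
A hedged reconstruction of the wrapper being described (the original snippet is truncated above); note the pairing with clearCallSite() so the label does not stick to later, unrelated operations. The element type and the map/groupBy bodies are stand-ins:

    import org.apache.spark.rdd.RDD

    class RichRDD(rdd: RDD[String]) {
      def mySpecialOperation(): RDD[(Int, Iterable[String])] = {
        val sc = rdd.context
        sc.setCallSite("mySpecialOperation")   // label recorded for the stages defined below
        val result = rdd.map(_.trim).groupBy(_.length)
        sc.clearCallSite()                     // restore default call-site capture afterwards
        result
      }
    }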

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Zhan Zhang
Looks like hbaseTableName is null, probably caused by incorrect configuration. String hbaseTableName = jobConf.get(HBaseSerDe.HBASE_TABLE_NAME); setHTable(new HTable(HBaseConfiguration.create(jobConf), Bytes.toBytes(hbaseTableName))); Here is the definition. public static final Strin

spark - reading hdfs files every 5 minutes

2014-08-18 Thread salemi
Hi, My data source stores the incoming data to HDFS every 10 seconds. The naming convention is save-<timestamp>.csv (see below) drwxr-xr-x ali supergroup 0 B 0 0 B save-1408396065000.csv drwxr-xr-x ali supergroup 0 B 0 0 B save-140839607.csv drwxr-xr-x ali supe

Re: Writing to RabbitMQ

2014-08-18 Thread jschindler
Well, it looks like I can use the .repartition(1) method to stuff everything into one partition, which gets rid of the duplicate messages I send to RabbitMQ, but that seems like a bad idea perhaps. Wouldn't that hurt scalability? -- View this message in context: http://apache-spark-user-list.1

Re: Spark Streaming Data Sharing

2014-08-18 Thread Ruchir Jha
The Spark Job that has the main DStream, could have another DStream that is listening for "stream subscription" requests. So when a subscription is received, you could do a filter/forEach on the main DStream and respond to that one request. So you're basically creating a stream server that is capab

Re: Writing to RabbitMQ

2014-08-18 Thread jschindler
I am running into a different problem relating to this spark app right now and I'm thinking it may be due to the fact that I am publishing to RabbitMQ inside of a foreachPartition loop. I would like to publish once for each window and the app is publishing a lot more than that (it varies sometimes

Re: Extracting unique elements of an ArrayBuffer

2014-08-18 Thread Sean Owen
The result of groupByKey is an RDD of K and Iterable[V]. The values may in fact be ArrayBuffer[V] but this is not guaranteed to you by the API. Printing it as a String would show it's an ArrayBuffer but not (necessarily) as far as the types are concerned. And .distinct() is not a method of Iterable
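
A small sketch of that suggestion, assuming an existing SparkContext sc; the sample data mirrors the example from the original question:

    import org.apache.spark.SparkContext._

    val grouped = sc.parallelize(Seq(
      ("2013-04", "s1"), ("2013-04", "s2"), ("2013-04", "s3"),
      ("2013-04", "s1"), ("2013-04", "s2"), ("2013-04", "s4")
    )).groupByKey()

    // Treat the values as a plain Iterable and build a Set to drop duplicates.
    val uniquePerKey = grouped.mapValues(_.toSet)

    uniquePerKey.collect().foreach(println)   // (2013-04, Set(s1, s2, s3, s4))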

Extracting unique elements of an ArrayBuffer

2014-08-18 Thread SK
Hi, I have a piece of code in which the result of a groupByKey operation is as follows: (2013-04, ArrayBuffer(s1, s2, s3, s1, s2, s4)) The first element is a String value representing a date and the ArrayBuffer consists of (non-unique) strings. I want to extract the unique elements of the Array

java.nio.channels.CancelledKeyException in Graphx Connected Components

2014-08-18 Thread Jeffrey Picard
Hey all, I'm trying to run connected components in GraphX on about 400GB of data on 50 m3.xlarge nodes on EMR. I keep getting java.nio.channels.CancelledKeyException when it gets to "mapPartitions at VertexRDD.scala:347". I haven't been able to find much about this online, and nothing that seem

RE: Does HiveContext support Parquet?

2014-08-18 Thread lyc
I followed your instructions to try to load data in parquet format through hiveContext but failed. Do you happen to see what I did wrong in the following steps? The steps I am following are: 1. download "parquet-hive-bundle-1.5.0.jar" 2. revise hive-site.xml including this: hive.jar.direc

Re: Merging complicated small matrices to one big matrix

2014-08-18 Thread Davies Liu
rdd.flatMap(lambda x:x) maybe could solve your problem, it will convert an RDD from [[[1,2,3],[4,5,6]],[[7,8,9,],[10,11,12]]] into: [[1,2,3], [4,5,6], [7,8,9,], [10,11,12]] On Mon, Aug 18, 2014 at 2:42 AM, Chengi Liu wrote: > I have an rdd in pyspark which looks like follows: > It has two

Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
I'm curious to see that if you declare a broadcast wrapper as a var and overwrite it in the driver program, the modification can have a stable impact on all transformations/actions defined BEFORE the overwrite but executed lazily AFTER the overwrite: val a = sc.parallelize(1 to 10) var
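
A hedged reconstruction of the experiment being described (the original snippet is cut off above), assuming the var is the broadcast handle captured by a lazily evaluated map:

    val a = sc.parallelize(1 to 10)

    var broadcasted = sc.broadcast(1)

    // Transformation defined BEFORE the overwrite, but not executed yet (lazy).
    val b = a.map(_ + broadcasted.value)

    // Overwrite the broadcast handle before any action runs.
    broadcasted = sc.broadcast(2)

    // Per the thread, the action picks up whatever the var holds when it actually runs,
    // so the collected values reflect the second broadcast (2), not the first.
    b.collect()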

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
I removed the JAR that you suggested but now I get another error when I try to create the HiveContext. Here is the error: scala> val hiveContext = new HiveContext(sc) error: bad symbolic reference. A signature in HiveContext.class refers to term ql in package org.apache.hadoop.hive which is not av

Spark Streaming - saving DStream into hadoop throws execption if checkpoint is enabled.

2014-08-18 Thread salemi
Hi, Using the code below to save a DStream into Hadoop throws an exception if checkpointing is enabled; see the really simple example below. If I take out the ssc.checkpoint(...), the code works. val ssc = new StreamingContext(sparkConf, Seconds(1)) ssc.checkpoint(hdfsCheckPointUrl)
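
A hedged reconstruction of a minimal program along the lines described (the original is truncated above); the input source, paths, and output prefix are hypothetical stand-ins:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("dstream-save-with-checkpoint")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints/myapp")          // removing this line reportedly avoids the exception

    val lines = ssc.socketTextStream("localhost", 9999)  // any input source stands in here
    lines.saveAsTextFiles("hdfs:///output/lines")        // save each batch to Hadoop

    ssc.start()
    ssc.awaitTermination()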

RE: spark kryo serializable exception

2014-08-18 Thread Sameer Tilak
Hi, I was able to set this parameter in my application to resolve this issue: set("spark.kryoserializer.buffer.mb", "256") Please let me know if this helps. Date: Mon, 18 Aug 2014 21:50:02 +0800 From: dujinh...@hzduozhun.com To: user@spark.apache.org Subject: spark kryo serializable exception
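
A small sketch showing where that setting goes when building the configuration; the app name is illustrative, and in Spark 1.x the default Kryo buffer is only a few MB:

    import org.apache.spark.SparkConf

    // Raise the Kryo serialization buffer so large objects fit when being written out.
    val conf = new SparkConf()
      .setAppName("kryo-buffer-example")
      .set("spark.kryoserializer.buffer.mb", "256")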

Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-18 Thread abhiguruvayya
Hello Mayur, The 3 in new RangePartitioner(3, partitionedFile) is also a hard-coded value for the number of partitions. Did you find a way I can avoid that? And besides the cluster size, the number of partitions also depends on the input data size. Correct me if I am wrong. -- View this message in

Spark Streaming Data Sharing

2014-08-18 Thread Levi Bowman
Based on my understanding something like this doesn't seem to be possible out of the box, but I thought I would write it up anyway in case someone has any ideas. We conceptually have one high-volume input stream; each streaming job is interested in either a subset of the stream or the entire st

spark kryo serializable exception

2014-08-18 Thread adu
Hi all, In an RDD map, I invoke an object that is serialized by standard Java serialization, and get this exception: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 13 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeAscii_

Re: Working with many RDDs in parallel?

2014-08-18 Thread Sean Owen
You won't be able to use RDDs inside of an RDD operation. I imagine your immediate problem is that the code you've elided references 'sc', and that gets referenced by the PairFunction and serialized, but it can't be. If you want to play it this way, parallelize across roots in Java. That is just use a
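
A hedged Scala sketch of that driver-side approach: submit one Spark job per root from parallel driver threads, with a generic loader standing in for however each per-root RDD is actually built (e.g. a Cassandra query); an existing SparkContext sc is assumed:

    val roots: Seq[String] = Seq("root1", "root2", "root3")   // hypothetical roots

    // A parallel collection runs the loop bodies on driver threads, so the
    // per-root jobs are submitted to the single SparkContext concurrently.
    val counts: Map[String, Long] = roots.par.map { root =>
      val rdd = sc.textFile("hdfs:///data/" + root)           // stand-in for the per-root query
      root -> rdd.count()
    }.seq.toMap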

Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hi All. I need to create a lot of RDDs starting from a set of "roots" and count the rows in each. Something like this: final JavaSparkContext sc = new JavaSparkContext(conf); List roots = ... Map res = sc.parallelize(roots).mapToPair(new PairFunction(){ public Tuple2 call(String root) throws

[GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Yifan LI
Hi, I am testing our application (similar to personalised PageRank using Pregel; note that each vertex property needs considerably more space to store after each new iteration), and it works correctly on a small graph (we have a single machine, 8 cores, 16G memory). But when we ran it on a larger

Re: OutOfMemory Error

2014-08-18 Thread Ghousia
But this would be applicable only to operations that have a shuffle phase. It might not be applicable to a simple map operation where a record is mapped to a new, huge value, resulting in an OutOfMemory error. On Mon, Aug 18, 2014 at 12:34 PM, Akhil Das wrote: > I believe spark.shuffle.memoryFr

Merging complicated small matrices to one big matrix

2014-08-18 Thread Chengi Liu
I have an RDD in pyspark which looks as follows: It has two sub-matrices (array, array) [ array([[-13.00771575, 0.2740844 , 0.9752694 , 0.67465999, -1.45741537, 0.546775 , 0.7900841 , -0.59473707, -1.11752044, 0.61564356], [ -0., 12.20115746, -0.

Re: Re: application as a service

2014-08-18 Thread Zhanfeng Huo
That helps a lot. Thanks. Zhanfeng Huo From: Davies Liu Date: 2014-08-18 14:31 To: ryaminal CC: u...@spark.incubator.apache.org Subject: Re: application as a service Another option is using Tachyon to cache the RDD, then the cache can be shared by different applications. See how to use Spark

Re: s3:// sequence file startup time

2014-08-18 Thread Cheng Lian
Maybe irrelevant, but this resembles a lot the S3 Parquet file issue we've met before. It takes a dozen minutes to read the metadata because the ParquetInputFormat tries to call getFileStatus for all part-files sequentially. Just checked SequenceFileInputFormat, and found that a MapFile may share s

Re: Spark: why need a masterLock when sending heartbeat to master

2014-08-18 Thread Victor Sheng
Thanks, I got it ! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-why-need-a-masterLock-when-sending-heartbeat-to-master-tp12256p12297.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Then it's definitely a jar conflict. Can you try removing this jar from the class path /opt/spark-poc/lib_managed/jars/org.spark-project.hive/hive-exec/hive-exec-0.12.0.jar Thanks Best Regards On Mon, Aug 18, 2014 at 12:45 PM, Cesar Arevalo wrote: > Nope, it is NOT null. Check this out: > > sc

Re: Segmented fold count

2014-08-18 Thread fil
Thanks for the reply! def groupCount(l): >gs = itertools.groupby(l) >return map(lambda (n, it): (n, sum(1 for _ in it)), gs) > > If you have an RDD, you can use RDD.mapPartitions(groupCount).collect() > Yes, I am interested in RDD - not pure Python :) I am new to Spark, can you explain:

RE: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Henry Hung
I slightly modified the code to use while(partitions.hasNext) { } instead of partitions.map(func). I suppose this can eliminate the uncertainty from lazy execution. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, August 18, 2014 3:10 PM To: MA33 YTHung1 Cc: user@

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
Nope, it is NOT null. Check this out: scala> hiveContext == null res2: Boolean = false And thanks for sending that link, but I had already looked at it. Any other ideas? I looked through some of the relevant Spark Hive code and I'm starting to think this may be a bug. -Cesar On Mon, Aug 18,

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Sean Owen
I think there was a more comprehensive answer to this recently. Tobias is right that it is not quite that simple: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E On Mon, Aug 18, 2014 at 8:04 AM, Henry Hung wrote: > Hi

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Akhil Das
You can create an RDD and then do a map or mapPartitions on it, where at the top you create the database connection and so on, then do the operations, and at the end close the connection. Thanks Best Regards On Mon, Aug 18, 2014 at 12:34 PM, Henry Hung wrote: > Hi All, > > > > Ple
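
A hedged sketch of that setup/cleanup pattern with mapPartitions; inputRdd, the JDBC URL, and the per-record work are placeholders, and the intermediate results are materialized so the connection is still open while records are processed:

    import java.sql.DriverManager

    // One connection per partition rather than one per record.
    val results = inputRdd.mapPartitions { records =>
      val conn = DriverManager.getConnection("jdbc:h2:mem:example")   // hypothetical database
      val stmt = conn.createStatement()
      val out = records.map { record =>
        // ... use stmt/conn to look up or enrich each record ...
        record.toString
      }.toList          // force evaluation before closing the connection
      stmt.close()
      conn.close()
      out.iterator
    }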

Re: OutOfMemory Error

2014-08-18 Thread Akhil Das
I believe spark.shuffle.memoryFraction is the one you are looking for. spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation and cogroups during shuffles, if spark.shuffle.spill is true. At any given time, the collective size of all in-memory maps used for shuffles is bounded
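
For reference, a small sketch of where that setting goes; the value 0.4 is an arbitrary example (the Spark 1.x default is 0.2) and the app name is illustrative:

    import org.apache.spark.SparkConf

    // Give shuffle aggregation a larger slice of the heap than the default.
    val conf = new SparkConf()
      .setAppName("shuffle-memory-example")
      .set("spark.shuffle.memoryFraction", "0.4")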

RE: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Henry Hung
Hi All, Please ignore my question, I found a way to implement it via old archive mails: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAF_KkPzpU4qZWzDWUpS5r9bbh=-hwnze2qqg56e25p--1wv...@mail.gmail.com%3E Best regards, Henry From: MA33 YTHung1 Sent: Monday, August 18, 2014 2

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Looks like your hiveContext is null. Have a look at this documentation. Thanks Best Regards On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo wrote: > Hello: > > I am trying to set up Spark to connect to a Hive table wh