Re: Running 2 spark application in parallel

2015-10-22 Thread Simon Elliston Ball
If yarn has capacity to run both simultaneously it will. You should ensure you are not allocating too many executors for the first app and leave some space for the second. You may want to run the applications on different yarn queues to control resource allocation. If you run as a different

Analyzing consecutive elements

2015-10-22 Thread Sampo Niskanen
Hi, I have analytics data with timestamps on each element. I'd like to analyze consecutive elements using Spark, but haven't figured out how to do this. Essentially what I'd want is a transform from a sorted RDD [A, B, C, D, E] to an RDD [(A,B), (B,C), (C,D), (D,E)]. (Or some other way to

RE: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Sun, Rui
Frans, SparkR runs with R 3.1+. If possible, the latest version of R is recommended. From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: Thursday, October 22, 2015 11:17 AM To: Frans Thamura Cc: Ajay Chander; Doug Balog; user spark mailing list Subject: Re: Spark_1.5.1_on_HortonWorks SparkR is

Re: Kafka Streaming and Filtering > 3000 partitons

2015-10-22 Thread varun sharma
You can try something like this to filter by topic: val kafkaStringStream = KafkaUtils.createDirectStream[...] //you might want to create Stream by fetching offsets from zk kafkaStringStream.foreachRDD{rdd => val topics = rdd.map(_._1).distinct().collect() if (topics.length > 0) {
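A fuller sketch of this approach, using the messageHandler overload of createDirectStream so each record is tagged with its actual topic; ssc, kafkaParams and fromOffsets (Map[TopicAndPartition, Long]) are assumed to exist already:

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Tag every record with its topic so the stream can be split per topic downstream.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

    stream.foreachRDD { rdd =>
      val topics = rdd.map(_._1).distinct().collect()
      topics.foreach { topic =>
        val perTopic = rdd.filter(_._1 == topic).values
        // write perTopic out, one sink per topic
      }
    }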

Request for submitting Spark jobs in code purely, without jar

2015-10-22 Thread ??????
Hi developers, I've encountered a problem with Spark, and before opening an issue, I'd like to hear your thoughts. Currently, if you want to submit a Spark job, you'll need to write the code, build a jar, and then submit it with spark-submit or org.apache.spark.launcher.SparkLauncher.

Re: Request for submitting Spark jobs in code purely, without jar

2015-10-22 Thread Ali Tajeldin EDU
The Spark job-server project may help (https://github.com/spark-jobserver/spark-jobserver). -- Ali On Oct 21, 2015, at 11:43 PM, ?? wrote: > Hi developers, I've encountered some problem with Spark, and before opening > an issue, I'd like to hear your thoughts. >

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Deenar Toraskar
I can see this artifact in public repos http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.5.1 http://central.maven.org/maven2/org/apache/spark/spark-sql_2.10/1.5.1/spark-sql_2.10-1.5.1.jar check your proxy settings or the list of repos you are using. Deenar On 22 October 2015

Fwd: sqlContext load by offset

2015-10-22 Thread Kayode Odeyemi
Hi, I'm trying to load a postgres table using the following expression: val cachedIndex = cache.get("latest_legacy_group_index") val mappingsDF = sqlContext.load("jdbc", Map( "url" -> Config.dataSourceUrl(mode, Some("mappings")), "dbtable" -> s"(select userid, yid, username from legacyusers
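For reference, a sketch of the same load written against the Spark 1.5 reader API, with the subquery aliased so Postgres accepts it as a derived table; the URL, credentials and limit here are placeholders:

    // Placeholder connection settings; the real URL comes from Config.dataSourceUrl.
    val mappingsDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/legacy?user=app&password=secret",
      "dbtable" -> "(select userid, yid, username from legacyusers limit 50000) as mappings"
    )).load()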

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Steve Loughran
> On 22 Oct 2015, at 15:12, Ashish Shrowty wrote: > > I understand that there is some incompatibility with the API between Hadoop > 2.6/2.7 and Amazon AWS SDK where they changed a signature of >

spark.deploy.zookeeper.url

2015-10-22 Thread Michal Čizmazia
Does the spark.deploy.zookeeper.url configuration work correctly when I point it to a single virtual IP address with more hosts behind it (load balancer or round robin)? https://spark.apache.org/docs/latest/spark-standalone.html#high-availability ZooKeeper FAQ also discusses this topic:

Re: Large number of conf broadcasts

2015-10-22 Thread Anders Arpteg
Yes, seems unnecessary. I actually tried patching the com.databricks.spark.avro reader to only broadcast once per dataset, instead of every single file/partition. It seems to work just as fine, there are significantly fewer broadcasts, and I am not seeing out-of-memory issues any more. Strange that

Re: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Steve Loughran
On 22 Oct 2015, at 02:47, Ajay Chander wrote: Thanks for your time. I have followed your inputs and downloaded "spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I did a Pi test everything seems to be working fine, except that

"java.io.IOException: Connection reset by peer" thrown on the resource manager when launching Spark on Yarn

2015-10-22 Thread PashMic
Hi all, I am trying to launch a Spark job using yarn-client mode on a cluster. I have already tried spark-shell with yarn and I can launch the application. But I would also like to be able to run the driver program from, say, Eclipse, while using the cluster to run the tasks. I have also added

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Sander van Dijk
I don't think it is possible in the way you try to do it. It is important to remember that the statements you mention only set up the stream stages, before the stream is actually running. Once it's running, you cannot change, remove or add stages. I am not sure how you determine your condition
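One hedged way around this, if the reset condition can be carried in the data itself, is to let incoming records carry a reset flag so the update function overwrites rather than accumulates; the type and stream names below are illustrative:

    case class Delta(value: Long, reset: Boolean)   // illustrative input record

    // When a reset marker arrives for a key, restart the sum from zero instead of the
    // previous state; otherwise keep accumulating.
    def updateFunc(incoming: Seq[Delta], state: Option[Long]): Option[Long] = {
      val base = if (incoming.exists(_.reset)) 0L else state.getOrElse(0L)
      Some(base + incoming.filterNot(_.reset).map(_.value).sum)
    }

    // deltasByKey: DStream[(String, Delta)] is assumed.
    val sums = deltasByKey.updateStateByKey[Long](updateFunc _)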

Re: Incremental load of RDD from HDFS?

2015-10-22 Thread Chris Spagnoli
Found the problem. There was an rdd.distinct hiding in my code that I had overlooked, and that caused this behavior (because instead of iterating over the raw RDD, I was instead iterating over the RDD which had been derived from it). Thank you everyone! - Chris -- View this message in

Running 2 spark application in parallel

2015-10-22 Thread Suman Somasundar
Hi all, Is there a way to run 2 spark applications in parallel under Yarn in the same cluster? Currently, if I submit 2 applications, one of them waits till the other one is completed. I want both of them to start and run at the same time. Thanks, Suman.

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Uthayan Suthakar
I need to take the value from one RDD and update the state of the other RDD. Is this possible? On 22 October 2015 at 16:06, Uthayan Suthakar wrote: > Hello guys, > > I have a stream job that will carry out computations and update the state > (SUM the value). At some

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Thanks Deenar for your response. I am able to get version 1.5.0 and other lower versions; they are all fine, but just not 1.5.1. It's hard to believe it's the proxy settings. What is interesting is that IntelliJ does a few things when downloading this jar: putting into .m2

Re: Spark StreamingStatefull information

2015-10-22 Thread Adrian Tanase
The result of updateStateByKey is a DStream that emits the entire state every batch - as an RDD - nothing special about it. It is easy to join / cogroup with another RDD if you have the correct keys in both. You could load this one when the job starts and/or have it update with updateStateByKey as
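A minimal sketch of that join, assuming a keyed input stream (events) and a static reference RDD loaded when the job starts; names and values are illustrative:

    // referenceRDD holds per-key lookup data loaded at startup (illustrative values).
    val referenceRDD = sc.parallelize(Seq("client-1" -> "zone-a", "client-2" -> "zone-b"))

    // events: DStream[(String, Long)] is assumed; the state is a running sum per key.
    val state = events.updateStateByKey[Long]((values: Seq[Long], prev: Option[Long]) =>
      Some(prev.getOrElse(0L) + values.sum))

    // Each batch, join the full state snapshot (an ordinary RDD) with the reference data.
    val enriched = state.transform(stateRdd => stateRdd.join(referenceRDD))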

Re: Get the previous state string

2015-10-22 Thread Akhil Das
That way, you will eventually end up bloating up that list. Instead, you could push the stream to a noSQL database (like hbase or cassandra etc) and then read it back and join it with your current stream if that's what you are looking for. Thanks Best Regards On Thu, Oct 15, 2015 at 6:11 PM,

Re: Analyzing consecutive elements

2015-10-22 Thread Dylan Hogg
Hi Sampo, You could try zipWithIndex followed by a self join with shifted index values like this: val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E")) val rdd = sc.parallelize(arr) val sorted = rdd.sortByKey(true) val zipped = sorted.zipWithIndex.map(x => (x._2, x._1)) val pairs =
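A sketch of how that shifted self-join could continue, so that index i pairs each element with its successor:

    val shifted = zipped.map { case (i, v) => (i - 1, v) }   // element i+1 keyed by i
    val pairs = zipped.join(shifted).sortByKey(true).values  // RDD of (element, next element)
    // pairs.collect() => ((1,A),(3,B)), ((3,B),(7,C)), ((7,C),(8,D)), ((8,D),(9,E))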

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Akhil Das
Did you try passing the mysql connector jar through --driver-class-path Thanks Best Regards On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel wrote: > Hi all, > > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > >

Re: Output println info in LogMessage Info ?

2015-10-22 Thread Akhil Das
Yes, using log4j you can log everything. Here's a thread with example http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Sun, Oct 18, 2015 at 12:10 AM, kali.tumm...@gmail.com <
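A minimal sketch of that pattern; the object name and message are illustrative, and the output destination is whatever log4j.properties configures:

    import org.apache.log4j.Logger

    object MyJob {                                           // illustrative driver object
      @transient lazy val log = Logger.getLogger(getClass)   // routed by log4j.properties
      def main(args: Array[String]): Unit = {
        log.info("starting job")                             // use instead of println
      }
    }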

Re: Analyzing consecutive elements

2015-10-22 Thread Sampo Niskanen
Hi, Sorry, I'm not very familiar with those methods and cannot find the 'drop' method anywhere. As an example: val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E")) val rdd = sc.parallelize(arr) val sorted = rdd.sortByKey(true) // ... then what? Thanks. Best regards, *Sampo

Re: Can I convert RDD[My_OWN_JAVA_CLASS] to DataFrame in Spark 1.3.x?

2015-10-22 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection if you haven't seen that already. Thanks Best Regards On Thu, Oct 15, 2015 at 10:56 PM, java8964 wrote: > Hi, Sparkers: > > I wonder if I can convert a RDD
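A sketch of the reflection approach from that guide, with a placeholder case class standing in for the user's own class (for a Java bean, sqlContext.createDataFrame(rdd, classOf[MyClass]) is the analogous call):

    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, name: String)   // placeholder for your own class

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).toDF()
    df.registerTempTable("records")
    sqlContext.sql("SELECT name FROM records WHERE id = 1").show()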

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Akhil Das
Can you try fixing spark.blockManager.port to a specific port and see if the issue still exists? Thanks Best Regards On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi wrote: > Hi, > > I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN. > The job is reading data from

Re: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Akhil Das
Did you read https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application Thanks Best Regards On Thu, Oct 15, 2015 at 11:31 PM, jeff.sadow...@gmail.com < jeff.sadow...@gmail.com> wrote: > I am having issues trying to setup spark to run jobs simultaneously. > > I

Accessing external Kerberised resources from Spark executors in Yarn client/cluster mode

2015-10-22 Thread Deenar Toraskar
Hi All I am trying to access a SQLServer that uses Kerberos for authentication from Spark. I can successfully connect to the SQLServer from the driver node, but any connections to SQLServer from executors fails with "Failed to find any Kerberos tgt".

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Akhil Das
Convert your data to parquet, it saves space and time. Thanks Best Regards On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data uncompressed in memory. For

Re: java TwitterUtils.createStream() how create "user stream" ???

2015-10-22 Thread Akhil Das
I don't think the one that comes with spark would listen to specific user feeds, but yes you can filter out the public tweets by passing the filters argument. Here's an example for you to start

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Sean Owen
Maven, in general, does some local caching to avoid hitting the repo every time. It's possible this is why you're not seeing 1.5.1. On the command line you can, for example, add "mvn -U ..." Not sure of the equivalent in IntelliJ, but it will be updating the same repo IJ sees. Try that. The repo

Re: [Spark Streaming] Design Patterns forEachRDD

2015-10-22 Thread Nipun Arora
Hi Sandip, Thanks for your response. I am not sure if this is the same thing. I am looking for a way to connect to an external network as shown in the example. @All - Can anyone else let me know if they have a better solution? Thanks Nipun On Wed, Oct 21, 2015 at 2:07 PM, Sandip Mehta

Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-10-22 Thread Babar Tareen
Hi, I am getting the following exception when submitting a job to Spark 1.5.x from Scala. The same code works with Spark 1.4.1. Any clues as to what might be causing the exception? Code (App.scala): import org.apache.spark.SparkContext object App { def main(args: Array[String]) = { val l =

Re: Save RandomForest Model from ML package

2015-10-22 Thread Sujit Pal
Hi Sebastian, You can save models to disk and load them back up. In the snippet below (copied out of a working Databricks notebook), I train a model, then save it to disk, then retrieve it back into model2 from disk. import org.apache.spark.mllib.tree.RandomForest > import
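A sketch along the same lines with the MLlib (not ML) API; trainingData and the save path are assumptions:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // trainingData: RDD[LabeledPoint] is assumed to exist already.
    val model = RandomForest.trainClassifier(trainingData, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), numTrees = 50,
      featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)

    model.save(sc, "/models/rf")                            // assumed path
    val model2 = RandomForestModel.load(sc, "/models/rf")   // reload for scoring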

Spark YARN Shuffle service wire compatibility

2015-10-22 Thread Jong Wook Kim
Hi, I’d like to know if there is a guarantee that the Spark YARN shuffle service has wire compatibility between 1.x versions. I could run a Spark 1.5 job with YARN nodemanagers having shuffle service 1.4, but it might’ve been just a coincidence. Now we’re upgrading CDH from 5.3 to 5.4, whose

[SPARK STREAMING] polling based operation instead of event based operation

2015-10-22 Thread Nipun Arora
Hi, In general, in Spark Streaming one can do transformations (filter, map etc.) or output operations (collect, forEach) etc. in an event-driven paradigm, i.e. the action happens only if a message is received. Is it possible to do actions every few seconds in a polling-based fashion, regardless if
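One hedged way to get a fixed cadence is a ConstantInputDStream, which fires on every batch interval whether or not any messages arrive; ssc is the StreamingContext:

    import org.apache.spark.streaming.dstream.ConstantInputDStream

    // The single-element RDD is just a tick source.
    val tick = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(1)))
    tick.foreachRDD { _ =>
      // runs once per batch interval, independent of whether any messages arrived
    }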

Re: Spark issue running jar on Linux vs Windows

2015-10-22 Thread Ted Yu
RemoteActorRefProvider is in akka-remote_2.10-2.3.11.jar jar tvf ~/.m2/repository/com/typesafe/akka/akka-remote_2.10/2.3.11/akka-remote_2.10-2.3.11.jar | grep RemoteActorRefProvi 1761 Fri May 08 16:13:02 PDT 2015 akka/remote/RemoteActorRefProvider$$anonfun$5.class 1416 Fri May 08 16:13:02 PDT

Spark issue running jar on Linux vs Windows

2015-10-22 Thread Michael Lewis
Hi, I have a Spark driver process that I have built into a single ‘fat jar’. This runs fine; in Cygwin, on my development machine, I can run: scala -cp my-fat-jar-1.0.0.jar com.foo.MyMainClass This works fine, it will submit Spark jobs, they process, all good. However, on Linux (all Jars

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Thanks Sean. I did that but it didn't seem to help. However, I manually downloaded both the pom and jar files from the site, and then ran mvn dependency:purge-local-repository to clean up the local repo (+ re-download them all). All are good and the error went away. Thanks a lot for

Best way to use Spark UDFs via Hive (Spark Thrift Server)

2015-10-22 Thread Dave Moyers
Hi, We have several udf's written in Scala that we use within jobs submitted into Spark. They work perfectly with the sqlContext after being registered. We also allow access to saved tables via the Hive Thrift server bundled with Spark. However, we would like to allow Hive connections to use

Spark StreamingStatefull information

2015-10-22 Thread Arttii
Hi, So I am working on a use case where clients are walking in and out of geofences and sending messages based on that. I currently have some in-memory broadcast vars to do certain lookups for client and geofence info; some of this is also coming from Cassandra. My current quandary is that I need

Re: Accessing external Kerberised resources from Spark executors in Yarn client/cluster mode

2015-10-22 Thread Doug Balog
Another thing to check is to make sure each one of your executor nodes has the JCE jars installed. try{ javax.crypto.Cipher.getMaxAllowedKeyLength("AES") > 128 } catch { case e:java.security.NoSuchAlgorithmException => false } Setting "-Dsun.security.krb5.debug=true" and

Re: Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-22 Thread Cody Koeninger
That sounds like a networking issue to me. Stuff to try - make sure every executor node can talk to every kafka broker on relevant ports - look at firewalls / network config. Even if you can make the initial connection, something may be happening after a while (we've seen ... "interesting"...

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Adnan Haider
I believe spark.rdd.compress requires the data to be serialized. In my case, I have data already compressed which becomes decompressed as I try to cache it. I believe even when I set spark.rdd.compress to true, Spark will still decompress the data and then serialize it and then compress the
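A sketch of the serialized-cache variant that spark.rdd.compress actually applies to; the app name and input path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf().setAppName("compressed-cache")
      .set("spark.rdd.compress", "true")           // only affects *_SER storage levels
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/input")   // placeholder path
    data.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized, so blocks can be compressed
    data.count()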

Re: Error in starting Spark Streaming Context

2015-10-22 Thread Tiago Albineli Motta
Can't say what is happening, but I have a similar problem here. While for you the source is: org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been initialized For me it is: org.apache.spark.SparkException: org.apache.spark.streaming.dstream.MapPartitionedDStream@7a2d07cc has

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Xiao Li
A few months ago, I used the DB2 jdbc drivers. I hit a couple of issues when using --driver-class-path. At the end, I used the following command to bypass most of issues: ./bin/spark-submit --jars /Users/smile/db2driver/db2jcc.jar,/Users/smile/db2driver/db2jcc_license_cisuz.jar --master local[*]

Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Ashish Shrowty
I understand that there is some incompatibility with the API between Hadoop 2.6/2.7 and Amazon AWS SDK where they changed a signature of com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold. The JIRA indicates that this would be fixed in Hadoop 2.8.

Saving RDDs in Tachyon

2015-10-22 Thread mark
I have Avro records stored in Parquet files in HDFS. I want to read these out as an RDD and save that RDD in Tachyon for any spark job that wants the data. How do I save the RDD in Tachyon? What format do I use? Which RDD 'saveAs...' method do I want? Thanks

Re: Save to paquet files failed

2015-10-22 Thread Raghavendra Pandey
Can you increase the number of partitions and try... Also, I don't think you need to cache the DFs before saving them... You can do away with that as well... Raghav On Oct 23, 2015 7:45 AM, "Ram VISWANADHA" wrote: > Hi , > I am trying to load 931MB file into an RDD, then

How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread y
I'm managing Spark Streaming applications which run on Cloud Dataproc (https://cloud.google.com/dataproc/). Spark Streaming applications running on a Cloud Dataproc cluster seem to run in client mode on YARN. Some of my applications sometimes stop due to the application failure. I'd like YARN to

Re: How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread Saisai Shao
Looks like currently there's no way for Spark Streaming to restart automatically in yarn-client mode, because in yarn-client mode the AM and driver are two separate processes; YARN only controls the restart of the AM, not the driver, so it is not supported in yarn-client mode. You can write some scripts to monitor

Save to paquet files failed

2015-10-22 Thread Ram VISWANADHA
Hi, I am trying to load a 931MB file into an RDD, then create a DataFrame and store the data in a Parquet file. The save method of the Parquet file is hanging. I have set the timeout to 1800 but the system still fails to respond and hangs. I can’t spot any errors in my code. Can someone help me?

Re: Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Sandy Ryza
Hi Deenar, The version of Spark you have may not be compiled with YARN support. If you inspect the contents of the assembly jar, does org.apache.spark.deploy.yarn.ExecutorLauncher exist? If not, you'll need to find a version that does have the YARN classes. You can also build your own using

Saving offset while reading from kafka

2015-10-22 Thread Ramkumar V
Hi, I have written a Spark Streaming application using the Kafka stream, and it writes to HDFS every hour (batch time). I would like to know how to get or commit the offset of the Kafka stream while writing to HDFS, so that if there is any issue or redeployment, I'll start from the point where I did a
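A sketch of reading the offsets off the direct stream so they can be persisted next to the HDFS output; directStream is assumed to come from KafkaUtils.createDirectStream, and the actual storage of the ranges is left as a comment:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // After the hourly HDFS write succeeds, record each range somewhere durable so a
      // redeployed job can resume by passing these values as fromOffsets to createDirectStream.
      ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }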

(SOLVED) Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-22 Thread t3l
I was able to solve this by myself. What I did was change the way Spark computes the partitioning for binary files. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-Spark-creates-3-partitions-What-can-I-do-tp25140p25170.html Sent from the Apache

Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-22 Thread JoneZhang
1. Will Spark use disk when memory is not enough at the MEMORY_ONLY storage level? 2. If not, how can I set the storage level when I use Hive on Spark? 3. Does Spark have any intention of dynamically deciding between Hive on MapReduce and Hive on Spark, based on SQL features? Thanks in advance Best

Re: Can we split partition

2015-10-22 Thread Akhil Das
Did you try coalesce? It doesn't shuffle the data around. Thanks Best Regards On Wed, Oct 21, 2015 at 10:27 AM, shahid wrote: > Hi > > I have a large partition(data skewed) i need to split it to no. of > partitions, repartitioning causes lot of shuffle. Can we do that..? > >

Re: Analyzing consecutive elements

2015-10-22 Thread Adrian Tanase
Drop is a method on scala’s collections (array, list, etc) - not on Spark’s RDDs. You can look at it as long as you use mapPartitions or something like reduceByKey, but it totally depends on the use-cases you have for analytics. The others have suggested better solutions using only spark’s

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-22 Thread Xiao Li
Actually, I found a design issue in self joins. When we have multiple-layer projections above an alias, the information about the relation between the alias and the actual columns is lost. Thus, when resolving the alias in self joins, the rules treat the alias (e.g., in a Projection) as normal columns. This

Save RandomForest Model from ML package

2015-10-22 Thread Sebastian Kuepers
Hey, I am trying to figure out the best practice for saving and loading models which have been fitted with the ML package - i.e. with the RandomForest classifier. There is PMML support in the MLlib package afaik, but not in ML - is that correct? How do you approach this, so that you do not have to fit

Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-22 Thread Conor Fennell
Hi, Firstly want to say a big thanks to Cody for contributing the kafka direct stream. I have been using the receiver based approach for months but the direct stream is a much better solution for my use case. The job in question is now ported over to the direct stream doing idempotent outputs

RE: Analyzing consecutive elements

2015-10-22 Thread Andrianasolo Fanilo
Hi Sampo, There is a sliding method you could try inside the org.apache.spark.mllib.rdd.RDDFunctions package, though it’s DeveloperApi stuff (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions) import org.apache.spark.{SparkConf, SparkContext}
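A sketch of sliding applied to the example from this thread:

    import org.apache.spark.mllib.rdd.RDDFunctions._   // brings sliding() into scope (DeveloperApi)

    val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
    val sorted = sc.parallelize(arr).sortByKey(true)
    val pairs = sorted.sliding(2).map { case Array(a, b) => (a, b) }
    pairs.collect().foreach(println)
    // ((1,A),(3,B)) ((3,B),(7,C)) ((7,C),(8,D)) ((8,D),(9,E))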

Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Hello everyone, I am doing some analytics experiments under a 4 server stand-alone cluster in a spark shell, mostly involving a huge database with groupBy and aggregations. I am picking 6 groupBy columns and returning various aggregated results in a dataframe. GroupBy fields are of two types,

[jira] Ankit shared "SPARK-11213: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4" with you

2015-10-22 Thread Ankit (JIRA)
Ankit shared an issue with you: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4. Key: SPARK-11213. URL:

Re: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Xiao Li
Hi, Saif, Could you post your code here? It might help others reproduce the errors and give you a correct answer. Thanks, Xiao Li 2015-10-22 8:27 GMT-07:00 : > Hello everyone, > > I am doing some analytics experiments under a 4 server stand-alone cluster > in a

Re: Analyzing consecutive elements

2015-10-22 Thread Sampo Niskanen
Hi, Excellent, the sliding method seems to be just what I'm looking for. Hope it becomes part of the stable API, I'd assume there to be lots of uses with time-related data. Dylan's suggestion seems reasonable as well, if DeveloperApi is not an option. Thanks! Best regards, *Sampo

Re: [jira] Ankit shared "SPARK-11213: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4" with you

2015-10-22 Thread Anubhav Agarwal
Hi Ankit, Here is my solution for this: 1) Download the latest Spark 1.5.1 (I just copied the following link from spark.apache.org; if it doesn't work then grab a new one from the website.) wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz 2) Unzip the folder and rename/move

[Spark Streaming] How do we reset the updateStateByKey values.

2015-10-22 Thread Uthayan Suthakar
Hello guys, I have a stream job that will carry out computations and update the state (SUM the value). At some point, I would like to reset the state. I could drop the state by setting 'None' but I don't want to drop it. I would like to keep the state but update it. For example:

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Igor Berman
Many use it. How do you add the AWS SDK to the classpath? Check in the Environment UI what is on the classpath. You should make sure the version on your classpath is compatible with the one Spark was compiled with; I think 1.7.4 is compatible (at least we use it). Make sure that you don't get other versions from other

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Eugen Cepoi
Huh, indeed this worked, thanks. Do you know why this happens? Is that a known issue? Thanks, Eugen 2015-10-22 19:08 GMT+07:00 Akhil Das : > Can you try fixing spark.blockManager.port to specific port and see if the > issue exists? > > Thanks > Best Regards > > On

Python worker exited unexpectedly (crashed)

2015-10-22 Thread shahid
Hi, I am running a 10-node standalone cluster on AWS and loading 100G of data from HDFS, doing a first groupBy operation and then generating pairs from the grouped RDD (key,[a1,b1]), (key,[a,b,c]), generating the pairs like (a1,b1),(a,b),(a,c) ... The PairRDD will get large in size. Some stats from the UI when

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Thanks, sorry I cannot share the data and I am not sure how significant it would be for you. I am reproducing the issue on a smaller piece of the content to see whether I can find a reason for the inconsistency. val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band ", "aget",

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
Hi Spark users and developers, I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data) which ignore DirectParquetOutputCommitter if append mode is selected. The logic was that it is unsafe to use because it is not possible to revert a failed job in append

Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Deenar Toraskar
Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 ( http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz) working with CDH 5.4.0 in local mode on a cluster with Kerberos. It works well including connecting to the Hive metastore. I am facing an issue

Re: Error in starting Spark Streaming Context

2015-10-22 Thread Tiago Albineli Motta
Solved! The problem has nothing to do with the class and object refactoring. But in the process of this refactoring I made a change that is similar to your code. Before this refactoring, I processed the DStream inside the function that I sent to StreamingContext.getOrCreate. Afterwards, I started processing

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Never mind my last email. res2 is filtered, so my test does not make sense. The issue is not reproduced there. I have the problem somewhere else. From: Ellafi, Saif A. Sent: Thursday, October 22, 2015 12:57 PM To: 'Xiao Li' Cc: user Subject: RE: Spark groupby and agg inconsistent and missing data

Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
Hi - I tried to download Spark SQL 2.10, version 1.5.1, from IntelliJ using the Maven library: -Project Structure -Global Library, click on the + to select Maven Repository -Type in org.apache.spark to see the list. -The resulting list only shows versions up to spark-sql_2.10-1.1.1 -I tried

Fwd: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Jeff Sadowski
On Thu, Oct 22, 2015 at 5:40 AM, Akhil Das wrote: > Did you read > https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application > I did. I had set the option spark.scheduler.mode FAIR in conf/spark-defaults.conf and created

Re: Large number of conf broadcasts

2015-10-22 Thread Koert Kuipers
I am seeing the same thing. It's gone completely crazy creating broadcasts for the last 15 mins or so. Killing it... On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote: > Hi, > > Running spark 1.5.0 in yarn-client mode, and am curious why there are so > many broadcast