Spark-Submit error

2015-07-30 Thread satish chandra j
Hi, I have submitted a Spark job with the options jars, class, and master set to *local*, but I am getting the error below: *dse spark-submit spark error exception in thread main java.io.IOException: InvalidRequestException(Why you have not logged in)* *Note: submitting from a DataStax Spark node* please let me kn

Re: Getting the number of slaves

2015-07-30 Thread Tathagata Das
To clarify, that is the number of executors requested by the SparkContext from the cluster manager. On Tue, Jul 28, 2015 at 5:18 PM, amkcom wrote: > try sc.getConf.getInt("spark.executor.instances", 1) > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.c
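A rough sketch of comparing the two numbers TD distinguishes here, assuming a live SparkContext `sc` (variable names are illustrative, not from the thread):

    // What the driver asked the cluster manager for:
    val requested = sc.getConf.getInt("spark.executor.instances", 1)
    // What actually registered: getExecutorMemoryStatus has one entry per
    // executor JVM plus one for the driver, so subtract one.
    val registered = sc.getExecutorMemoryStatus.size - 1
    println(s"requested=$requested, registered=$registered")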

Re: Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FailureSuite.scala This may help. On Thu, Jul 30, 2015 at 10:42 PM, bit1...@163.com wrote: > The following is copied from the paper, is something related with rdd > lineage. Is there a unit test that covers this sce

Re: Re: How RDD lineage works

2015-07-30 Thread bit1...@163.com
The following is copied from the paper; it is related to RDD lineage. Is there a unit test that covers this scenario (RDD partition lost and recovered)? Thanks. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just th
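A small sketch of the lineage idea from the paper, assuming a spark-shell session; toDebugString prints the chain of parent RDDs that Spark would replay to rebuild a lost partition:

    val rdd = sc.parallelize(1 to 1000, 4)  // 4 partitions
      .map(_ * 2)
      .filter(_ % 3 == 0)
    // Each line of the output below is one lineage step; losing a partition
    // of `rdd` re-runs just these transformations for that partition.
    println(rdd.toDebugString)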

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])) and the issue has happened on the Kafka side, where my broker offsets go out of sync, or do not return l

Re: Heatmap with Spark Streaming

2015-07-30 Thread Tathagata Das
I do suggest that non-Spark-related discussions be taken to a different forum, as they do not directly contribute to the contents of this user list. On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY wrote: > Thanks for the valuable suggestion. > I also started with a JAX-RS RESTful service

Re: Re: How RDD lineage works

2015-07-30 Thread bit1...@163.com
Thanks TD and Zhihong for the guide. I will check it bit1...@163.com From: Tathagata Das Date: 2015-07-31 12:27 To: Ted Yu CC: bit1...@163.com; user Subject: Re: How RDD lineage works You have to read the original Spark paper to understand how RDD lineage works. https://www.cs.berkeley.edu/~

Re: Error in Spark Streaming after executing for a while.

2015-07-30 Thread Tathagata Das
Are you running it on YARN? Might be a known YARN credential expiry issue. Hari, any insights? On Thu, Jul 30, 2015 at 4:04 AM, Guillermo Ortiz wrote: > I'm executing a job with Spark Streaming and get this error every time once > the job has been running for a while (usually hours or days).

Re: Long running streaming application - worker death

2015-07-30 Thread Tathagata Das
How are you figuring out that the receiver isn't running? What does the streaming UI say regarding the location of the receivers? Also, what version of Spark are you using? On Sun, Jul 26, 2015 at 8:47 AM, Ashwin Giridharan wrote: > Hi Aviemzur, > > As of now the Spark workers just run indefinitel

Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
You have to read the original Spark paper to understand how RDD lineage works. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu wrote: > Please take a look at: > core/src/test/scala/org/apache/spark/CheckpointSuite.scala > > Cheers > > On Thu,

Re: How RDD lineage works

2015-07-30 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/CheckpointSuite.scala Cheers On Thu, Jul 30, 2015 at 7:39 PM, bit1...@163.com wrote: > Hi, > > I don't get a good understanding how RDD lineage works, so I would ask > whether spark provides a unit test in the code base to illustrate h

Losing files in HDFS after creating a Spark SQL table

2015-07-30 Thread Ron Gonzalez
Hi, after I create a table in Spark SQL and LOAD INFILE an HDFS file into it, the file is no longer visible when I do hadoop fs -ls. Is this expected? Thanks, Ron - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org F
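If the table was populated with Hive's LOAD DATA INPATH, the disappearance is expected: LOAD moves the HDFS file into the table's warehouse directory rather than copying it. A minimal sketch, assuming a HiveContext and hypothetical paths:

    hiveContext.sql("CREATE TABLE events (id INT, name STRING)")
    // LOAD DATA INPATH moves (not copies) the file into the Hive warehouse:
    hiveContext.sql("LOAD DATA INPATH '/data/events.csv' INTO TABLE events")
    // `hadoop fs -ls /data` no longer shows events.csv; the data now lives
    // under the table's location, e.g. /user/hive/warehouse/events/ (path assumed).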

RE: Heatmap with Spark Streaming

2015-07-30 Thread UMESH CHAUDHARY
Thanks for the valuable suggestion. I also started with a JAX-RS RESTful service with Angular. Since Play can help a lot in my scenario, I would prefer it along with Angular. Is this combination fine compared to D3?

Re: Problem submitting a .py script against a standalone cluster.

2015-07-30 Thread Anh Hong
You might want to run spark-submit with option "--deploy-mode cluster" On Thursday, July 30, 2015 7:24 PM, Marcelo Vanzin wrote: Can you share the part of the code in your script where you create the SparkContext instance? On Thu, Jul 30, 2015 at 7:19 PM, fordfarline wrote: Hi A

How RDD lineage works

2015-07-30 Thread bit1...@163.com
Hi, I don't have a good understanding of how RDD lineage works, so I would like to ask whether Spark provides a unit test in the code base that illustrates how RDD lineage works. If there is, what's the class name? Thanks! bit1...@163.com

Re: Problem submitting a .py script against a standalone cluster.

2015-07-30 Thread Marcelo Vanzin
Can you share the part of the code in your script where you create the SparkContext instance? On Thu, Jul 30, 2015 at 7:19 PM, fordfarline wrote: > Hi All, > > I`m having an issue when lanching an app (python) against a stand alone > cluster, but runs in local, as it doesn't reach the cluster. >

Problem submitting a .py script against a standalone cluster.

2015-07-30 Thread fordfarline
Hi All, I'm having an issue when launching an app (Python) against a standalone cluster: it runs in local mode but doesn't reach the cluster. It's the first time I've tried the cluster; in local mode it works OK. I did this: -> /home/user/Spark/spark-1.3.0-bin-hadoop2.4/sbin/start-all.sh # Master and worke

RE: Heatmap with Spark Streaming

2015-07-30 Thread Mohammed Guller
Umesh, You can create a web service in any of the languages supported by Spark and stream the result from this web service to your D3-based client using WebSocket or Server-Sent Events. For example, you can create a web service using Play. This app will integrate with Spark streaming in the back

How do I specify the data types in a DF

2015-07-30 Thread afarahat
Hello, I have a simple file of mobile IDFAs. They look like gregconv = ['00013FEE-7561-47F3-95BC-CA18D20BCF78', '000D9B97-2B54-4B80-AAA1-C1CB42CFBF3A', '000F9E1F-BC7E-47E1-BF68-C68F6D987B96'] I am trying to make this RDD into a DataFrame ConvRecord = Row("IDFA") gregconvdf = gregconv.map(lambda

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
Dear Michael, dear all, distinguishing those records that have a match in mapping from those that don't is the crucial point. Record(x : Int, a: String) Mapping(x: Int, y: Int) Thus Record(1, "hello") Record(2, "bob") Mapping(2, 5) yield (2, "bob", 5) on an inner join. BUT I'm also interested

Re: Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Tathagata Das
For the first time it needs to list them. After that the list should be cached by the file stream implementation (as far as I remember). On Thu, Jul 30, 2015 at 3:55 PM, Brandon White wrote: > Is this a known bottleneck for Spark Streaming textFileStream? Does it > need to list all the current

RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
If anyone's curious, the issue here is that I was using version 1.2.4 of the DataStax Spark Cassandra connector, rather than the 1.4.0-M1 pre-release. 1.2.4 doesn't fully support data frames, and it's presumably still only experimental in 1.4.0-M1. Ben From: Benjamin Ross Sent: Thursda
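A hypothetical build.sbt line pinning the pre-release connector (coordinates assumed from the DataStax artifacts, not quoted from the thread):

    // build.sbt -- match the connector version to your Spark version
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M1"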

Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Brandon White
Is this a known bottleneck for Spark Streaming textFileStream? Does it need to list all the current files in a directory before it gets the new files? Say I have 500k files in a directory; does it list them all in order to get the new files?

Re: Problems with JobScheduler

2015-07-30 Thread Tathagata Das
Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and therefore taking 60 seconds. You need to set the rate limits for that. On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger wrote: > If you don't set it, there is no maximum rate, it will get everything from > the
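A minimal sketch of the rate cap for the Kafka direct stream; the value is a placeholder to tune against your actual throughput:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      // Cap messages consumed per Kafka partition per second, so the first
      // batch doesn't try to process the entire backlog in one go:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")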

Parquet SaveMode.Append Trouble.

2015-07-30 Thread satyajit vegesna
Hi, I am new to Spark and Parquet files. Below is what I am trying to do in spark-shell: val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") I have also tried the command below: val df=sqlContext.read.format("parquet").load("/data/LM/Parquet/Segment/pages/
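For appending to an existing Parquet dataset with the 1.4 writer API, a minimal sketch (paths are placeholders, not the poster's):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.read.parquet("/data/input_pages")          // hypothetical input
    df.write.mode(SaveMode.Append).parquet("/data/output_pages")   // appends new files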

Re: [Parquet + Dataframes] Column names with spaces

2015-07-30 Thread Michael Armbrust
You can't use these names due to limitations in Parquet (and the library itself would silently generate corrupt files that can't be read, hence the error we throw). You can alias a column by df.select(df("old").alias("new")), which is essentially what withColumnRenamed does. Alias in this case mean
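A sketch of renaming such a column before writing, using the attribute name from the error in the original post (the target name and output path are illustrative):

    // Rename before writing to Parquet; withColumnRenamed is equivalent:
    val cleaned = df.select(df("Name with Space").alias("name_with_space"))
    cleaned.write.parquet("/tmp/cleaned.parquet")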

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
Perhaps I'm missing what you are trying to accomplish, but if you'd like to avoid the null values do an inner join instead of an outer join. Additionally, I'm confused about how the result of joinedDF.filter(joinedDF("y").isNotNull).show still contains null values in the column y. This doesn't re

Re: unsubscribe

2015-07-30 Thread Brandon White
https://www.youtube.com/watch?v=JncgoPKklVE On Thu, Jul 30, 2015 at 1:30 PM, wrote: > > > -- > > This message is for the designated recipient only and may contain > privileged, proprietary, or otherwise confidential information. If you have > received it in error, ple

unsubscribe

2015-07-30 Thread ziqiu.li
This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you

RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
I'm submitting the application this way: spark-submit test-2.0.5-SNAPSHOT-jar-with-dependencies.jar I've confirmed that org.apache.spark.sql.cassandra and org.apache.cassandra classes are in the jar. Apologies for this relatively newbie question - I'm still new to both spark and scala. Thanks,

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Ted Yu
Looking at: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds Maven artifacts should be here: https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/ though the jars were dated July 16th. FY

Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
Hey all, I'm running what should be a very straight-forward application of the Cassandra sql connector, and I'm getting an error: Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.cassandra at scala.sys.package$.error(packag

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Ted, The problem is that I don't know if the build uses the commits that happened on the same day, or whether it might be built from Jul 15th commits. Just a thought: it might be possible to replace "SNAPSHOT" with the git commit hash in the filename, so people will know which commit it is based on.

Cast error in DataFrame/RDD when doing group by with a case class

2015-07-30 Thread Rishabh Bhardwaj
Hi, I have just started learning DataFrames in Spark and encountered the following error. I am creating the following: *case class A(a1:String,a2:String,a3:String)* *case class B(b1:String,b2:String,b3:String)* *case class C(key:A,value:Seq[B])* Now I have to build a DF with the structure ("key" :{..},"value":{.

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
Dear Michael, dear all, motivation: object OtherEntities { case class Record( x:Int, a: String) case class Mapping( x: Int, y: Int ) val records = Seq( Record(1, "hello"), Record(2, "bob")) val mappings = Seq( Mapping(2, 5) ) } Now I want to perform a *left outer join* on records and
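A minimal sketch of the join and the null-based split, following the definitions above and assuming a sqlContext is in scope:

    import OtherEntities._

    val recordsDF  = sqlContext.createDataFrame(records)
    val mappingsDF = sqlContext.createDataFrame(mappings)

    val joined = recordsDF.join(mappingsDF, recordsDF("x") === mappingsDF("x"), "left_outer")
    joined.filter(mappingsDF("y").isNotNull).show()  // matched:   (2, "bob", 2, 5)
    joined.filter(mappingsDF("y").isNull).show()     // unmatched: (1, "hello", null, null)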

[Parquet + Dataframes] Column names with spaces

2015-07-30 Thread angelini
Hi all, Our data has lots of human-readable column names (names that include spaces). Is it possible to use these with Parquet and DataFrames? When I try to write the DataFrame I get the following error (I am using PySpark): `AnalysisException: Attribute name "Name with Space" contains invalid

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
We don't yet update nullability information based on predicates, as we don't actually leverage this information in many places yet. Why do you want to update the schema? On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 wrote: > Hi all, > > 1. *Columns in dataframes can be nullable and not nullabl

Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread martinibus77
Hi all, 1. *Columns in dataframes can be nullable and not nullable. Having a nullable column of Doubles, I can use the following Scala code to filter all "non-null" rows:* val df = ... // some code that creates a DataFrame df.filter( df("columnname").isNotNull() )

Re: apache-spark 1.3.0 and yarn integration and spring-boot as a container

2015-07-30 Thread Steve Loughran
you need to fix your configuration so that the resource manager hostname/URL is set... that address is the "listen on any port" address. On 30 Jul 2015, at 10:47, Nirav Patel wrote: 15/07/29 11:19:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:

apache-spark 1.3.0 and yarn integration and spring-boot as a container

2015-07-30 Thread Nirav Patel
Hi, I was running a Spark application as a query service (much like spark-shell but within my servlet container provided by spring-boot) with Spark 1.0.2 in standalone mode. Now, after upgrading to Spark 1.3.1 and trying to use YARN instead of a standalone cluster, things are going south for me. I created u

TFIDF Transformation

2015-07-30 Thread hans ziqiu li
Hello Spark users! I am having some trouble with TF-IDF in MLlib and was wondering if anyone can point me in the right direction. The data ingestion and the initial term frequency count code taken from the example work fine (I am using the first example from this page: https://spark.apache.o
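For reference, a minimal TF-IDF sketch along the lines of the MLlib example, with a hypothetical input path:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // One whitespace-tokenized document per line (input path is a placeholder):
    val documents: RDD[Seq[String]] = sc.textFile("/data/docs.txt").map(_.split(" ").toSeq)

    val tf: RDD[Vector] = new HashingTF().transform(documents)
    tf.cache()  // IDF passes over the data twice: once to fit, once to transform
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)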

Re: HiveQL to SparkSQL

2015-07-30 Thread Bigdata techguy
Thanks Jorn for the response and for the pointer questions to Hive optimization tips. I believe I have done the possible & applicable things to improve hive query performance including but not limited to - running on TEZ, using partitioning, bucketing, using explain to make sure partition pruning

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-07-30 Thread Bryan Cutler
Hi Praveen, In MLLib, the major difference is that RandomForestClassificationModel makes use of a newer API which utilizes ML pipelines. I can't say for certain if they will produce the same exact result for a given dataset, but I believe they should. Bryan On Wed, Jul 29, 2015 at 12:14 PM, pra

Re: How to keep Spark executors from getting lost when using YARN client mode?

2015-07-30 Thread Ashwin Giridharan
What is your cluster configuration (size and resources)? If you do not have enough resources, then your executor will not run. Moreover, allocating 8 cores to an executor is too much. If you have a cluster with four nodes running NodeManagers, each equipped with 4 cores and 8GB of memory, then a
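A conservative sizing sketch for nodes like those described here (values are illustrative; the same settings map to --num-executors, --executor-cores and --executor-memory on spark-submit):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "2")    // leave cores for the OS/NodeManager
      .set("spark.executor.memory", "4g")  // memory + overhead must fit in the 8GB container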

Re: How to keep Spark executors from getting lost when using YARN client mode?

2015-07-30 Thread Ted Yu
See past thread on this topic: http://search-hadoop.com/m/q3RTt0NZXV1cC6q02 On Thu, Jul 30, 2015 at 9:08 AM, unk1102 wrote: > Hi, I have one Spark job which runs fine locally with less data, but when I > schedule it on YARN to execute, I keep getting the following ERROR, and > slowly all executo

Re: sc.parallelize to work more like a producer/consumer?

2015-07-30 Thread Kostas Kougios
there is a workaround: sc.parallelize(items, items.size / 2). This way each executor will get a batch of 2 items at a time, simulating a producer-consumer. With /4 it will get 4 items. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sc-parallelise-to-work-

How to keep Spark executors from getting lost when using YARN client mode?

2015-07-30 Thread unk1102
Hi, I have one Spark job which runs fine locally with less data, but when I schedule it on YARN to execute, I keep getting the following ERROR, and slowly all executors get removed from the UI and my job fails 15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc cl

Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Konstantinos Kougios
Yes, thanks, that sorted out the issue. On 30/07/15 09:26, Akhil Das wrote: sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios wrote:

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
If you don't set it, there is no maximum rate, it will get everything from the end of the last batch to the maximum available offset On Thu, Jul 30, 2015 at 10:46 AM, Guillermo Ortiz wrote: > The difference is that one recives more data than the others two. I can > pass thought parameters the to

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
The difference is that one receives more data than the other two. I can pass the topics through parameters, so I could execute the code with one topic at a time and figure out which topic it is, although I guess it's the topic which gets more data. Anyway, those delays are pretty weird in

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Ted Yu
Jerry: See this recent thread: http://search-hadoop.com/m/q3RTtRygSv1PEdCr Cheers On Thu, Jul 30, 2015 at 8:10 AM, Ted Yu wrote: > The files were dated 16-Jul-2015 > Looks like nightly build either was not published, or published at a > different location. > > You can download spark-1.5.0-SNAPS

Re: Spark build/sbt assembly

2015-07-30 Thread Rahul Palamuttam
Hi Akhil, Yes, I did try to remove it, and I tried to build again. However, that jar keeps getting recreated whenever I run ./build/sbt assembly Thanks, Rahul P On Thu, Jul 30, 2015 at 12:38 AM, Akhil Das wrote: > Did you try removing this jar? build/sbt-launch-0.13.7.jar > > Thanks > Best Reg

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Ted Yu
The files were dated 16-Jul-2015 Looks like nightly build either was not published, or published at a different location. You can download spark-1.5.0-SNAPSHOT.tgz and binary-search for the commits made on Jul 16th. There may be other ways of determining the latest commit. Cheers On Thu, Jul 30,

How to register array class with Kryo in spark-defaults.conf

2015-07-30 Thread Wang, Ningjun (LNG-NPV)
I register my class with Kryo in spark-defaults.conf as follows spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryo.registrationRequired true spark.kryo.classesToRegister ltn.analytics.es.EsDoc But I got the following excep
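With registrationRequired, the corresponding array class must be registered too; in spark-defaults.conf it would have to appear under its JVM name ([Lltn.analytics.es.EsDoc;). A sketch of the programmatic alternative, using SparkConf.registerKryoClasses (Spark 1.2+):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
    conf.registerKryoClasses(Array(
      classOf[ltn.analytics.es.EsDoc],
      classOf[Array[ltn.analytics.es.EsDoc]]  // the array class Kryo complains about
    ))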

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/? I downloaded the build but I couldn't find the information related to it. Thank you! Best Regards, Jerry

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
If the jobs are running on different topicpartitions, what's different about them? Is one of them 120x the throughput of the other, for instance? You should be able to eliminate cluster config as a difference by running the same topic partition on the different clusters and comparing the results.

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have three topics with one partition each, so each job reads from one topic. 2015-07-30 16:20 GMT+02:00 Cody Koeninger : > Just so I'm clear, the difference in timing you're talking about is this: > > 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at > MetricsSpark.scal

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Umesh Kacha
Hi Cody, sorry, my bad, you were right: there was a typo in topicSet. When I corrected the typo in topicSet it started working. Thanks a lot. Regards On Thu, Jul 30, 2015 at 7:43 PM, Cody Koeninger wrote: > Can you post the code including the values of kafkaParams and topicSet, > ideally the relevant o

Re: Upgrade of Spark-Streaming application

2015-07-30 Thread Cody Koeninger
You can't use checkpoints across code upgrades. That may or may not change in the future, but for now that's a limitation of spark checkpoints (regardless of whether you're using Kafka). Some options: - Start up the new job on a different cluster, then kill the old job once it's caught up to whe

Re: Lost task - connection closed

2015-07-30 Thread firemonk9
I am getting the same error. Any resolution on this issue? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-task-connection-closed-tp21361p24082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at MetricsSpark.scala:67, took 60.391761 s 15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at MetricsSpark.scala:67, took 0.531323 s Are th

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
Can you post the code including the values of kafkaParams and topicSet, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha wrote: > Hi thanks for the response. Like I already mentioned in the question kafka > topic is valid and it has

Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
Thanks for the information; this fixed the issue. The issue was in the Spark master memory: when I manually specify 1G for the master, it starts working. On 30 July 2015 at 14:26, Shao, Saisai wrote: > You'd better also check the log of nodemanager, sometimes because your > memory usage exceeds the limit of Yarn c

Re: Running Spark on user-provided Hadoop installation

2015-07-30 Thread Ted Yu
Herman: For "Pre-built with user-provided Hadoop", spark-1.4.1-bin-hadoop2.6.tgz, e.g., uses hadoop-2.6 profile which defines versions of projects Spark depends on. Hadoop cluster is used to provide storage (hdfs) and resource management (YARN). For the latter, please see: https://spark.apache.org

Re: Connection closed/reset by peers error

2015-07-30 Thread firemonk9
I am having the same issue. Have you found any resolution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connection-closed-reset-by-peers-error-tp21459p24081.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Twitter Connector-Spark Streaming

2015-07-30 Thread Akhil Das
You can create a custom receiver and then, inside it, write your own piece of code to receive data, filter it, etc. before handing it to Spark. Thanks Best Regards On Thu, Jul 30, 2015 at 6:49 PM, Sadaf Khan wrote: > okay :) > > then is there anyway to fetch the tweets specific to my accoun

Re: help plz! how to use zipWithIndex on each subset of an RDD

2015-07-30 Thread askformore
Hi @rok, thanks I got it -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-plz-how-to-use-zipWithIndex-to-each-subset-of-a-RDD-tp24071p24080.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Spark and Speech Recognition

2015-07-30 Thread Peter Wolf
Oh... That was embarrassingly easy! Thank you that was exactly the understanding of partitions that I needed. P On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball < si...@simonellistonball.com> wrote: > You might also want to consider broadcasting the models to ensure you get > one instance

Python version collision

2015-07-30 Thread Javier Domingo Cansino
Hi, I find the documentation about the configuration options rather confusing. There are a lot of files where it is not too clear what to modify, for example spark-env vs spark-defaults. I am getting an error from a Python version collision: File "/root/spark/python/lib/pyspark.zip/pyspark/wo

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I read about the maxRatePerPartition parameter; I haven't set it. Could that be the problem?? Although this wouldn't explain why it doesn't work in one of the clusters. 2015-07-30 14:47 GMT+02:00 Guillermo Ortiz : > They just share Kafka; the rest of the resources are independent. I tried

Re: How to set log level in spark-submit ?

2015-07-30 Thread Dean Wampler
Did you use an absolute path in $path_to_file? I just tried this with spark-shell v1.4.1 and it worked for me. If the URL is wrong, you should see an error message from log4j that it can't find the file. For Windows it would be something like file:/c:/path/to/file, I believe. Dean Wampler, Ph.D. A

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
They just share Kafka; the rest of the resources are independent. I tried stopping one cluster and running only the cluster that isn't working, but the same thing happens. 2015-07-30 14:41 GMT+02:00 Guillermo Ortiz : > I have a problem with the JobScheduler. I have executed the same code in > two clusters.

Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have a problem with the JobScheduler. I have executed the same code in two clusters. I read from three topics in Kafka with DirectStream, so I have three tasks. I have checked YARN and there are no other jobs launched. In the cluster where I have trouble I got these logs: 15/07/30 14:32:58 INFO TaskSet

Re: Twitter Connector-Spark Streaming

2015-07-30 Thread Akhil Das
TwitterUtils.createStream takes a third argument which is a list of filters; once you provide these, it will only fetch matching tweets. Thanks Best Regards On Thu, Jul 30, 2015 at 4:19 PM, Sadaf wrote: > Hi, I am writing a Twitter connector using Spark Streaming, but it fetches > random tweets. > Is t
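A sketch of that third argument, assuming `auth` is the Option[twitter4j.auth.Authorization] returned by managingCredentials() and the filter strings are placeholders:

    import org.apache.spark.streaming.twitter.TwitterUtils

    val filters = Seq("@some_account", "#some_hashtag")  // hypothetical filters
    val tweets = TwitterUtils.createStream(ssc, auth, filters)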

Re: Graceful shutdown for Spark Streaming

2015-07-30 Thread anshu shukla
Yes, I was doing the same. If you mean that this is the correct way to do it, then I will verify it once more in my case. On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das wrote: > How is sleep not working? Are you doing > > streamingContext.start() > Thread.sleep(xxx) > streamingContext.stop() > >
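For reference, a minimal sketch of that pattern with a graceful stop so in-flight batches finish (the sleep duration is a placeholder):

    ssc.start()
    Thread.sleep(60000)  // or block on ssc.awaitTermination() with an external stop signal
    ssc.stop(stopSparkContext = true, stopGracefully = true)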

Re: Spark Interview Questions

2015-07-30 Thread Sandeep Giri
I have prepared some interview questions: http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-1 http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2 Please provide your feedback. On Wed, Jul 29, 2015, 23:43 Pedro Rodriguez wrote: > You might look at the edx

Error in Spark Streaming after executing for a while.

2015-07-30 Thread Guillermo Ortiz
I'm executing a job with Spark Streaming and get this error every time once the job has been running for a while (usually hours or days). I have no idea why it's happening. 15/07/30 13:02:14 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTarge

Twitter Connector-Spark Streaming

2015-07-30 Thread Sadaf
Hi, I am writing a Twitter connector using Spark Streaming, but it fetches random tweets. Is there any way to receive the tweets of a particular account? I made an app on Twitter and used the credentials as given below. def managingCredentials(): Option[twitter4j.auth.Authorization]= {

Re: Spark and Speech Recognition

2015-07-30 Thread Simon Elliston Ball
You might also want to consider broadcasting the models to ensure you get one instance shared across cores in each machine; otherwise the model will be serialised to each task and you'll get a copy per executor (roughly one per core in this instance) Simon Sent from my iPhone > On 30 Jul 2015, at 10

Re: How to set log level in spark-submit ?

2015-07-30 Thread Alexander Krasheninnikov
I saw such example in docs: --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file://$path_to_file" but, unfortunately, it does not work for me. On 30.07.2015 05:12, canan chen wrote: Yes, that should work. What I mean is is there any option in spark-submit command that I can specify

How to perform basic statistics on a JSON file to explore my numeric and non-numeric variables?

2015-07-30 Thread SparknewUser
I've imported a JSON file which has this schema: sqlContext.read.json("filename").printSchema root |-- COL: long (nullable = true) |-- DATA: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- Crate: string (nullable = true)
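For the numeric fields, describe() gives count/mean/stddev/min/max; for non-numeric ones, a groupBy count is a common starting point. A sketch against the schema above (nested array fields would need to be flattened first):

    val df = sqlContext.read.json("filename")
    df.describe("COL").show()         // basic stats for the numeric column
    df.groupBy("COL").count().show()  // value frequencies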

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can integrate it with any language (like php) and use ajax calls to update the charts. Thanks Best Regards On Thu, Jul 30, 2015 at 2:11 PM, UMESH CHAUDHARY wrote: > Thanks For the suggestion Akhil! > I looked at https://github.com/mbostock/d3/wiki/Gallery to know more > about d3, all exampl

Upgrade of Spark-Streaming application

2015-07-30 Thread Nicola Ferraro
Hi, I've read about the recent updates to the Spark Streaming integration with Kafka (I refer to the new approach without receivers). In the new approach, metadata is persisted in checkpoint folders on HDFS so that the Spark Streaming context can be recreated in case of failures. This means that the

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Juan Rodríguez Hortalá
Hi, Just my two cents. I understand your problem is that you have messages with the same key in two different DStreams. What I would do is make a union of all the DStreams with StreamingContext.union or several calls to DStream.union, and then I would create a pair dst
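A sketch of that union-then-deduplicate idea, assuming three hypothetical DStreams of (key, value) pairs:

    val unioned = ssc.union(Seq(stream1, stream2, stream3))  // placeholder DStream[(K, V)]s
    val deduped = unioned.transform(_.distinct())             // drop exact duplicates per batch
    // or keep a single record per key within each batch:
    val onePerKey = unioned.reduceByKey((a, b) => a)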

RE: Spark on YARN

2015-07-30 Thread Shao, Saisai
You’d better also check the log of nodemanager, sometimes because your memory usage exceeds the limit of Yarn container’s configuration. I’ve met similar problem before, here is the warning log in nodemanager: 2015-07-07 17:06:07,141 WARN org.apache.hadoop.yarn.server.nodemanager.containermanag

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Priya Ch
Hi All, Can someone throw some insight on this? On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch wrote: > > > Hi TD, > > Thanks for the info. I have the scenario like this. > > I am reading the data from kafka topic. Let's say kafka has 3 partitions > for the topic. In my streaming application, I woul

Running Spark on user-provided Hadoop installation

2015-07-30 Thread hermansc
Hi, I want to run Spark, and more specifically the "Pre-built with user-provided Hadoop" version from the downloads page, but I can't find any documentation on how to connect the two components together (namely Spark and Hadoop). I've had some success in setting SPARK_CLASSPATH to my hadoop dist

Re: Heatmap with Spark Streaming

2015-07-30 Thread UMESH CHAUDHARY
Thanks for the suggestion, Akhil! I looked at https://github.com/mbostock/d3/wiki/Gallery to learn more about D3. All the examples described there are on static data; how can we update our heat map from fresh data if we store it in HBase or MySQL? I mean, do we need to query back and forth for it? Is

Re: Does spark-submit support file transferring from local to cluster?

2015-07-30 Thread Jeff Zhang
Spark is a computing engine; I don't think it is suitable for Spark to do large input file transfers. On Thu, Jul 30, 2015 at 4:18 PM, Akhil Das wrote: > You can achieve this with a Network File System, create an NFS on the > cluster and mount it to your local machine. > > Thanks > Best R

Re: Spark on YARN

2015-07-30 Thread Jeff Zhang
>> 15/07/30 12:13:35 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM The AM was killed somehow, maybe due to preemption. Does it always happen? The ResourceManager log would be helpful. On Thu, Jul 30, 2015 at 4:17 PM, Jeetendra Gangele wrote: > I can't see the application logs here. All the

Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Akhil Das
sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios < kostas.koug...@googlemail.com> wrote: > Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all > executo

Re: Does spark-submit support file transferring from local to cluster?

2015-07-30 Thread Akhil Das
You can achieve this with a Network File System, create an NFS on the cluster and mount it to your local machine. Thanks Best Regards On Wed, Jul 29, 2015 at 9:32 AM, Anh Hong wrote: > Hi, > > I'm using spark-submit cluster mode to submit a job from local to spark > cluster. There are input f

Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
I can't see the application logs here. All the logs are going into stderr. Can anybody help here? On 30 July 2015 at 12:21, Jeetendra Gangele wrote: > I am running below command this is default spark PI program but this is > not running all the log are going in stderr but at the terminal job is

Re: Spark and Speech Recognition

2015-07-30 Thread Akhil Das
Like this? val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speachRecognizer(urls)) Let 24 be the total number of cores that you have on all the workers. Thanks Best Regards On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf wrote: > Hello, I am writing a Spark application

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can easily push data to an intermediate storage from spark streaming (like HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3 js. Thanks Best Regards On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY wrote: > I have just started using Spark Streaming and done few POCs. It i

Re: streaming issue

2015-07-30 Thread Akhil Das
What operation are you doing with streaming? Also, can you look in the datanode logs and see what's going on? Thanks Best Regards On Tue, Jul 28, 2015 at 8:18 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi, > I got an error when running spark streaming as below . > > java.lang

Re: Spark SQL Error

2015-07-30 Thread Akhil Das
It seems to be an issue with the ES connector: https://github.com/elastic/elasticsearch-hadoop/issues/482 Thanks Best Regards On Tue, Jul 28, 2015 at 6:14 AM, An Tran wrote: > Hello all, > > I am currently having an error with Spark SQL accessing Elasticsearch using > the Elasticsearch Spark integration. Bel

Re: Spark Streaming Cannot Work On Next Interval

2015-07-30 Thread Himanshu Mehra
Hi Ferriad, Can you share the code? It's hard to judge the problem with this little information. Thank you Regards Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Cannot-Work-On-Next-Interval-tp24045p24075.html Sent from th

Re: CPU Parallelization not being used (local mode)

2015-07-30 Thread Akhil Das
When you give master as *local[1]*, it occupies a single thread which will probably run on a single core; you can give *local[*]* to allocate the total number of cores that you have. Thanks Best Regards On Tue, Jul 28, 2015 at 1:27 AM, wrote: > Hi all, > > would like some insight. I am currently comput

Re: Spark build/sbt assembly

2015-07-30 Thread Akhil Das
Did you try removing this jar? build/sbt-launch-0.13.7.jar Thanks Best Regards On Tue, Jul 28, 2015 at 12:08 AM, Rahul Palamuttam wrote: > Hi All, > > I hope this is the right place to post troubleshooting questions. > I've been following the install instructions and I get the following error >
