Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have a problem with the JobScheduler. I have executed the same code on two clusters. I read from three topics in Kafka with DirectStream, so I have three tasks. I have checked YARN and there are no other jobs launched. On the cluster where I have trouble I get these logs: 15/07/30 14:32:58 INFO

Re: Spark and Speech Recognition

2015-07-30 Thread Peter Wolf
Oh... That was embarrassingly easy! Thank you, that was exactly the understanding of partitions that I needed. P On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball si...@simonellistonball.com wrote: You might also want to consider broadcasting the models to ensure you get one instance

Re: Running Spark on user-provided Hadoop installation

2015-07-30 Thread Ted Yu
Herman: For "Pre-built with user-provided Hadoop": spark-1.4.1-bin-hadoop2.6.tgz, e.g., uses the hadoop-2.6 profile, which defines the versions of projects Spark depends on. The Hadoop cluster is used to provide storage (HDFS) and resource management (YARN). For the latter, please see:

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I read about the maxRatePerPartition parameter; I haven't set it. Could that be the problem? Although this wouldn't explain why it doesn't work in one of the clusters. 2015-07-30 14:47 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: They just share the Kafka cluster; the rest of the resources are

Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
Thanks for the information, this fixed the issue. The issue was with the Spark master memory; when I manually specified 1G for the master, it started working. On 30 July 2015 at 14:26, Shao, Saisai saisai.s...@intel.com wrote: You'd better also check the log of the nodemanager; sometimes it is because your memory usage exceeds

Re: Graceful shutdown for Spark Streaming

2015-07-30 Thread anshu shukla
Yes, I was doing the same. If you mean that this is the correct way to do it, then I will verify it once more in my case. On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com wrote: How is sleep not working? Are you doing streamingContext.start(); Thread.sleep(xxx)

Re: How to set log level in spark-submit ?

2015-07-30 Thread Dean Wampler
Did you use an absolute path in $path_to_file? I just tried this with spark-shell v1.4.1 and it worked for me. If the URL is wrong, you should see an error message from log4j that it can't find the file. For Windows it would be something like file:/c:/path/to/file, I believe. Dean Wampler, Ph.D.
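
A minimal sketch of that, assuming a log4j.properties at an illustrative absolute path (the class and jar names are placeholders):

    spark-submit \
      --class com.example.MyApp \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/conf/log4j-debug.properties" \
      my-app.jar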

Python version collision

2015-07-30 Thread Javier Domingo Cansino
Hi, I find the documentation about the configuration options rather confusing. There are a lot of files, and it is not too clear which one to modify: for example, spark-env vs spark-defaults. I am getting an error with a Python version collision: File

Re: Spark Interview Questions

2015-07-30 Thread Sandeep Giri
I have prepared some interview questions: http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-1 http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2 Please provide your feedback. On Wed, Jul 29, 2015, 23:43 Pedro Rodriguez ski.rodrig...@gmail.com wrote: You

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-30 Thread askformore
Hi @rok, thanks I got it -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-plz-how-to-use-zipWithIndex-to-each-subset-of-a-RDD-tp24071p24080.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
They just share the Kafka cluster; the rest of the resources are independent. I tried stopping one cluster and running only the cluster that isn't working, but the same thing happens. 2015-07-30 14:41 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: I have a problem with the JobScheduler. I have executed the same

Re: Twitter Connector-Spark Streaming

2015-07-30 Thread Akhil Das
You can create a custom receiver and then, inside it, write your own piece of code to receive the data, filter it, etc. before handing it to Spark. Thanks Best Regards On Thu, Jul 30, 2015 at 6:49 PM, Sadaf Khan sa...@platalytics.com wrote: okay :) then is there any way to fetch the tweets
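
A minimal sketch of such a receiver against Spark 1.x's Receiver API; fetchTweetsFor is a hypothetical placeholder for a twitter4j call that pulls one account's tweets:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class AccountTweetReceiver(account: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("Account Tweet Receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              // stand-in for a twitter4j lookup of one account's tweets
              fetchTweetsFor(account).foreach(store)  // store() hands data to Spark
              Thread.sleep(1000)
            }
          }
        }.start()
      }

      def onStop(): Unit = {}  // the polling thread exits via the isStopped() check

      private def fetchTweetsFor(a: String): Seq[String] = Seq.empty  // stub
    }

    // usage: val tweets = ssc.receiverStream(new AccountTweetReceiver("someAccount"))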

Re: Connection closed/reset by peers error

2015-07-30 Thread firemonk9
I am having the same issue. Have you found any resolution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connection-closed-reset-by-peers-error-tp21459p24081.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at MetricsSpark.scala:67, took 60.391761 s 15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at MetricsSpark.scala:67, took 0.531323 s Are

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have three topics with one partition each, so each job runs on one topic. 2015-07-30 16:20 GMT+02:00 Cody Koeninger c...@koeninger.org: Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/? I downloaded the build but I couldn't find the information related to it. Thank you! Best Regards,

Re: Lost task - connection closed

2015-07-30 Thread firemonk9
I am getting the same error. Any resolution on this issue? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-task-connection-closed-tp21361p24082.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Upgrade of Spark-Streaming application

2015-07-30 Thread Cody Koeninger
You can't use checkpoints across code upgrades. That may or may not change in the future, but for now that's a limitation of spark checkpoints (regardless of whether you're using Kafka). Some options: - Start up the new job on a different cluster, then kill the old job once it's caught up to

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
If the jobs are running on different topic-partitions, what's different about them? Is one of them 120x the throughput of the other, for instance? You should be able to eliminate cluster config as a difference by running the same topic partition on the different clusters and comparing the

Re: Spark build/sbt assembly

2015-07-30 Thread Rahul Palamuttam
Hi Akhil, Yes, I did try removing it and building again. However, that jar keeps getting recreated whenever I run ./build/sbt assembly. Thanks, Rahul P On Thu, Jul 30, 2015 at 12:38 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try removing this jar?

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Umesh Kacha
Hi Cody, sorry, my bad, you were right: there was a typo in topicSet. When I corrected the typo in topicSet it started working. Thanks a lot. Regards On Thu, Jul 30, 2015 at 7:43 PM, Cody Koeninger c...@koeninger.org wrote: Can you post the code including the values of kafkaParams and topicSet,

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
Can you post the code including the values of kafkaParams and topicSet, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks for the response. Like I already mentioned in the question kafka topic

How to register an array class with Kryo in spark-defaults.conf

2015-07-30 Thread Wang, Ningjun (LNG-NPV)
I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired true
spark.kryo.classesToRegister ltn.analytics.es.EsDoc

But I got the following
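
If the failure is about the corresponding array class (with registrationRequired, Kryo also demands Array[EsDoc] be registered), one workaround is to register it programmatically; this is a sketch, and whether the conf-file key also accepts the JVM binary name of an array class (e.g. [Lltn.analytics.es.EsDoc;) is an assumption worth testing:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // registerKryoClasses can name the array class directly
      .registerKryoClasses(Array(
        classOf[ltn.analytics.es.EsDoc],
        classOf[Array[ltn.analytics.es.EsDoc]]
      ))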

Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
Hey all, I'm running what should be a very straight-forward application of the Cassandra sql connector, and I'm getting an error: Exception in thread main java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.cassandra at

RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
I'm submitting the application this way: spark-submit test-2.0.5-SNAPSHOT-jar-with-dependencies.jar I've confirmed that the org.apache.spark.sql.cassandra and org.apache.cassandra classes are in the jar. Apologies for this relatively newbie question - I'm still new to both Spark and Scala.

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Ted Yu
Looking at: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds Maven artifacts should be here: https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/ though the jars were dated July 16th.

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
Perhaps I'm missing what you are trying to accomplish, but if you'd like to avoid the null values, do an inner join instead of an outer join. Additionally, I'm confused about how the result of joinedDF.filter(joinedDF("y").isNotNull).show still contains null values in the column y. This doesn't

Re: [Parquet + Dataframes] Column names with spaces

2015-07-30 Thread Michael Armbrust
You can't use these names due to limitations in Parquet (otherwise the library itself will silently generate corrupt files that can't be read, hence the error we throw). You can alias a column with df.select(df("old").alias("new")), which is essentially what withColumnRenamed does. Alias in this case means
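
A sketch of that rename-before-write pattern, assuming df is the DataFrame from the question and "Name with Space" the offending column (output path illustrative, Spark 1.4 writer API):

    import org.apache.spark.sql.functions.col

    val renamed = df.withColumnRenamed("Name with Space", "name_with_space")
    renamed.write.parquet("/tmp/cleaned")
    // alias back after reading if the original name is wanted in memory
    val restored = sqlContext.read.parquet("/tmp/cleaned")
      .select(col("name_with_space").alias("Name with Space"))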

Re: unsubscribe

2015-07-30 Thread Brandon White
https://www.youtube.com/watch?v=JncgoPKklVE On Thu, Jul 30, 2015 at 1:30 PM, ziqiu...@accenture.com wrote: -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received

unsubscribe

2015-07-30 Thread ziqiu.li
This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by

Re: Problems with JobScheduler

2015-07-30 Thread Tathagata Das
Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and therefore taking 60 seconds. You need to set the rate limits for that. On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote: If you don't set it, there is no maximum rate, it will get
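
A sketch of the rate limit being suggested (the 1000 records per second per partition figure is illustrative):

    val conf = new SparkConf()
      .setAppName("metrics")
      // cap each Kafka partition so the first batches don't try to
      // drain the topic's entire backlog in one go
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")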

How do I specify the data types in a DF

2015-07-30 Thread afarahat
Hello, I have a simple file of mobile IDFAs. They look like: gregconv = ['00013FEE-7561-47F3-95BC-CA18D20BCF78', '000D9B97-2B54-4B80-AAA1-C1CB42CFBF3A', '000F9E1F-BC7E-47E1-BF68-C68F6D987B96']. I am trying to make this RDD into a data frame: ConvRecord = Row("IDFA") gregconvdf = gregconv.map(lambda

Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Brandon White
Is this a known bottleneck for Spark Streaming's textFileStream? Does it need to list all the current files in a directory before it gets the new files? Say I have 500k files in a directory; does it list them all in order to get the new files?

Re: Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Tathagata Das
The first time, it needs to list them. After that the list should be cached by the file stream implementation (as far as I remember). On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com wrote: Is this a known bottleneck for Spark Streaming's textFileStream? Does it need

Parquet SaveMode.Append Trouble.

2015-07-30 Thread satyajit vegesna
Hi, I am new to using Spark and Parquet files. Below is what I am trying to do on spark-shell: val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") I have also tried the command below: val

RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
If anyone's curious, the issue here is that I was using version 1.2.4 of the DataStax Spark Cassandra connector, rather than the 1.4.0-M1 pre-release. 1.2.4 doesn't fully support data frames, and they're presumably still only experimental in 1.4.0-M1. Ben From: Benjamin Ross Sent:

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
Dear Michael, dear all, distinguishing those records that have a match in Mapping from those that don't is the crucial point. Record(x: Int, a: String), Mapping(x: Int, y: Int). Thus Record(1, "hello"), Record(2, "bob"), Mapping(2, 5) yield (2, "bob", 5) on an inner join. BUT I'm also interested in

RE: Heatmap with Spark Streaming

2015-07-30 Thread Mohammed Guller
Umesh, You can create a web-service in any of the languages supported by Spark and stream the result from this web-service to your D3-based client using Websocket or Server-Sent Events. For example, you can create a webservice using Play. This app will integrate with Spark streaming in the

Losing files in hdfs after creating spark sql table

2015-07-30 Thread Ron Gonzalez
Hi, After I create a table in Spark SQL and load an HDFS file into it, the file is no longer listed when I do hadoop fs -ls. Is this expected? Thanks, Ron - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
You have to read the original Spark paper to understand how RDD lineage works. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at: core/src/test/scala/org/apache/spark/CheckpointSuite.scala

How RDD lineage works

2015-07-30 Thread bit1...@163.com
Hi, I don't have a good understanding of how RDD lineage works, so I would like to ask whether Spark provides a unit test in the code base that illustrates it. If there is, what's the class name? Thanks! bit1...@163.com

Re: Problem submitting a .py script against a standalone cluster.

2015-07-30 Thread Marcelo Vanzin
Can you share the part of the code in your script where you create the SparkContext instance? On Thu, Jul 30, 2015 at 7:19 PM, fordfarline fordfarl...@gmail.com wrote: Hi All, I'm having an issue when launching an app (Python) against a standalone cluster, though it runs in local mode, as it doesn't

Re: How RDD lineage works

2015-07-30 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/CheckpointSuite.scala Cheers On Thu, Jul 30, 2015 at 7:39 PM, bit1...@163.com bit1...@163.com wrote: Hi, I don't have a good understanding of how RDD lineage works, so I would like to ask whether Spark provides a unit test in the code base that

Re: Problem submitting a .py script against a standalone cluster.

2015-07-30 Thread Anh Hong
You might want to run spark-submit with option --deploy-mode cluster On Thursday, July 30, 2015 7:24 PM, Marcelo Vanzin van...@cloudera.com wrote: Can you share the part of the code in your script where you create the SparkContext instance? On Thu, Jul 30, 2015 at 7:19 PM,

Re: Heatmap with Spark Streaming

2015-07-30 Thread Tathagata Das
I do suggest that the non-Spark-related discussion be taken to a different forum, as it does not directly contribute to the contents of this user list. On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Thanks for the valuable suggestion. I also started with

Re: Re: How RDD lineage works

2015-07-30 Thread bit1...@163.com
Thanks TD and Zhihong for the guidance. I will check it. bit1...@163.com From: Tathagata Das Date: 2015-07-31 12:27 To: Ted Yu CC: bit1...@163.com; user Subject: Re: How RDD lineage works You have to read the original Spark paper to understand how RDD lineage works.

Re: Re: How RDD lineage works

2015-07-30 Thread bit1...@163.com
The following is copied from the paper; it is related to RDD lineage. Is there a unit test that covers this scenario (RDD partition lost and recovered)? Thanks. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions: ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])). The issue has happened on the Kafka side, where my broker offsets go out of sync, or do not return

Re: Getting the number of slaves

2015-07-30 Thread Tathagata Das
To clarify, that is the number of executors requested by the SparkContext from the cluster manager. On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote: try sc.getConf.getInt("spark.executor.instances", 1) -- View this message in context:

Re: Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FailureSuite.scala This may help. On Thu, Jul 30, 2015 at 10:42 PM, bit1...@163.com bit1...@163.com wrote: The following is copied from the paper, is something related with rdd lineage. Is there a unit test that

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
Dear Michael, dear all, motivation:

object OtherEntities {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)
  val records  = Seq(Record(1, "hello"), Record(2, "bob"))
  val mappings = Seq(Mapping(2, 5))
}

Now I want to perform a left outer join on records and
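
A sketch of that join with the data above, assuming a sqlContext with case-class conversion ("left_outer" is the standard join type string in 1.4):

    import OtherEntities._

    val recordsDF  = sqlContext.createDataFrame(records)
    val mappingsDF = sqlContext.createDataFrame(mappings)

    val joined = recordsDF.join(mappingsDF, recordsDF("x") === mappingsDF("x"), "left_outer")

    val matched   = joined.filter(joined("y").isNotNull)  // (2, "bob", 2, 5)
    val unmatched = joined.filter(joined("y").isNull)     // (1, "hello", null, null)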

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Ted, The problem is that I don't know whether the build uses the commits that happened on the same day, or whether it might be built from the Jul 15th commits. Just a thought: it might be possible to replace SNAPSHOT with the git commit hash in the filename so people will know which commit it is based on.

Re: Spark SQL Error

2015-07-30 Thread Akhil Das
It seems to be an issue with the ES connector: https://github.com/elastic/elasticsearch-hadoop/issues/482 Thanks Best Regards On Tue, Jul 28, 2015 at 6:14 AM, An Tran tra...@gmail.com wrote: Hello all, I am currently having an error with Spark SQL accessing Elasticsearch using the Elasticsearch Spark

Re: streaming issue

2015-07-30 Thread Akhil Das
What operation are you doing with streaming? Also, can you look in the datanode logs and see what's going on? Thanks Best Regards On Tue, Jul 28, 2015 at 8:18 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi, I got an error when running Spark Streaming, as below.

Re: Spark and Speech Recognition

2015-07-30 Thread Akhil Das
Like this? val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speachRecognizer(urls)) Let 24 be the total number of cores that you have on all the workers. Thanks Best Regards On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote: Hello, I am writing a

Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
I can't see the application logs here. All the logs are going into stderr. Can anybody help? On 30 July 2015 at 12:21, Jeetendra Gangele gangele...@gmail.com wrote: I am running the command below; this is the default Spark Pi program, but it is not running. All the logs are going to stderr, but at

Re: Heatmap with Spark Streaming

2015-07-30 Thread UMESH CHAUDHARY
Thanks for the suggestion Akhil! I looked at https://github.com/mbostock/d3/wiki/Gallery to learn more about d3. All the examples described there are on static data; how can we update our heat map from changing data if we store it in HBase or MySQL? I mean, do we need to query back and forth for it? Is

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Priya Ch
Hi All, Can someone share insights on this? On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com wrote: Hi TD, Thanks for the info. I have a scenario like this: I am reading data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my

Re: Spark build/sbt assembly

2015-07-30 Thread Akhil Das
Did you try removing this jar? build/sbt-launch-0.13.7.jar Thanks Best Regards On Tue, Jul 28, 2015 at 12:08 AM, Rahul Palamuttam rahulpala...@gmail.com wrote: Hi All, I hope this is the right place to post troubleshooting questions. I've been following the install instructions and I get

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-30 Thread rok
zipWithIndex gives you global indices, which is not what you want. You'll want to use flatMap with a map function that iterates through each iterable and returns the (String, Int, String) tuple for each element. On Thu, Jul 30, 2015 at 4:13 AM, askformore [via Apache Spark User List]
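
A sketch of that flatMap approach with hypothetical (key, value) data; zipWithIndex on the local Iterable restarts at zero for every subset:

    val data = sc.parallelize(Seq(("a", "x1"), ("a", "x2"), ("b", "y1")))

    val indexed = data.groupByKey().flatMap { case (key, values) =>
      // local zipWithIndex, per group, unlike the global RDD.zipWithIndex
      values.zipWithIndex.map { case (v, i) => (key, i, v) }
    }
    // indexed: ("a", 0, "x1"), ("a", 1, "x2"), ("b", 0, "y1")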

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can easily push data to an intermediate storage from spark streaming (like HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3 js. Thanks Best Regards On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: I have just started using Spark Streaming and

Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Akhil Das
sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios kostas.koug...@googlemail.com wrote: Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all executors
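
For instance (64 standing in for the total number of cores):

    // without the second argument, parallelize falls back to spark.default.parallelism
    val rdd = sc.parallelize(1 to 512000, 64)
    println(rdd.partitions.length)  // 64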

Re: Spark on YARN

2015-07-30 Thread Jeff Zhang
15/07/30 12:13:35 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM. The AM is killed somehow, maybe due to preemption. Does it always happen? The resource manager log would be helpful. On Thu, Jul 30, 2015 at 4:17 PM, Jeetendra Gangele gangele...@gmail.com wrote: I can't see the application

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
The difference is that one receives more data than the other two. I can pass the topics through parameters, so I could execute the code with one topic at a time and figure out which topic it is, although I guess it's the topic that gets more data. Anyway, it's pretty weird, those delays in

How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread unk1102
Hi, I have one Spark job which runs fine locally with less data, but when I schedule it to execute on YARN I keep getting the following ERROR; slowly all executors get removed from the UI and my job fails. 15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc

Re: sc.parallelise to work more like a producer/consumer?

2015-07-30 Thread Kostas Kougios
There is a workaround: sc.parallelize(items, items.size / 2). This way each executor will get a batch of 2 items at a time, simulating a producer-consumer. With /4 it will get 4 items. -- View this message in context:

Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread martinibus77
Hi all, 1. Columns in dataframes can be nullable and not nullable. Having a nullable column of Doubles, I can use the following Scala code to filter all non-null rows:

val df = ... // some code that creates a DataFrame
df.filter(df("columnname").isNotNull)

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
We don't yet update nullability information based on predicates, as we don't actually leverage this information in many places yet. Why do you want to update the schema? On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 martin.se...@googlemail.com wrote: Hi all, 1. Columns in dataframes can be

TFIDF Transformation

2015-07-30 Thread ziqiu.li
Hello spark users, I hope your week is going fantastic! I am having some troubles with the TFIDF in MLlib and was wondering if anyone can point me to the right direction. The data ingestion and the initial term frequency count code taken from the example works fine (I am using the first

Spark on YARN

2015-07-30 Thread Jeetendra Gangele
I am running the command below; this is the default Spark Pi program, but it is not running. All the logs are going to stderr, but at the terminal the job reports success. I guess there is a connection issue and the job is not launching at all. /bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster

Is it a Spark serialization bug?

2015-07-30 Thread Subshiri S
Hi all, I have tried to use a lambda expression in a Spark task, and it throws a java.lang.IllegalArgumentException: Invalid lambda deserialization exception. This exception is thrown when there is code like transform(pRDD -> pRDD.map(t -> t._2)). The code snippet is below. JavaPairDStream<String, Integer>

Re: Graceful shutdown for Spark Streaming

2015-07-30 Thread Tathagata Das
How is sleep not working? Are you doing streamingContext.start(); Thread.sleep(xxx); streamingContext.stop()? On Wed, Jul 29, 2015 at 6:55 PM, anshu shukla anshushuk...@gmail.com wrote: If we want to stop the application after a fixed time period, how will it work? (How to give the duration in
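
A minimal sketch of that fixed-duration pattern (batch interval and run length are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("fixed-duration-streaming")
    val ssc  = new StreamingContext(conf, Seconds(1))
    // ... define your DStreams here ...
    ssc.start()
    Thread.sleep(60 * 1000L)  // run for a fixed window
    // stopGracefully = true lets queued and in-flight batches finish first
    ssc.stop(stopSparkContext = true, stopGracefully = true)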

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread Ashwin Giridharan
What is your cluster configuration (size and resources)? If you do not have enough resources, then your executor will not run. Moreover, allocating 8 cores to an executor is too much. If you have a cluster with four nodes running NodeManagers, each equipped with 4 cores and 8GB of memory, then
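
For a cluster like that, a sketch of more conservative submission flags (the numbers are illustrative, not a recommendation from the thread):

    spark-submit --master yarn-client \
      --num-executors 4 \
      --executor-cores 2 \
      --executor-memory 3g \
      --conf spark.yarn.executor.memoryOverhead=768 \
      your-app.jar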

Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Konstantinos Kougios
Yes, thanks, that sorted out the issue. On 30/07/15 09:26, Akhil Das wrote: sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios kostas.koug...@googlemail.com

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-07-30 Thread Bryan Cutler
Hi Praveen, In MLLib, the major difference is that RandomForestClassificationModel makes use of a newer API which utilizes ML pipelines. I can't say for certain if they will produce the same exact result for a given dataset, but I believe they should. Bryan On Wed, Jul 29, 2015 at 12:14 PM,

Re: HiveQL to SparkSQL

2015-07-30 Thread Bigdata techguy
Thanks Jorn for the response and for the pointers to Hive optimization tips. I believe I have done the applicable things to improve Hive query performance, including but not limited to running on TEZ, using partitioning, bucketing, and using explain to make sure partition pruning

Re: apache-spark 1.3.0 and yarn integration and spring-boot as a container

2015-07-30 Thread Steve Loughran
You need to fix your configuration so that the resource manager hostname/URL is set... that address there is the "listen on any port" path. On 30 Jul 2015, at 10:47, Nirav Patel npa...@xactlycorp.com wrote: 15/07/29 11:19:26 INFO client.RMProxy: Connecting to

[Parquet + Dataframes] Column names with spaces

2015-07-30 Thread angelini
Hi all, Our data has lots of human-readable column names (names that include spaces); is it possible to use these with Parquet and DataFrames? When I try to write the DataFrame I get the following error (I am using PySpark): AnalysisException: Attribute name "Name with Space" contains invalid

Cast Error DataFrame/RDD doing group by and case class

2015-07-30 Thread Rishabh Bhardwaj
Hi, I have just started learning DataFrames in Spark and encountered the following error. I am creating the following:

case class A(a1: String, a2: String, a3: String)
case class B(b1: String, b2: String, b3: String)
case class C(key: A, value: Seq[B])

Now I have to build a DF with structure (key: {..}, value: {..}

Re: Spark and Speech Recognition

2015-07-30 Thread Simon Elliston Ball
You might also want to consider broadcasting the models to ensure you get one instance shared across cores in each machine; otherwise the model will be serialised to each task and you'll get a copy per executor (roughly, per core in this instance). Simon Sent from my iPhone On 30 Jul 2015, at
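
A sketch of the broadcast pattern being suggested; loadModel and recognize are hypothetical placeholders for the speech-recognition calls:

    val model   = loadModel()          // large, read-only model built on the driver
    val modelBc = sc.broadcast(model)  // shipped once per executor, not per task

    val results = sc.textFile("/sigmoid/audio/data/", 24).mapPartitions { urls =>
      val m = modelBc.value            // fetched once per executor, then reused
      urls.map(url => recognize(m, url))
    }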

Twitter Connector-Spark Streaming

2015-07-30 Thread Sadaf
Hi. I am writing a Twitter connector using Spark Streaming, but it fetches random tweets. Is there any way to receive the tweets of a particular account? I made an app on Twitter and used the credentials as given below. def managingCredentials(): Option[twitter4j.auth.Authorization] = {

Error SparkStreaming after a while executing.

2015-07-30 Thread Guillermo Ortiz
I'm executing a job with Spark Streaming and I always get this error after the job has been executing for a while (usually hours or days). I have no idea why it's happening. 15/07/30 13:02:14 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception

RE: Spark on YARN

2015-07-30 Thread Shao, Saisai
You’d better also check the log of nodemanager, sometimes because your memory usage exceeds the limit of Yarn container’s configuration. I’ve met similar problem before, here is the warning log in nodemanager: 2015-07-07 17:06:07,141 WARN

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Juan Rodríguez Hortalá
Hi, just my two cents. I understand your problem is that you have messages with the same key in two different dstreams. What I would do is make a union of all the dstreams with StreamingContext.union or several calls to DStream.union, and then I would create a pair
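
A sketch of that union-then-key approach, assuming stream1..stream3 are the dstreams in question and extractKey is a hypothetical placeholder for whatever identifies a duplicate:

    val unioned = ssc.union(Seq(stream1, stream2, stream3))
    val deduped = unioned
      .map(rec => (extractKey(rec), rec))
      .reduceByKey((first, _) => first)  // keep one record per key per batch
      .map(_._2)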

Re: How to set log level in spark-submit ?

2015-07-30 Thread Alexander Krasheninnikov
I saw such an example in the docs: --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file://$path_to_file but, unfortunately, it does not work for me. On 30.07.2015 05:12, canan chen wrote: Yes, that should work. What I mean is: is there any option in the spark-submit command where I can specify

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can integrate it with any language (like PHP) and use Ajax calls to update the charts. Thanks Best Regards On Thu, Jul 30, 2015 at 2:11 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Thanks for the suggestion Akhil! I looked at https://github.com/mbostock/d3/wiki/Gallery to learn more

Upgrade of Spark-Streaming application

2015-07-30 Thread Nicola Ferraro
Hi, I've read about the recent updates to the Spark Streaming integration with Kafka (I refer to the new approach without receivers). In the new approach, metadata are persisted in checkpoint folders on HDFS so that the Spark Streaming context can be recreated in case of failures. This means that

How to perform basic statistics on a Json file to explore my numeric and non-numeric variables?

2015-07-30 Thread SparknewUser
I've imported a JSON file which has this schema: sqlContext.read.json(filename).printSchema

root
 |-- COL: long (nullable = true)
 |-- DATA: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Crate: string (nullable = true)
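
A sketch of a first pass over such a file; describe() (Spark 1.4+) covers numeric columns, and frequency counts are a common substitute for non-numeric ones (someStringCol is a placeholder):

    val df = sqlContext.read.json(filename)
    // count, mean, stddev, min, max for a numeric column
    df.describe("COL").show()
    // frequency counts for a non-numeric column
    df.groupBy("someStringCol").count().show()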