I have a problem with the JobScheduler. I have executed the same code in two
clusters. I read from three topics in Kafka with DirectStream, so I have
three tasks.
I have checked YARN and there are no additional jobs launched.
On the cluster where I have trouble I get these logs:
15/07/30 14:32:58 INFO
Oh... That was embarrassingly easy!
Thank you, that was exactly the understanding of partitions that I needed.
P
On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball
si...@simonellistonball.com wrote:
You might also want to consider broadcasting the models to ensure you get
one instance
Herman:
For "Pre-built with user-provided Hadoop": spark-1.4.1-bin-hadoop2.6.tgz,
e.g., uses the hadoop-2.6 profile, which defines the versions of the projects
Spark depends on.
The Hadoop cluster is used to provide storage (HDFS) and resource management
(YARN).
For the latter, please see:
I read about the maxRatePerPartition parameter; I haven't set it.
Could that be the problem? Although this wouldn't explain why it doesn't
work in only one of the clusters.
2015-07-30 14:47 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com:
They just share the kafka, the rest of resources are
Thanks for the information, this fixed the issue. The issue was with spark-master
memory; when I manually specified 1G for the master, it started working.
On 30 July 2015 at 14:26, Shao, Saisai saisai.s...@intel.com wrote:
You’d better also check the log of nodemanager, sometimes because your
memory usage exceeds
Yes, I was doing the same. If you mean that this is the correct way to do it,
then I will verify it once more in my case.
On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com wrote:
How is sleep not working? Are you doing
streamingContext.start()
Thread.sleep(xxx)
Did you use an absolute path in $path_to_file? I just tried this with
spark-shell v1.4.1 and it worked for me. If the URL is wrong, you should
see an error message from log4j that it can't find the file. For Windows it
would be something like file:/c:/path/to/file, I believe.
Dean Wampler, Ph.D.
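For reference, a sketch of what a working submission along those lines might look like on Linux; the absolute path, class name and jar are placeholders, not from this thread:
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///home/user/conf/log4j.properties" \
  --class com.example.MyApp my-app.jar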
Hi,
I find the documentation about the configuration options rather confusing.
There are a lot of files, and it's not clear which one to modify. For
example, spark-env vs spark-defaults.
I am getting an error from a Python version collision:
File
I have prepared some interview questions:
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-1
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2
Please provide your feedback.
On Wed, Jul 29, 2015, 23:43 Pedro Rodriguez ski.rodrig...@gmail.com wrote:
You
Hi @rok, thanks I got it
They just share the Kafka cluster; the rest of the resources are independent. I tried
to stop one cluster and execute only the one that isn't working, but the
same thing happens.
2015-07-30 14:41 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com:
I have some problem with the JobScheduler. I have executed same
You can create a custom receiver and then inside it you can write your own
piece of code to receive data, filter it, etc. before handing it to Spark.
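A minimal custom-receiver sketch along those lines; the socket source and the filtering rule are placeholders, not from the original thread:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class FilteringReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so onStart() returns immediately
    new Thread("Filtering Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the reading loop checks isStopped() and exits on its own */ }

  private def receive(): Unit = {
    val socket = new java.net.Socket(host, port)
    val reader = new java.io.BufferedReader(
      new java.io.InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      if (line.nonEmpty) store(line) // filter before handing data to Spark
      line = reader.readLine()
    }
    reader.close()
    socket.close()
  }
}

// Usage: ssc.receiverStream(new FilteringReceiver("localhost", 9999))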
Thanks
Best Regards
On Thu, Jul 30, 2015 at 6:49 PM, Sadaf Khan sa...@platalytics.com wrote:
Okay :)
Then is there any way to fetch the tweets
I am having the same issue. Have you found any resolution ?
Just so I'm clear, the difference in timing you're talking about is this:
15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at
MetricsSpark.scala:67, took 60.391761 s
15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at
MetricsSpark.scala:67, took 0.531323 s
Are
I have three topics with one partition each. So each job runs on
one topic.
2015-07-30 16:20 GMT+02:00 Cody Koeninger c...@koeninger.org:
Just so I'm clear, the difference in timing you're talking about is this:
15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at
Hi Spark users and developers,
I wonder which git commit was used to build the latest master-nightly build
found at:
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/?
I downloaded the build but I couldn't find the information related to it.
Thank you!
Best Regards,
I am getting the same error. Any resolution on this issue?
Thank you
You can't use checkpoints across code upgrades. That may or may not change
in the future, but for now that's a limitation of spark checkpoints
(regardless of whether you're using Kafka).
Some options:
- Start up the new job on a different cluster, then kill the old job once
it's caught up to
If the jobs are running on different topicpartitions, what's different
about them? Is one of them 120x the throughput of the other, for
instance? You should be able to eliminate cluster config as a difference
by running the same topic partition on the different clusters and comparing
the
Hi Akhil,
Yes, I did try to remove it, and I tried to build again.
However, that jar keeps getting recreated whenever I run ./build/sbt
assembly.
Thanks,
Rahul P
On Thu, Jul 30, 2015 at 12:38 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you try removing this jar?
Hi Cody, sorry, my bad, you were right: there was a typo in topicSet. When I
corrected the typo in topicSet it started working. Thanks a lot.
Regards
On Thu, Jul 30, 2015 at 7:43 PM, Cody Koeninger c...@koeninger.org wrote:
Can you post the code including the values of kafkaParams and topicSet,
Can you post the code including the values of kafkaParams and topicSet,
ideally the relevant output of kafka-topics.sh --describe as well
On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi thanks for the response. Like I already mentioned in the question kafka
topic
I register my class with Kryo in spark-defaults.conf as follows:
spark.serializer
org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired true
spark.kryo.classesToRegister ltn.analytics.es.EsDoc
But I got the following
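For reference, a sketch of the programmatic equivalent of those spark-defaults.conf entries, assuming the ltn.analytics.es.EsDoc class from the original post is on the classpath:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[ltn.analytics.es.EsDoc]))
Note that with registrationRequired set to true, Kryo also demands registration of every other class it encounters (including array and collection classes), which is a common source of exceptions like the one mentioned above.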
Hey all,
I'm running what should be a very straightforward application of the Cassandra
SQL connector, and I'm getting an error:
Exception in thread "main" java.lang.RuntimeException: Failed to load class for
data source: org.apache.spark.sql.cassandra
at
I'm submitting the application this way:
spark-submit test-2.0.5-SNAPSHOT-jar-with-dependencies.jar
I've confirmed that org.apache.spark.sql.cassandra and org.apache.cassandra
classes are in the jar.
Apologies for this relatively newbie question - I'm still new to both Spark and
Scala.
Looking at:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds
Maven artifacts should be here:
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
though the jars were dated July 16th.
Perhaps I'm missing what you are trying to accomplish, but if you'd like to
avoid the null values, do an inner join instead of an outer join.
Additionally, I'm confused about how the result of
joinedDF.filter(joinedDF("y").isNotNull).show still contains null values in the
column y. This doesn't
You can't use these names due to limitations in parquet (and the library
itself will silently generate corrupt files that can't be read, hence the
error we throw).
You can alias a column with df.select(df("old").alias("new")), which is
essentially what withColumnRenamed does. Alias in this case means
https://www.youtube.com/watch?v=JncgoPKklVE
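A small sketch of the rename-before-write pattern this refers to; the column name and output path are placeholders:
// Rename the offending column so parquet accepts it, then write
val cleaned = df.withColumnRenamed("Name with Space", "name_with_space")
cleaned.write.parquet("/tmp/cleaned_output")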
On Thu, Jul 30, 2015 at 1:30 PM, ziqiu...@accenture.com wrote:
Yes, and that is indeed the problem. It is trying to process all the data
in Kafka, and therefore taking 60 seconds. You need to set the rate limits
for that.
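The setting being referred to can go in spark-defaults.conf (or on the SparkConf); the value below is just an illustration:
spark.streaming.kafka.maxRatePerPartition   1000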
On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote:
If you don't set it, there is no maximum rate, it will get
Hello,
I have a simple file of mobile IDFAs. They look like:
gregconv = ['00013FEE-7561-47F3-95BC-CA18D20BCF78',
'000D9B97-2B54-4B80-AAA1-C1CB42CFBF3A',
'000F9E1F-BC7E-47E1-BF68-C68F6D987B96']
I am trying to make this RDD into a data frame
ConvRecord = Row("IDFA")
gregconvdf = gregconv.map(lambda
Is this a known bottleneck for Spark Streaming's textFileStream? Does it
need to list all the current files in a directory before it gets the new
files? Say I have 500k files in a directory, does it list them all in order
to get the new files?
The first time, it needs to list them. After that the list should be
cached by the file stream implementation (as far as I remember).
On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com
wrote:
Is this a known bottle neck for Spark Streaming textFileStream? Does it
need
Hi,
I am new to using Spark and Parquet files.
Below is what I am trying to do on spark-shell:
val df =
sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet")
I have also tried the command below:
val
If anyone's curious, the issue here is that I was using version 1.2.4 of the
DataStax Spark Cassandra connector, rather than the 1.4.0-M1 pre-release.
1.2.4 doesn't fully support data frames, and it's presumably still only
experimental in 1.4.0-M1.
Ben
From: Benjamin Ross
Sent:
Dear Michael, dear all,
Distinguishing those records that have a match in Mapping from those that
don't is the crucial point.
Record(x: Int, a: String)
Mapping(x: Int, y: Int)
Thus
Record(1, "hello")
Record(2, "bob")
Mapping(2, 5)
yields (2, "bob", 5) on an inner join.
BUT I'm also interested in
Umesh,
You can create a web-service in any of the languages supported by Spark and
stream the result from this web-service to your D3-based client using Websocket
or Server-Sent Events.
For example, you can create a webservice using Play. This app will integrate
with Spark streaming in the
Hi,
After I create a table in Spark SQL and load an HDFS file into it, the file
no longer shows up when I do hadoop fs -ls.
Is this expected?
Thanks,
Ron
You have to read the original Spark paper to understand how RDD lineage
works.
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
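A small illustration of lineage, runnable in spark-shell; the data here is made up:
val base    = sc.parallelize(1 to 1000)
val derived = base.map(_ * 2).filter(_ % 3 == 0)
// Prints the chain of parent RDDs Spark would replay to recompute a lost partition
println(derived.toDebugString)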
On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at:
core/src/test/scala/org/apache/spark/CheckpointSuite.scala
Hi,
I don't have a good understanding of how RDD lineage works, so I would like to ask
whether Spark provides a unit test in the code base that illustrates how RDD lineage
works. If there is, what is the class name?
Thanks!
bit1...@163.com
Can you share the part of the code in your script where you create the
SparkContext instance?
On Thu, Jul 30, 2015 at 7:19 PM, fordfarline fordfarl...@gmail.com wrote:
Hi All,
I'm having an issue when launching an app (Python) against a standalone
cluster, though it runs in local mode, as it doesn't
Please take a look at:
core/src/test/scala/org/apache/spark/CheckpointSuite.scala
Cheers
On Thu, Jul 30, 2015 at 7:39 PM, bit1...@163.com bit1...@163.com wrote:
Hi,
I don't have a good understanding of how RDD lineage works, so I would like to ask
whether Spark provides a unit test in the code base to
You might want to run spark-submit with option --deploy-mode cluster
On Thursday, July 30, 2015 7:24 PM, Marcelo Vanzin van...@cloudera.com
wrote:
Can you share the part of the code in your script where you create the
SparkContext instance?
On Thu, Jul 30, 2015 at 7:19 PM,
I do suggest that non-Spark-related discussions be taken outside
this forum, as they do not directly contribute to the contents of this user
list.
On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Thanks for the valuable suggestion.
I also started with
Thanks TD and Zhihong for the guide. I will check it
bit1...@163.com
From: Tathagata Das
Date: 2015-07-31 12:27
To: Ted Yu
CC: bit1...@163.com; user
Subject: Re: How RDD lineage works
You have to read the original Spark paper to understand how RDD lineage works.
The following is copied from the paper; it is related to RDD lineage.
Is there a unit test that covers this scenario (an RDD partition lost and recovered)?
Thanks.
If a partition of an RDD is lost, the RDD has enough information about how it
was derived from other RDDs to recompute
just
I have run into similar exceptions:
ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException,
org.apache.spark.SparkException: Couldn't find leader offsets for
Set([AdServe,1]))
and the issue has happened on the Kafka side, where my broker offsets go out of
sync or do not return
To clarify, that is the number of executors requested by the SparkContext
from the cluster manager.
On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote:
try sc.getConf.getInt("spark.executor.instances", 1)
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FailureSuite.scala
This may help.
On Thu, Jul 30, 2015 at 10:42 PM, bit1...@163.com bit1...@163.com wrote:
The following is copied from the paper, is something related with rdd
lineage. Is there a unit test that
Dear Michael, dear all,
motivation:
object OtherEntities {
case class Record( x:Int, a: String)
case class Mapping( x: Int, y: Int )
val records = Seq( Record(1, "hello"), Record(2, "bob") )
val mappings = Seq( Mapping(2, 5) )
}
Now I want to perform a *left outer join* on records and
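A sketch of that left outer join with DataFrames, using the OtherEntities definitions above; the toDF conversion assumes a SQLContext with its implicits imported:
import sqlContext.implicits._

val recordsDF  = OtherEntities.records.toDF()
val mappingsDF = OtherEntities.mappings.toDF()

// Every Record is kept; y is null for records with no matching Mapping
val joined = recordsDF.join(mappingsDF, recordsDF("x") === mappingsDF("x"), "left_outer")
joined.show()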
Hi Ted,
The problem is that I don't know whether the build uses the commits that happened
on the same day, or whether it is possible that it was built from the Jul 15th commits.
Just a thought: it might be possible to replace SNAPSHOT with the git
commit hash in the filename so people will know which commit the build is based on.
It seems to be an issue with the ES connector:
https://github.com/elastic/elasticsearch-hadoop/issues/482
Thanks
Best Regards
On Tue, Jul 28, 2015 at 6:14 AM, An Tran tra...@gmail.com wrote:
Hello all,
I am currently having an error with Spark SQL access Elasticsearch using
Elasticsearch Spark
What operation are you doing with streaming? Also, can you look in the
datanode logs and see what's going on?
Thanks
Best Regards
On Tue, Jul 28, 2015 at 8:18 AM, guoqing0...@yahoo.com.hk
guoqing0...@yahoo.com.hk wrote:
Hi,
I got a error when running spark streaming as below .
Like this?
val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
speachRecognizer(urls))
Let 24 be the total number of cores that you have on all the workers.
Thanks
Best Regards
On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote:
Hello, I am writing a
I can't see the application logs here. All the logs are going into stderr.
Can anybody help here?
On 30 July 2015 at 12:21, Jeetendra Gangele gangele...@gmail.com wrote:
I am running the command below; this is the default Spark Pi program, but it is
not running properly. All the logs are going to stderr, but at
Thanks for the suggestion, Akhil!
I looked at https://github.com/mbostock/d3/wiki/Gallery to learn more about
d3. All the examples described there use static data; how can we update our
heat map from new data if we store it in HBase or MySQL? I mean, do we
need to query back and forth for it?
Is
Hi All,
Can someone share insights on this?
On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi TD,
Thanks for the info. I have a scenario like this:
I am reading data from a Kafka topic. Let's say Kafka has 3 partitions
for the topic. In my
Did you try removing this jar? build/sbt-launch-0.13.7.jar
Thanks
Best Regards
On Tue, Jul 28, 2015 at 12:08 AM, Rahul Palamuttam rahulpala...@gmail.com
wrote:
Hi All,
I hope this is the right place to post troubleshooting questions.
I've been following the install instructions and I get
zipWithIndex gives you global indices, which is not what you want. You'll
want to use flatMap with a map function that iterates through each iterable
and returns the (String, Int, String) tuple for each element.
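A sketch of that per-group indexing, assuming the data is an RDD of (key, values) pairs such as the output of groupByKey; the sample data is made up:
val grouped = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "z"))).groupByKey()

// zipWithIndex on each local Iterable gives per-group indices, unlike RDD.zipWithIndex
val indexed = grouped.flatMap { case (key, values) =>
  values.zipWithIndex.map { case (value, i) => (key, i, value) }
}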
On Thu, Jul 30, 2015 at 4:13 AM, askformore [via Apache Spark User List]
You can easily push data to an intermediate storage from spark streaming
(like HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3
js.
Thanks
Best Regards
On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
I have just started using Spark Streaming and
sc.parallelize takes a second parameter which is the total number of
partitions, are you using that?
Thanks
Best Regards
On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios
kostas.koug...@googlemail.com wrote:
Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all
executors
15/07/30 12:13:35 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15:
SIGTERM
The AM is killed somehow, maybe due to preemption. Does it always happen?
The resource manager log would be helpful.
On Thu, Jul 30, 2015 at 4:17 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
I can't see the application
The difference is that one receives more data than the other two. I can
pass the topics as parameters, so I could execute the code with one topic
at a time and figure out which topic it is, although I guess that
it's the topic which gets the most data.
Anyway, those delays are pretty weird in
Hi, I have one Spark job which runs fine locally with less data, but when I
schedule it on YARN I keep getting the following ERROR; slowly all executors
get removed from the UI and my job fails:
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on
myhost1.com: remote Rpc
There is a workaround:
sc.parallelize(items, items.size / 2)
This way each executor will get a batch of 2 items at a time, simulating a
producer-consumer. With / 4 it will get 4 items.
Hi all,
1. *Columns in dataframes can be nullable and not nullable. Having a
nullable column of Doubles, I can use the following Scala code to filter all
non-null rows:*
val df = ... // some code that creates a DataFrame
df.filter(df("columnname").isNotNull)
We don't yet update nullability information based on predicates, as we
don't actually leverage this information in many places yet. Why do you
want to update the schema?
On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 martin.se...@googlemail.com
wrote:
Hi all,
1. *Columns in dataframes can be
Hello spark users,
I hope your week is going fantastic! I am having some trouble with TF-IDF
in MLlib and was wondering if anyone can point me in the right direction.
The data ingestion and the initial term frequency count code taken from the
example works fine (I am using the first
I am running the command below; this is the default Spark Pi program, but it is not
running properly. All the logs are going to stderr, yet at the terminal the job is
reported as succeeding. I guess there is some issue and the job is not launching at all.
/bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster
Hi all, I have tried to use a lambda expression in a Spark task, and it throws a
java.lang.IllegalArgumentException: Invalid lambda deserialization
exception. This exception is thrown when there is code like
transform(pRDD -> pRDD.map(t -> t._2)). The code snippet is below.
JavaPairDStream<String, Integer>
How is sleep not working? Are you doing
streamingContext.start()
Thread.sleep(xxx)
streamingContext.stop()
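If the goal is to run for a fixed period and then shut down, a sketch of an alternative to a bare Thread.sleep (assuming Spark 1.3+ for awaitTerminationOrTimeout):
streamingContext.start()
// Blocks for up to 60 seconds, then returns so the context can be stopped cleanly
streamingContext.awaitTerminationOrTimeout(60 * 1000)
streamingContext.stop(stopSparkContext = true, stopGracefully = true)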
On Wed, Jul 29, 2015 at 6:55 PM, anshu shukla anshushuk...@gmail.com
wrote:
If we want to stop the application after a fixed time period, how will it
work? (How to give the duration in
What is your cluster configuration (size and resources)?
If you do not have enough resources, then your executors will not run.
Moreover, allocating 8 cores to an executor is too much.
If you have a cluster with four nodes running NodeManagers, each equipped
with 4 cores and 8 GB of memory,
then
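As a rough illustration under those hypothetical numbers (leaving some cores and memory per node for the OS, the NodeManager and overhead), a submission might look like the following; the class name and jar are placeholders:
spark-submit --master yarn-cluster \
  --num-executors 4 --executor-cores 2 --executor-memory 4g \
  --class com.example.MyApp my-app.jar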
Yes, thanks, that sorted out the issue.
On 30/07/15 09:26, Akhil Das wrote:
sc.parallelize takes a second parameter which is the total number of
partitions, are you using that?
Thanks
Best Regards
On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios
kostas.koug...@googlemail.com
Hi Praveen,
In MLLib, the major difference is that RandomForestClassificationModel
makes use of a newer API which utilizes ML pipelines. I can't say for
certain if they will produce the same exact result for a given dataset, but
I believe they should.
Bryan
On Wed, Jul 29, 2015 at 12:14 PM,
Thanks, Jorn, for the response and for the pointers to Hive
optimization tips.
I believe I have done the applicable things to improve Hive
query performance, including but not limited to running on Tez, using
partitioning and bucketing, and using explain to make sure partition pruning
Hello spark users!
I am having some trouble with TF-IDF in MLlib and was wondering if
anyone can point me in the right direction.
The data ingestion and the initial term frequency count code taken from the
example works fine (I am using the first example from this page:
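For reference, a sketch along the lines of the MLlib feature-extraction example being referred to; the input path and the whitespace tokenization are placeholders:
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] = sc.textFile("/path/to/docs.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache() // IDF makes two passes over the term-frequency vectors

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)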
You need to fix your configuration so that the resource manager hostname/URL is
set... the address shown there is the listen-on-any-address default.
On 30 Jul 2015, at 10:47, Nirav Patel
npa...@xactlycorp.com wrote:
15/07/29 11:19:26 INFO client.RMProxy: Connecting to
Hi all,
Our data has lots of human-readable column names (names that include
spaces); is it possible to use these with Parquet and DataFrames?
When I try to write the DataFrame I get the following error
(I am using PySpark):
AnalysisException: Attribute name "Name with Space" contains invalid
Hi,
I have just started learning DataFrames in Spark and encountered the following
error.
I am creating the following:
*case class A(a1: String, a2: String, a3: String)*
*case class B(b1: String, b2: String, b3: String)*
*case class C(key: A, value: Seq[B])*
Now I have to build a DF with the structure
(key: {..}, value: {..}
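For what it's worth, a minimal sketch of building such a nested DataFrame from the case classes above; the values are placeholders:
import sqlContext.implicits._

val data = Seq(C(A("a1", "a2", "a3"), Seq(B("b1", "b2", "b3"))))
val df = data.toDF()
df.printSchema() // key: struct<a1,a2,a3>, value: array<struct<b1,b2,b3>>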
You might also want to consider broadcasting the models to ensure you get one
instance shared across cores in each machine; otherwise the model will be
serialised to each task and you'll get a copy per executor (roughly per core in
this instance).
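A sketch of that broadcast pattern, assuming a hypothetical pre-trained, serializable model object with a predict method and an input RDD named data:
val modelBroadcast = sc.broadcast(model)

val scored = data.map { record =>
  // Every task on an executor reuses the single broadcast copy
  // instead of deserialising its own
  modelBroadcast.value.predict(record)
}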
Simon
Sent from my iPhone
On 30 Jul 2015, at
Hi.
I am writing a Twitter connector using Spark Streaming, but it fetches
random tweets.
Is there any way to receive the tweets of a particular account?
I made an app on Twitter and used the credentials as given below.
def managingCredentials(): Option[twitter4j.auth.Authorization] =
{
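For the account question above, a minimal sketch assuming the spark-streaming-twitter TwitterUtils API, a StreamingContext named ssc, and the managingCredentials() helper from the snippet; "someAccount" is a placeholder screen name:
import org.apache.spark.streaming.twitter.TwitterUtils

val auth   = managingCredentials()
val stream = TwitterUtils.createStream(ssc, auth, Seq("someAccount"))

// The filters argument tracks keywords, so narrow further to the account itself
val fromAccount = stream.filter(_.getUser.getScreenName.equalsIgnoreCase("someAccount"))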
I'm executing a job with Spark Streaming and get this error every time, once
the job has been executing for a while (usually hours or days).
I have no idea why it's happening.
15/07/30 13:02:14 ERROR LiveListenerBus: Listener EventLoggingListener
threw an exception
You’d better also check the log of nodemanager, sometimes because your memory
usage exceeds the limit of Yarn container’s configuration.
I’ve met similar problem before, here is the warning log in nodemanager:
2015-07-07 17:06:07,141 WARN
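If that turns out to be the cause, one common mitigation (a sketch, not from this thread) is to raise the YARN memory overhead in spark-defaults.conf; the values are in MB and are only illustrative:
spark.yarn.executor.memoryOverhead   1024
spark.yarn.driver.memoryOverhead     1024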
Hi,
Just my two cents. I understand that your problem is that you have messages
with the same key in two different DStreams. What I would
do would be to make a union of all the DStreams with StreamingContext.union
or several calls to DStream.union, and then I would create a pair
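A rough sketch of that union-then-pair idea, assuming several DStreams that all carry (key, message) pairs; the names are hypothetical:
import org.apache.spark.streaming.dstream.DStream

def combine(streams: Seq[DStream[(String, String)]]): DStream[(String, Iterable[String])] = {
  val unified = streams.reduce(_ union _) // or ssc.union(streams)
  unified.groupByKey()                    // messages with the same key end up together
}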
I saw such an example in the docs:
--conf
spark.driver.extraJavaOptions=-Dlog4j.configuration=file://$path_to_file
but, unfortunately, it does not work for me.
On 30.07.2015 05:12, canan chen wrote:
Yes, that should work. What I mean is: is there any option in the
spark-submit command that I can specify
You can integrate it with any language (like PHP) and use AJAX calls to
update the charts.
Thanks
Best Regards
On Thu, Jul 30, 2015 at 2:11 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Thanks For the suggestion Akhil!
I looked at https://github.com/mbostock/d3/wiki/Gallery to know more
Hi,
I've read about the recent updates to the spark-streaming integration with
Kafka (I refer to the new approach without receivers).
In the new approach, metadata is persisted in checkpoint folders on HDFS
so that the Spark Streaming context can be recreated in case of failures.
This means that
I've imported a Json file which has this schema :
sqlContext.read.json(filename).printSchema
root
|-- COL: long (nullable = true)
|-- DATA: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- Crate: string (nullable = true)