Hi,
by now I understand a bit better how spark-submit and YARN work
together, and how the Spark driver and workers interact on YARN.
Now for my use case, as described on
https://spark.apache.org/docs/latest/submitting-applications.html, I would
probably have an end-user-facing gateway that
When you get a stream from sc.fileStream(), Spark will only process files whose
timestamp is later than the current timestamp, so the data already in HDFS should
not be processed again. You may have another problem, though: Spark will not
process files that were moved into your HDFS folder between restarts of your
application.
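A minimal sketch of the behaviour described above (directory and interval are illustrative, not from this thread); textFileStream only picks up files whose modification time is newer than the moment the stream starts, so pre-existing HDFS files are skipped:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamSketch")
val ssc = new StreamingContext(conf, Seconds(30))
// Only files that appear in the directory after the stream starts are processed.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.count().print()
ssc.start()
ssc.awaitTermination()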
Hi,
My requirement is to extract certain fields from JSON files, run queries on
them and save the result to Cassandra.
I was able to parse the JSON, filter the result and save the (regular) RDD to
Cassandra.
Now, when I try to read the JSON file through sqlContext and execute some
queries on the same
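A rough sketch of that flow, assuming Spark 1.1's sqlContext.jsonFile and the spark-cassandra-connector on the classpath; the keyspace, table and column names below are illustrative:

import org.apache.spark.sql.SQLContext
import com.datastax.spark.connector._

val sqlContext = new SQLContext(sc)
val events = sqlContext.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")

// Query the JSON, keep only the fields needed, then write them to Cassandra.
val filtered = sqlContext.sql("SELECT user, email FROM events WHERE email IS NOT NULL")
filtered.map(row => (row.getString(0), row.getString(1)))
  .saveToCassandra("my_keyspace", "users", SomeColumns("user", "email"))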
Hi,
I have been launching Spark in the same way for the past few months, but I have
only recently started to have problems with it. I launch Spark using
spark-ec2 script, but then I cannot access the web UI when I type
address:8080 into the browser (it doesn't work with lynx either from the
master
I want to create a temporary variable in my Spark code.
Can I do this?
for (i <- num) {
  val temp = ...
  // do something
  temp.unpersist()
}
Thank You
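A self-contained sketch of the pattern being asked about (the input data and the per-iteration work are placeholders): the temporary RDD is scoped to the loop body and unpersisted before the next iteration.

import org.apache.spark.{SparkConf, SparkContext}

object TempRddLoop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TempRddLoop"))
    val input = sc.parallelize(1 to 1000)
    for (i <- 1 to 5) {
      // temp exists only inside this iteration
      val temp = input.map(_ * i).cache()
      println("iteration " + i + ": " + temp.count() + " records")
      temp.unpersist() // drop the cached blocks before the next iteration
    }
    sc.stop()
  }
}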
Which version of Spark are you using?
Thanks
Best Regards
On Thu, Sep 11, 2014 at 3:10 PM, mrm ma...@skimlinks.com wrote:
Hi,
I have been launching Spark in the same ways for the past months, but I
have
only recently started to have problems with it. I launch Spark using
spark-ec2
like this?
var temp = ...
for (i <- num) {
  temp = ...
  // do something
  temp.unpersist()
}
Thanks
Best Regards
On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
I want to create a temporary variable in my Spark code.
Can I do this?
for (i <- num) {
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit.
After several hours of trying to launch it, it now seems to be working. This is
what I did (not sure if any of these steps helped):
1/ clone the spark repo into the master node
2/ run sbt/sbt assembly
3/ copy spark and spark-ec2
Hello, we at Sematext (https://apps.sematext.com/) are writing a
monitoring tool for Spark and we came across one question:
how do we enable JMX metrics for a YARN deployment?
We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
into the file $SPARK_HOME/conf/metrics.properties, but it doesn't
Thanks for all.
I'm going to check both solutions.
Hi all
I am trying to run a Kinesis Spark Streaming application on a standalone
Spark cluster. The job works fine in local mode, but when I submit it (using
spark-submit), it doesn't do anything. I enabled logs
for org.apache.spark.streaming.kinesis package and I regularly get the
following in
Hi,
I am new to Spark. I encountered an issue when trying to connect to Cassandra
using the Spark Cassandra connector. Can anyone help me? Following are the details.
1) These are the Spark and Cassandra versions I am using on Lubuntu 12.0:
i) spark-1.0.2-bin-hadoop2
ii) apache-cassandra-2.0.10
2) In
You will have to create the keyspace and table.
See the message:
Table not found: EmailKeySpace.Emails
It looks like you have not created the Emails table.
On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala
karunya.pad...@infotech-enterprises.com wrote:
Hi,
I am new to spark. I
I have created a keyspace called 'EmailKeySpace' and a table called 'Emails' and
inserted some data in Cassandra. See my Cassandra console screenshot (attached).
Regards,
Karunya.
From: Reddy Raja [mailto:areddyr...@gmail.com]
Sent: 11 September 2014 18:07
To: Karunya
Has anyone tried using Raspberry Pi for Spark? How efficient is it to use
around 10 Pi's for local testing env ?
I agree, Gerard. Thanks for pointing this out.
Dib
On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas gerard.m...@gmail.com wrote:
This pattern works.
One note, though: use 'union' only if you need to group the data from all
RDDs into one RDD for processing (like a distinct count, or when you need a groupBy).
The Pi's bus speed, memory size, memory access speed, and processing ability are
all limited. The only benefit would be the low power consumption.
On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me
wrote:
Has anyone tried using Raspberry Pi for Spark? How efficient is it to use
around 10
Hi,
I'm guessing the problem is that the driver or executor cannot find the
metrics.properties configuration file inside the YARN container, so the metrics
system cannot load the right sinks.
Thanks
Jerry
From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com]
Sent: Thursday, September 11, 2014
Hi
I am trying the Spark sample program "SparkPi" and I got an "unable to
create new native thread" error. How can I resolve this?
14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644)
14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on
node1 (progress:
Hi Shao, thanks for the explanation. Any ideas how to fix it? Where should I
put the metrics.properties file?
On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi,
I'm guessing the problem is that the driver or executor cannot find the
metrics.properties configuration file in the
I think you can try using "spark.metrics.conf" to manually specify the path
of metrics.properties, but the prerequisite is that each container can find
this file on its local FS, because the file is loaded locally.
Besides, I think this is a kind of workaround; a better solution is
After every loop iteration I want the temp variable to cease to exist.
On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
like this?
var temp = ...
for (i <- num) {
  temp = ...
  // do something
  temp.unpersist()
}
Thanks
Best Regards
On Thu, Sep 11, 2014
Hi,
Can you attach more logs to see if there is any entry from ContextCleaner?
I ran into a very similar issue before... but it hasn't been resolved.
Best,
--
Nan Zhu
On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:
Dear All,
Not sure if this is a false alarm.
This is my case about broadcast variable:
14/07/21 19:49:13 INFO Executor: Running task ID 4
14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2)
14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106)
14/07/21 19:49:13 INFO TableOutputFormat:
Hi guys,
any luck with this issue, anyone?
I have likewise tried all the possible exclusion combos, to no avail.
Thanks for your ideas
reinis
-Original Message-
From: Stephen Boesch java...@gmail.com
To: user user@spark.apache.org
Date: 28-06-2014 15:12
Subject: Re: HBase 0.96+
Hi again. Yeah, I had tried "spark.metrics.conf" before my question
on the mailing list, with no luck :(
Any other ideas from anybody?
It seems nobody uses metrics in YARN deployment mode.
How about Mesos? I didn't try but maybe Spark has the same difficulties on
Mesos?
PS: Spark is a great thing in general,
Hi Vladimir
How about using the --files option with spark-submit?
- Kousuke
(2014/09/11 23:43), Vladimir Tretyakov wrote:
Hi again. Yeah, I had tried "spark.metrics.conf" before my
question on the mailing list, with no luck :(
Any other ideas from anybody?
It seems nobody uses metrics in a YARN deployment
Is there some doc that I missed that describes which execution engines Python
is supported for with Spark? If we use spark-submit with a YARN cluster, an
error is produced saying 'Error: Cannot currently run Python driver programs
on cluster'.
Thanks in advance
David
Thank you!! I can do this using saveAsTable with the schemaRDD, right?
Hi,
Can someone please tell me how to compile the Spark source code so that my
changes to the source code take effect? I was trying to ship the jars to all
the slaves, but in vain.
-Karthik
I am running a simple Spark Streaming program that pulls in data from
Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps the
data and persists it to a store.
The program is running in local mode right now and runs out of memory after
a while. I have yet to investigate heap dumps
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489
On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df
This dataframe is basically a map of RDDs of columns (along with a DSL
The heap size of the JVM cannot be changed dynamically, so you
need to configure it before running pyspark.
If you run it in local mode, you should configure spark.driver.memory
(in 1.1 or master).
Or, you can use --driver-memory 2G (this should work in 1.0+).
On Wed, Sep 10, 2014 at 10:43 PM, Mohit Singh
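For example (the 2g value is only illustrative):

# set the driver heap when launching PySpark (Spark 1.0+)
./bin/pyspark --driver-memory 2g
# or, in conf/spark-defaults.conf (Spark 1.1+):
# spark.driver.memory  2g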
In the spark source folder, execute `sbt/sbt assembly`
On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
Can someone please tell me how to compile the spark source code to effect
the changes in the source code. I was trying to ship the jars to all the
Hi, Kousuke,
Can you please explain in a bit more detail what you mean? I am new to Spark;
I looked at https://spark.apache.org/docs/latest/submitting-applications.html and
there seems to be no '--files' option.
Do I just have to add '--files /path-to-metrics.properties'? Is that an
undocumented ability?
Thx for
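A sketch of how the two suggestions in this thread could be combined (paths, class and jar names are placeholders): --files ships the file into each YARN container's working directory, and spark.metrics.conf then refers to it by its base name. Whether this resolves correctly may depend on your setup.

spark-submit \
  --master yarn-cluster \
  --files /local/path/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  --class com.example.MyApp \
  myapp.jar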
Limited memory could also cause you some problems and limit usability. If
you're looking for a local testing environment, vagrant boxes may serve you
much better.
On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote:
Pi's bus speed, memory size and access speed, and processing
I did change it to 1 GB. It still ran out of memory, but a little later.
The streaming job isn't handling a lot of data. Every 2 seconds it
doesn't get more than 50 records, and each record is no more than 500
bytes.
On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote:
Just curious... What's the use case you are looking to implement?
On Sep 11, 2014 10:50 PM, Daniil Osipov daniil.osi...@shazam.com wrote:
Limited memory could also cause you some problems and limit usability. If
you're looking for a local testing environment, vagrant boxes may serve you
much
We've found that the Raspberry Pi is not enough for Hadoop/Spark, mainly
because of the memory consumption. What we've built instead is a cluster formed
of 22 Cubieboards, each with 1 GB RAM.
Best regards,
-chanwit
--
Chanwit Kaewkasi
linkedin.com/in/chanwit
On Thu, Sep 11, 2014 at 8:04 PM, Sandeep Singh
Even when I comment out those 3 lines, I still get the same error. Did
someone solve this?
To answer my own question, in case someone else runs into this: the spark user
needs to be in the same group on the namenode, and HDFS caches that information
for what seems like at least an hour. It magically started working on its own.
Greg
From: Greg
Dependency hell... my favorite problem :).
I had run into a similar issue with HBase and Jetty. I can't remember the
exact fix, but here are excerpts from my dependencies that may be relevant:
val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
Hello all,
I'm trying to run a Driver on my local network with a deployment on EC2 and
it's not working. I was wondering if either the master or slave instances
(in standalone) connect back to the driver program.
I outlined the details of my observations in a previous post but here is
what I'm
Hi,
I have the following code snippet. It works fine in spark-shell, but in a
standalone app it reports "No TypeTag available for MySchema" at compile time
when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing?
Thanks,
Du
--
import org.apache.spark.sql.hive.HiveContext
This might be a better question to ask on the cassandra mailing list as I
believe that is where the exception is coming from.
On Thu, Sep 11, 2014 at 2:37 AM, lmk lakshmi.muralikrish...@gmail.com
wrote:
Hi,
My requirement is to extract certain fields from json files, run queries on
them and
Still fairly new to Spark so please bear with me. I am trying to write a
streaming app that has multiple workers that read from sockets and process
the data. Here is a very simplified version of what I am trying to do:
val carStreamSeq = (1 to 2).map( _ => ssc.socketTextStream(host, port)
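The snippet above is cut off; a self-contained sketch of the multi-receiver plus union pattern (host and ports are placeholders) could look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MultiSocketSketch")
val ssc = new StreamingContext(conf, Seconds(10))
// One receiver per socket; union them only if you need a single DStream.
val carStreamSeq = (1 to 2).map(i => ssc.socketTextStream("localhost", 9000 + i))
val carStream = ssc.union(carStreamSeq)
carStream.count().print()
ssc.start()
ssc.awaitTermination()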
I am seeing the same issue with Spark 1.0.1 (tried with file:// for a local file):
scala> val lines = sc.textFile("file:///home/monir/.bashrc")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> val linecount = lines.count
Michael Armbrust wrote
You'll need to run parquetFile(path).registerTempTable(name) to
refresh the table.
I'm not seeing that function on SchemaRDD in 1.0.2, is there something I'm
missing?
SchemaRDD Scaladoc
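As far as I can tell, in 1.0.x the method is named registerAsTable; registerTempTable is the 1.1 name. A sketch (path and table name are illustrative):

val parquetData = sqlContext.parquetFile("hdfs:///data/my_table.parquet")
parquetData.registerAsTable("my_table")       // Spark 1.0.x
// parquetData.registerTempTable("my_table")  // Spark 1.1+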
Solved it.
The problem occurred because the case class was defined within a test case in
FunSuite. Moving the case class definition out of the test fixed the problem.
From: Du Li l...@yahoo-inc.com.INVALIDmailto:l...@yahoo-inc.com.INVALID
Date: Thursday, September 11, 2014 at 11:25 AM
To:
Hi There,
I am new to Spark and I was wondering: when you have a lot of memory on each
machine of the cluster, is it better to run multiple workers with limited
memory on each machine, or is it better to run a single worker with access
to the majority of the machine's memory? If the answer is it
Hi,
I am trying to create a new table from a select query as follows:
CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED
BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/test/new_table' AS select * from table
this works in Hive, but in Spark SQL
The implementation of Spark SQL is currently incomplete. You may try it out
with HiveContext instead of SQLContext.
On 9/11/14, 1:21 PM, jamborta jambo...@gmail.com wrote:
Hi,
I am trying to create a new table from a select query as follows:
CREATE TABLE IF NOT EXISTS new_table ROW FORMAT
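A minimal sketch of running the same statement through HiveContext (assuming Spark 1.1, where HiveContext.sql defaults to the HiveQL dialect):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS new_table
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  STORED AS TEXTFILE LOCATION '/user/test/new_table'
  AS SELECT * FROM table
""")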
Just moving it out of the test is not enough. You must move the case class
definition to the top level. Otherwise it reports a runtime "task not
serializable" error when executing collect().
From: Du Li l...@yahoo-inc.com.INVALIDmailto:l...@yahoo-inc.com.INVALID
Date: Thursday, September 11,
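A sketch of the arrangement being described, with the case class at the top level of the file and illustrative field names; the suite spins up a local context only for the example:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.scalatest.FunSuite

// Top-level definition: gives the compiler a TypeTag and keeps tasks serializable.
case class MySchema(id: Int, name: String)

class MySchemaSuite extends FunSuite {
  test("createSchemaRDD with a top-level case class") {
    val sc = new SparkContext("local[2]", "MySchemaSuite")
    val hc = new HiveContext(sc)
    val rdd = sc.parallelize(Seq(MySchema(1, "a"), MySchema(2, "b")))
    val schemaRdd = hc.createSchemaRDD(rdd)
    assert(schemaRdd.collect().length == 2)
    sc.stop()
  }
}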
thanks. this was actually using hivecontext.
It seems that starting spark-shell in local mode solves this. But even then it
cannot recognize a file whose name begins with a '.'
MASTER=local[4] ./bin/spark-shell
.
scala> val lineCount = sc.textFile("/home/monir/ref").count
lineCount: Long = 68
scala> val lineCount2 =
Thank you, Aniket, for your hint!
Alas, I am facing a really hellish situation, it seems, because I have
integration tests using BOTH Spark and HBase (minicluster). Thus I get either:
class javax.servlet.ServletRegistration's signer information does not match
signer information of other classes
Which version of Spark are you running?
If you are running the latest one, could you try running not a window but
a simple event count on every 2-second batch, and see if you are still
running out of memory?
TD
On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com
This is very puzzling, given that this works in local mode.
Does running the Kinesis example work with your spark-submit?
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
The instructions are present
This was already answered at the bottom of this same thread -- read below.
On Thu, Sep 11, 2014 at 9:51 PM, sp...@orbit-x.de wrote:
class javax.servlet.ServletRegistration's signer information does not
match signer information of other classes in the same package
java.lang.SecurityException:
Hi,
I'm trying to make Spark work in a multithreaded Java application.
What I'm trying to do is:
- Create a single SparkContext
- Create multiple SparkILoop and SparkIMain instances
- Inject the created SparkContext into the SparkIMain interpreters.
A thread is created for every user request and takes a SparkILoop and
What is the schema of the table?
On Thu, Sep 11, 2014 at 4:30 PM, jamborta jambo...@gmail.com wrote:
thanks. this was actually using hivecontext.
1.0.1 does not have support for outer joins (that was added in 1.1). Can you try
the 1.1 branch?
On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com
wrote:
Hi Michael:
I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now; I can
show something:
1) my Spark version is 1.0.1
Hi,
Our Spark Streaming app is configured to pull data from Kafka with a 1-hour
batch duration; it performs aggregation of the data by specific keys and
stores the related RDDs to HDFS in the transform phase. We have tried a
checkpoint of 7 days on the Kafka DStream to ensure that the generated
stream
Iterating an RDD gives you each partition in order of its split index.
I'd like to be able to get each partition in reverse order, but I'm having
difficulty implementing the compute() method. I thought I could do
something like this:
override def getDependencies: Seq[Dependency[_]] = {
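One alternative sketch, rather than wiring up getDependencies in a custom RDD: pull the partitions one at a time in reverse split-index order with sc.runJob (using the Spark 1.x signature that takes an allowLocal flag):

val rdd = sc.parallelize(1 to 100, 4)
for (pid <- rdd.partitions.indices.reverse) {
  val Array(partData) =
    sc.runJob(rdd, (it: Iterator[Int]) => it.toArray, Seq(pid), allowLocal = false)
  println("partition " + pid + ": " + partData.mkString(", "))
}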
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 171 developers!
This release brings operational and performance improvements in Spark
core including a new
So I have a bunch of hardware with different core and memory setups. Is
there a way to do one of the following:
1. Express a ratio of cores to memory to retain. The spark worker config
would represent all of the cores and all of the memory usable for any
application, and the application would
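For reference, a sketch of the standalone-mode knobs involved (values are illustrative): per-node limits go in spark-env.sh on each worker, and per-application caps go in the application's own configuration.

# spark-env.sh on each worker node
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=64g

# per-application caps, e.g. in spark-defaults.conf or via --conf
spark.executor.memory  8g
spark.cores.max        8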
Hi,
I am using Spark 1.0.2 on a Mesos cluster. After I run my job, when I try to
look at the detailed application stats using the history server at port 18080, the
stats don't show up for some of the jobs, even though the jobs completed
successfully and the event logs are written to the log folder. The log
I see the binary packages include hadoop 1, 2.3 and 2.4.
Does Spark 1.1.0 support hadoop 2.5.0 at below address?
http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Friday,
Hi All,
I'm having some trouble with the coalesce and repartition functions for
SchemaRDD objects in pyspark. When I run:
sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}',
'{"foo": "baz"}'])).coalesce(1)
I get this error:
Py4JError: An error occurred while calling o94.coalesce. Trace:
Hi,
On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell pwend...@gmail.com wrote:
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 171 developers!
Great,
I'm not sure if I'm completely answering your question here, but I'm currently
working (on OS X) with Hadoop 2.5 and I have used Spark 1.1 built for Hadoop 2.4
without any issues.
On September 11, 2014 at 18:11:46, Haopu Wang (hw...@qilinsoft.com) wrote:
I see the binary packages include hadoop 1,
It sort of depends on the definition of efficiently. From a workflow
perspective I would agree, but from an I/O perspective, wouldn't there be the
same multiple passes, given that the Hive context needs to push the
data into HDFS? That said, if you're pushing the data into HDFS and
Please correct me if I'm wrong, but I was under the impression, per the Maven
repositories, that it was just to stay more in sync with the various versions of
Hadoop. Looking at the latest documentation
(https://spark.apache.org/docs/latest/building-with-maven.html), there are
multiple Hadoop
From the web page
(https://spark.apache.org/docs/latest/building-with-maven.html) that you pointed
out, it says: "Because HDFS is not protocol-compatible across versions, if you
want to read from HDFS, you'll need to build Spark against the specific HDFS
version in your environment."
Hi guys,
I configured Spark with the configuration in spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark"
And I started spark-shell on one master host1(active):
Yes, at least for my query scenarios, I have been able to use Spark 1.1 built for
Hadoop 2.4 against Hadoop 2.5. Note that Hadoop 2.5 is considered a relatively
minor release
(http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available)
whereas Hadoop 2.4 and 2.3 were considered
Got it, thank you, Denny!
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Friday, September 12, 2014 11:04 AM
To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell
Subject: RE: Announcing Spark 1.1.0!
Yes, atleast for my query
I've created SPARK-3499 https://issues.apache.org/jira/browse/SPARK-3499 to
track creating a Spark-based distcp utility.
Nick
On Tue, Aug 12, 2014 at 4:20 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Good question; I don't know of one but I believe people at Cloudera had
some thoughts of
When you re-ran sbt, did you clear out the packages first and ensure that
the datanucleus jars were generated within lib_managed? I remember
having to do that when I was testing out different configs.
On Thu, Sep 11, 2014 at 10:50 AM, alexandria1101
alexandria.shea...@gmail.com wrote:
Could you provide some context about running this in yarn-cluster mode?
The Thrift server that's included within Spark 1.1 is based on Hive 0.12.
Hive has been able to work against YARN since Hive 0.10. So when you start
the thrift server, provided you copied the hive-site.xml over to the Spark
Thanks for all the good work. Very excited about seeing more features and
better stability in the framework.
On Thu, Sep 11, 2014 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the
Thanks to everyone who contributed to implementing and testing this release!
Matei
On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote:
Thanks for all the good work. Very excited about seeing more features and
better stability in the framework.
On Thu, Sep 11, 2014 at
I have been doing that, but none of my modifications to the code are being
compiled.
On Thu, Sep 11, 2014 at 10:45 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
In the spark source folder, execute `sbt/sbt assembly`
On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar,
datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the lib/ folder
manually, and it works for me.
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Friday, September 12, 2014 11:28 AM
To: alexandria1101
Cc:
SchemaRDD has a method insertInto(table). When the table is partitioned, it
would be more sensible and convenient to extend it with a list of partition
keys and values.
From: Denny Lee denny.g@gmail.commailto:denny.g@gmail.com
Date: Thursday, September 11, 2014 at 6:39 PM
To: Du Li