You can also access the SparkConf using sc.getConf in the Spark shell, though for
the StreamingContext you can directly refer to sc as Akhil suggested.
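A quick sketch of what that looks like in the shell (just an illustration, not from the original thread):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = sc.getConf                            // copy of the SparkConf backing the shell's SparkContext
conf.get("spark.master")                         // read an individual setting
val ssc = new StreamingContext(sc, Seconds(1))   // reuse the existing context for streaming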
On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
In the shell you could do:
val ssc = new StreamingContext(sc, Seconds(1))
as
Hi,
For running Spark 1.2 on a Hadoop cluster with Kerberos, what Spark
configurations are required?
Using an existing keytab, can any examples be submitted to the secured cluster? How?
Thanks,
Hi Folks,
Apologies for cross posting :(
As some of you may already know, @ApacheCon NA 2015 is happening in Austin,
TX April 13th-16th.
This email is specifically written to attract all folks interested in
Science and Healthcare... this is an official call to arms! I am aware that
there are
I installed the custom build in standalone mode as normal. The master and slaves
started successfully.
However, I got an error when I ran a job. It seems to me from the error message
that some library was compiled against hadoop1, but my Spark was compiled
against hadoop2.
15/01/08 23:27:36 INFO
Hi Manoj,
As long as you're logged in (i.e. you've run kinit), everything should
just work. You can run klist to make sure you're logged in.
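For example, something along these lines (the keytab path, principal, and jar are placeholders, not from the thread):
$ kinit -kt /path/to/user.keytab user@EXAMPLE.COM
$ klist
$ spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi /path/to/spark-examples.jar 10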
On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Hi,
For running spark 1.2 on Hadoop cluster with Kerberos, what spark
Please ignore the keytab question for now; the question wasn't fully described.
Some old communication (Oct 14) says Spark is not certified with Kerberos.
Can someone comment on this aspect ?
On Thu, Jan 8, 2015 at 3:53 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Manoj,
As long as you're
I ran this with CDH 5.2 without a problem (sorry don't have 5.3
readily available at the moment):
$ HBASE='/opt/cloudera/parcels/CDH/lib/hbase/*'
$ spark-submit --driver-class-path $HBASE --conf
spark.executor.extraClassPath=$HBASE --master yarn --class
org.apache.spark.examples.HBaseTest
One approach is to first transform this RDD into a PairRDD, taking the
field you are going to aggregate on as the key.
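Something along these lines, for instance (a rough sketch; the file path and column positions are made up):
val rows = sc.textFile("data.csv").map(_.split(","))
val pairs = rows.map(r => (r(0), r(2).toDouble))   // key on column a, aggregate column c
val sums = pairs.reduceByKey(_ + _)
val avgs = pairs.mapValues(v => (v, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues { case (total, count) => total / count }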
On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I have a csv file with fields a, b, c.
I want to do aggregation (sum, average, ...) based
I ran the released Spark in cdh5.3.0 but got the same error. Has anyone tried to
run Spark in cdh5.3.0 using its newAPIHadoopRDD?
command:
spark-submit --master spark://master:7077 --jars
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
On Thu, Jan 8, 2015 at 4:09 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Some old communication (Oct 14) says Spark is not certified with Kerberos.
Can someone comment on this aspect ?
Spark standalone doesn't support Kerberos. Spark running on top of
YARN works fine with Kerberos.
--
I am trying to save an RDD as a text file to the local file system (Linux) but it does not
work.
Launch spark-shell and run the following
val r = sc.parallelize(Array("a", "b", "c"))
r.saveAsTextFile("file:///home/cloudera/tmp/out1")
IOException: Mkdirs failed to create
Suppose I give three file paths to the Spark context to read, and each file has a
schema in its first row. How can we skip the schema lines from the headers?
val rdd = sc.textFile("file1,file2,file3")
Hello,
Based on experiences with other software in virtualized environments I
cannot really recommend this. However, I am not sure how Spark reacts. You
may face unpredictable task failures depending on utilization, tasks
connecting to external systems (databases etc.) may fail unexpectedly and
Hello,
I am new to Spark. I have adapted an example code to do binary
classification using logistic regression. I tried it on rcv1_train.binary
dataset using LBFGS.runLBFGS solver, and obtained correct loss.
Now, I'd like to run code in parallel across 16 cores of my single CPU
socket. If I
Are you running the program in local mode or in standalone cluster mode?
Thanks
Best Regards
On Fri, Jan 9, 2015 at 10:12 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I try to save RDD as text file to local file system (Linux) but it does
not work
Launch spark-shell
saveAsHadoopFiles requires you to specify the output format, which I believe
you are not specifying anywhere, and hence the program crashes.
You could try something like this:
Class<? extends OutputFormat<?,?>> outputFormatClass = (Class<? extends
OutputFormat<?,?>>) (Class<?>)
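In Scala, a rough equivalent would be (a sketch; the stream name and output path are invented):
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._

wordCounts
  .map { case (word, count) => (new Text(word), new Text(count.toString)) }
  .saveAsHadoopFiles("hdfs:///user/example/counts", "txt",
    classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])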
Here's how you do it:
val joined_stream = myStream.transform((x: RDD[(String, String)]) =>
{ val prdd = new PairRDDFunctions[String, String](x)
prdd.join(myRDD) })
Thanks
Best Regards
On Thu, Jan 8, 2015 at 10:20 PM, Asim Jalis asimja...@gmail.com wrote:
Is there a way
Thank you Pankaj. We are able to create the Uber JAR (very good for binding all
dependency JARs together) and run it on spark-jobserver. One step further
than where we were.
However, we are now facing SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up. We may need to
Did you try something like:
val file = sc.textFile("/home/akhld/sigmoid/input")
val skipped = file.filter(row => !row.contains("header"))
skipped.take(10).foreach(println)
Thanks
Best Regards
On Fri, Jan 9, 2015 at 11:48 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Suppose I
1) Thank you everyone for the help once again...the support here is really
amazing and I hope to contribute soon!
2) The solution I actually ended up using was from this thread:
Hi,
On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis asimja...@gmail.com wrote:
One approach I was considering was to use mapPartitions. It is
straightforward to compute the moving average over a partition, except for
near the end point. Does anyone see how to fix that?
Well, I guess this is not
Looks like it is trying to save the file in HDFS.
Check if you have set any hadoop path in your system.
On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
Can you check permissions etc as I am able to run
r.saveAsTextFile("file:///home/cloudera/tmp/out1")
I came across this http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
You can take a look.
On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I have the similar kind of requirement where I want to push avro data into
parquet. But it seems you have
Yes, I am calling saveAsHadoopFiles on the DStream. However, when I
call print on the DStream it works? If I have to do foreachRDD to
saveAsHadoopFile, then why is it working for print?
Also, if I am doing foreachRDD, do I need connections, or can I simply put
the saveAsHadoopFiles inside the
Hi,
I'm wondering whether it is a good idea to overcommit CPU cores on
the spark cluster.
For example, in our testing cluster, each worker machine has 24
physical CPU cores. However, we are allowed to set the CPU core number to
48 or more in the spark configuration file. As a result,
I have a similar kind of requirement where I want to push avro data into
parquet. But it seems you have to do it on your own. There is the parquet-mr
project that uses hadoop to do so. I am trying to write a spark job to do
similar kind of thing.
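A rough sketch of what that job could look like with parquet-mr's AvroParquetOutputFormat (untested; the schema, record RDD and output path are placeholders):
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, avroSchema)       // avroSchema: org.apache.avro.Schema
records                                                  // records: RDD[GenericRecord]
  .map(r => (null.asInstanceOf[Void], r))
  .saveAsNewAPIHadoopFile("hdfs:///user/example/parquet-out",
    classOf[Void], classOf[GenericRecord],
    classOf[AvroParquetOutputFormat], job.getConfiguration)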
On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam
Any ideas? :)
From: Nathan nathan.mccar...@quantium.com.au
Date: Wednesday, 7 January 2015 2:53 pm
To: user@spark.apache.org
Subject: SparkSQL schemaRDD MapPartitions calls -
Hi spark users,
I'm using Spark SQL to create parquet files on HDFS. I would like to store
the avro schema in the parquet metadata so that non-Spark-SQL applications
can marshall the data without the avro schema, using the avro parquet reader.
Currently, schemaRDD.saveAsParquetFile does not allow to do
I'm building a system that collects data using Spark Streaming, does some
processing with it, then saves the data. I want the data to be queried by
multiple applications, and it sounds like the Thrift JDBC/ODBC server might
be the right tool to handle the queries. However, the documentation for
As per my understanding, RDDs do not get replicated; the underlying data does if
it's in HDFS.
On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I want to find the time taken for replicating an rdd in spark cluster
along with the computation time on the
On Thu, Jan 8, 2015 at 3:33 PM, freedafeng freedaf...@yahoo.com wrote:
I installed the custom build in standalone mode as normal. The master and slaves
started successfully.
However, I got an error when I ran a job. It seems to me from the error message
that some library was compiled against hadoop1,
Are you calling saveAsTextFiles on the DStream -- looks like it? Look
at the section called Design Patterns for using foreachRDD in the link
you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs...)
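Roughly like this (the stream name and output path are made up):
wordCounts.foreachRDD { (rdd, time) =>
  // saveAsTextFile is an RDD action, so it has to be called inside foreachRDD
  rdd.saveAsTextFile(s"hdfs:///user/example/output/batch-${time.milliseconds}")
}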
On Thu, Jan 8, 2015 at 5:20 PM, Su She suhsheka...@gmail.com wrote:
Hello
Hello Everyone,
Thanks in advance for the help!
I successfully got my Kafka/Spark WordCount app to print locally. However,
I want to run it on a cluster, which means that I will have to save it to
HDFS if I want to be able to read the output.
I am running Spark 1.1.0, which means according to
I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark
comes with SparkSQL by default. Also, will SparkSQL come enabled to access
the Hive Metastore? Is there an easier way to enable Hive support without
having to build the code with various switches?
Thanks,
Abhi
--
Abhi
Disclaimer: this seems more of a CDH question, I'd suggest sending
these to the CDH mailing list in the future.
CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but
it does not include the thrift server because of incompatibilities
with the CDH version of Hive. To use Hive
Hi Kevin,
Say A has 10 ids, so you are pulling data from B's data source only for
these 10 ids?
What if you load A and B as separate schemaRDDs and then do the join? Spark
will optimize the path anyway when the action is fired.
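For instance (just a sketch; the table and column names are invented):
a.registerTempTable("A")   // a, b: SchemaRDDs
b.registerTempTable("B")
val joined = sqlContext.sql("SELECT A.id, B.value FROM A JOIN B ON A.id = B.id")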
On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote:
Hi,
Could anyone share your experience on how to do this?
I have created a cluster and installed cdh5.3.0 on it with basically core +
HBase, but Cloudera installed and configured Spark in its parcels
anyway. I'd like to install our custom Spark on this cluster to use the
hadoop and hbase
Disclaimer: CDH questions are better handled at cdh-us...@cloudera.org.
But the question I'd like to ask is: why do you need your own Spark
build? What's wrong with CDH's Spark that it doesn't work for you?
On Thu, Jan 8, 2015 at 3:01 PM, freedafeng freedaf...@yahoo.com wrote:
Could anyone come
Hi, Cheng
I checked the Input data for each stage. For example, in my attached
screen snapshot, the input data is 1212.5MB, which is the total amount of
the whole table
[image: Inline image 1]
I also checked the input data for each task (in the stage detail
page), and the sum of
`--jars` accepts a comma-separated list of jars. See the usage about
`--jars`
--jars JARS Comma-separated list of local jars to include on the driver and
executor classpaths.
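For example (the class name and paths are placeholders):
spark-submit --class com.example.MyApp --master yarn \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  myapp.jar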
Best Regards,
Shixiong Zhu
2015-01-08 19:23 GMT+08:00 Guillermo Ortiz konstt2...@gmail.com:
I'm trying to execute
I'm trying to execute Spark from a Hadoop cluster. I have created this
script to try it:
#!/bin/bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_CLASSPATH=
for lib in `ls /user/local/etc/lib/*.jar`
do
SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib
done
Hi Experts,
I am running Spark inside a YARN job.
The spark-streaming job is running fine in CDH-5.0.0, but after the upgrade
to 5.3.0 it cannot fetch containers, with the below errors. Looks like the
container id is incorrect and a string is present in a place where it's
expecting a number.
thanks!
2015-01-08 12:59 GMT+01:00 Shixiong Zhu zsxw...@gmail.com:
`--jars` accepts a comma-separated list of jars. See the usage about
`--jars`
--jars JARS Comma-separated list of local jars to include on the driver and
executor classpaths.
Best Regards,
Shixiong Zhu
2015-01-08
When I try to execute my task with Spark, it starts to copy the jars it
needs to HDFS and it finally fails; I don't know exactly why. I have
checked HDFS and it copies the files, so that part seems to work.
I changed the log level to debug but there's nothing else to help.
What else does Spark
Hi Vanzin,
I am using the MapR distribution of Hadoop. The history server logs are created
by a job with the permissions:
drwxrwx--- - myusername mygroup 2 2015-01-08 09:14
/apps/spark/historyserver/logs/spark-1420708455212
However, the permissions of the higher directories
Hello everyone.
With respect to the configuration problem that I explained before, do you have
any idea what is wrong there?
The problem in a nutshell:
- When more than one master is started in the cluster, all of them are
scheduling independently, thinking they are all leaders.
- zookeeper
Hi all,
We have a web application that connects to a Spark cluster to trigger some
calculation there. It also caches a large amount of data in the Spark executors'
cache.
To meet high availability requirements we need to run 2 instances of our web
application on different hosts. Doing this
Hey Xuelin, which data item in the Web UI did you check?
On 1/7/15 5:37 PM, Xuelin Cao wrote:
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only
scan the columns that are included in my SQL.
However, in my
Actually it does cause builds with SBT 0.13.7 to fail with the error
Conflicting cross-version suffixes. I have raised a defect, SPARK-5143, for
this.
On Wed Jan 07 2015 at 23:44:21 Marcelo Vanzin van...@cloudera.com wrote:
This particular case shouldn't cause problems since both of those
Hi,
I'm facing this error on a Spark EC2 cluster: when a job is submitted it says
that native hadoop libraries are not found. I have checked spark-env.sh and
all the folders in the path but am unable to find the problem, even though the
folders contain them. Are there any performance drawbacks if we use
Hi,
I am using Eclipse writing Java code.
I am trying to create a Kafka receiver by:
JavaPairReceiverInputDStream<String, kafka.message.Message> a =
KafkaUtils.createStream(jssc, String.class, Message.class,
StringDecoder.class, DefaultDecoder.class,
kafkaParams,
Weird, which version did you use? I just tried a small snippet in the Spark
1.2.0 shell as follows, and the result shown in the web UI meets the
expectation quite well:
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
Spark SQL supports the Hive insertion statement (Hive 0.14.0 style insertion
is not supported though):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
The small SQL dialect provided in Spark SQL doesn't support insertion
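For example, going through a HiveContext, something like this should work (the table names are invented):
hiveContext.sql("INSERT OVERWRITE TABLE target_tbl SELECT * FROM source_tbl")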
This package is moved here: https://github.com/databricks/spark-avro
On 1/6/15 5:12 AM, yanenli2 wrote:
Hi All,
I want to use SparkSQL to manipulate data in Avro format. I
found a solution at https://github.com/marmbrus/sql-avro . However it
doesn't compile successfully anymore with
The + operator only handles numeric data types; you may register your
own concat function like this:
sqlContext.registerFunction("concat", (s: String, t: String) => s + t)
sqlContext.sql("select concat('$', col1) from tbl")
Cheng
On 1/5/15 1:13 PM, RK wrote:
The issue is happening when I try
Hello,
I tried to save a table created via the hive context as a parquet file but
whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf like:
setConf("spark.sql.parquet.compression.codec", "gzip")
the size of the generated files is always the same, so it seems like
Hmm. Can you set the permissions of /apps/spark/historyserver/logs
to 3777? I'm not sure HDFS respects the group id bit, but it's worth a
try. (BTW that would only affect newly created log directories.)
On Thu, Jan 8, 2015 at 1:22 AM, michael.engl...@nomura.com wrote:
Hi Vanzin,
I am using
Since checkpointing in streaming apps happens every checkpoint duration, in
the event of failure, how is the system able to recover the state changes
that happened after the last checkpoint?
Nevermind my last e-mail. HDFS complains about not understanding 3777...
On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin van...@cloudera.com wrote:
Hmm. Can you set the permissions of /apps/spark/historyserver/logs
to 3777? I'm not sure HDFS respects the group id bit, but it's worth a
try. (BTW
Hi, Cheng
In your code:
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()
Running the first sql, the whole table is not cached yet, so the input
data will be the original json file.
After it is cached, the json format data is removed, so the
Ah, my bad... You're absolutely right!
Just checked how this number is computed. It turned out that once an RDD
block is retrieved from the block manager, the size of the block is
added to the input bytes. Spark SQL's in-memory columnar format stores
all columns within a single partition into a
I am new to Apache Spark. I am trying my first project, a Space Saving
Counting Algorithm, and while it compiles in single core using
.setMaster("local"), it fails when using .setMaster("local[4]") or any
number > 1. My code
follows: import
I was adding some bad jars I guess. I deleted all the jars and copied
them again and it works.
2015-01-08 14:15 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
When I try to execute my task with Spark it starts to copy the jars it
needs to HDFS and it finally fails, I don't know exactly why. I
Hi,
I have imported the Spark source code into IntelliJ IDEA as an SBT project. I tried
to do a maven install in IntelliJ IDEA by clicking Install in the Spark Project
Parent POM (root), but it failed.
I would ask which profiles should be checked. What I want to achieve is starting
Spark in the IDE and Hadoop
Popular topic in the last 48 hours! Just about 20 minutes ago I
collected some recent information on just this topic into a pull
request.
https://github.com/apache/spark/pull/3952
On Thu, Jan 8, 2015 at 2:24 PM, Todd bit1...@163.com wrote:
Hi,
I have imported the Spark source code in Intellij
Hi,
Can you please explain which settings you changed?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-ConnectionManager-Corresponding-SendingConnection-to-ConnectionManagerId-tp17050p21035.html
Sent from the Apache Spark User List mailing list
I'm glad you solved this issue but have a follow-up question for you.
Wouldn't Akhil's solution be better for you after all? I run a similar
computation where a large set of data gets reduced to a much smaller
aggregate in an interval. If you do saveAsText without coalescing, I
believe you'd get the
Very interesting approach. Thanks for sharing it!
On Thu, Jan 8, 2015 at 5:30 PM, Enno Shioji eshi...@gmail.com wrote:
FYI I found this approach by Ooyala.
/** Instrumentation for Spark based on accumulators.
*
* Usage:
* val instrumentation = new
Is there a way to join non-DStream RDDs with DStream RDDs?
Here is the use case. I have a lookup table stored in HDFS that I want to
read as an RDD. Then I want to join it with the RDDs that are coming in
through the DStream. How can I do this?
Thanks.
Asim
You are looking for dstream.transform(rdd => rdd.op(otherRdd))
The docs contain an example on how to use transform.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
-kr, Gerard.
On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis asimja...@gmail.com
FYI I found this approach by Ooyala.
/** Instrumentation for Spark based on accumulators.
*
* Usage:
* val instrumentation = new SparkInstrumentation(example.metrics)
* val numReqs = sc.accumulator(0L)
* instrumentation.source.registerDailyAccumulator(numReqs, numReqs)
*
Depending on your use cases. If the use case is to extract small amount of
data out of teradata, then you can use the JdbcRDD and soon a jdbc input
source based on the new Spark SQL external data source API.
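A minimal JdbcRDD sketch (the JDBC URL, credentials, query and bounds are all placeholders):
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:teradata://host/DATABASE=db", "user", "pass"),
  "SELECT id, value FROM some_table WHERE id >= ? AND id <= ?",
  1L, 1000000L, 10,
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))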
On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
I have a
Thanks a lot for your reply.
In fact, I need to work on almost all the data in teradata (~100T). So, I
don't think that jdbcRDD is a good choice.
Cheers
Gen
On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:
Depending on your use cases. If the use case is to extract small
Thanks for the reply! Now I know that this package has moved here:
https://github.com/databricks/spark-avro
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-support-for-reading-Avro-files-tp20981p21040.html
Sent from the Apache Spark User List
Hi,
We have a file in an AWS S3 bucket that is loaded frequently. When accessing
that file from Spark, can we get the file properties by some method in Spark?
Regards
Raj
Sorry for the noise; but I just remembered you're actually using MapR
(and not HDFS), so maybe the 3777 trick could work...
On Thu, Jan 8, 2015 at 10:32 AM, Marcelo Vanzin van...@cloudera.com wrote:
Nevermind my last e-mail. HDFS complains about not understanding 3777...
On Thu, Jan 8, 2015 at
Use local[*] instead of local to grab all available cores. Using local just
grabs one.
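i.e. something like (the app name is just a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("SpaceSaving").setMaster("local[*]")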
Dean
On Thursday, January 8, 2015, mixtou mix...@gmail.com wrote:
I am new to Apache Spark, now i am trying my first project Space Saving
Counting Algorithm and while it compiles in single core using
Do note that this problem may be fixed in Spark 1.2, as we changed the
default transfer service to use a Netty-based one rather than the
ConnectionManager.
On Thu, Jan 8, 2015 at 7:05 AM, Spidy yoni...@gmail.com wrote:
Hi,
Can you please explain which settings did you changed?
--
View
Can you please file a JIRA issue for this? This will make it easier to
triage this issue.
https://issues.apache.org/jira/browse/SPARK
Thanks,
Josh
On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com
wrote:
Hello everyone.
With respect to the configuration problem that
Hi Mukesh,
Those line numbers in ConverterUtils in the stack trace don't appear to
line up with CDH 5.3:
https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java
Is it possible
The Julia code is computing the SVD of the Gram matrix. PCA should be
applied to the covariance matrix. -Xiangrui
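If it helps, MLlib's RowMatrix.computePrincipalComponents works from the covariance matrix as far as I know, so a rough sketch for comparison would be (the RDD name is a placeholder):
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// rows: RDD[org.apache.spark.mllib.linalg.Vector] holding the Iris features
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)   // principal components of the covariance matrix
val projected = mat.multiply(pc)             // data projected onto the first two components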
On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote:
Hi All,
I tried to do PCA for the Iris dataset
[https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib
Hi,
I've noticed running Spark apps on Mesos is significantly slower compared to
stand-alone or Spark on YARN.
I don't think it should be the case, so I am posting the problem here in
case someone has some explanation
or can point me to some configuration options I've missed.
I'm running the
Just to add to Sandy's comment, check your client configuration
(generally in /etc/spark/conf). If you're using CM, you may need to
run the Deploy Client Configuration command on the cluster to update
the configs to match the new version of CDH.
On Thu, Jan 8, 2015 at 11:38 AM, Sandy Ryza
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com
How did you run this benchmark, and is there an open version I can try it
with?
And what are your configurations, like spark.locality.wait, etc.?
Tim
On Thu, Jan 8, 2015 at 11:44 AM, mvle m...@us.ibm.com wrote:
Hi,
I've noticed running Spark apps on Mesos is significantly slower compared
to
In Spark Streaming, is there a way to initialize the state
of updateStateByKey before it starts processing RDDs? I noticed that there
is an overload of updateStateByKey that takes an initialRDD in the latest
sources (although not in the 1.2.0 release). Is there another way to do
this until this