If YARN has the capacity to run both simultaneously, it will. You should ensure you
are not allocating too many executors for the first app, and leave some room
for the second.
You may want to run the applications on different YARN queues to control
resource allocation. If you run as a different
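For reference, a rough sketch of a submit command that targets a specific queue and caps the executors; the queue name and sizes are placeholders, not values from your cluster:
spark-submit --master yarn-client \
  --queue analytics \
  --num-executors 4 \
  --executor-memory 2g \
  --class com.example.MyApp my-app.jar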
Hi,
I have analytics data with timestamps on each element. I'd like to analyze
consecutive elements using Spark, but haven't figured out how to do this.
Essentially what I'd want is a transform from a sorted RDD [A, B, C, D, E]
to an RDD [(A,B), (B,C), (C,D), (D,E)]. (Or some other way to
Frans,
SparkR runs with R 3.1+. If possible, the latest version of R is recommended.
From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
Sent: Thursday, October 22, 2015 11:17 AM
To: Frans Thamura
Cc: Ajay Chander; Doug Balog; user spark mailing list
Subject: Re: Spark_1.5.1_on_HortonWorks
SparkR is
You can try something like this to filter by topic:
val kafkaStringStream = KafkaUtils.createDirectStream[...]
// you might want to create the Stream by fetching offsets from zk
kafkaStringStream.foreachRDD { rdd =>
  val topics = rdd.map(_._1).distinct().collect()
  if (topics.length > 0) {
Hi developers, I've encountered some problem with Spark, and before opening an
issue, I'd like to hear your thoughts.
Currently, if you want to submit a Spark job, you'll need to write the code,
make a jar, and then submit it with spark-submit or
org.apache.spark.launcher.SparkLauncher.
The Spark job-server project may help
(https://github.com/spark-jobserver/spark-jobserver).
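If you do end up driving it programmatically, a minimal SparkLauncher sketch looks roughly like this (the jar path, main class and master are placeholders):
import org.apache.spark.launcher.SparkLauncher
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("yarn-client")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
  .launch()
process.waitFor()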
--
Ali
On Oct 21, 2015, at 11:43 PM, ?? wrote:
> Hi developers, I've encountered some problem with Spark, and before opening
> an issue, I'd like to hear your thoughts.
>
I can see this artifact in public repos
http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.5.1
http://central.maven.org/maven2/org/apache/spark/spark-sql_2.10/1.5.1/spark-sql_2.10-1.5.1.jar
check your proxy settings or the list of repos you are using.
Deenar
On 22 October 2015
Hi,
I've been trying to load a Postgres table using the following expression:
val cachedIndex = cache.get("latest_legacy_group_index")
val mappingsDF = sqlContext.load("jdbc", Map(
"url" -> Config.dataSourceUrl(mode, Some("mappings")),
"dbtable" -> s"(select userid, yid, username from legacyusers
> On 22 Oct 2015, at 15:12, Ashish Shrowty wrote:
>
> I understand that there is some incompatibility with the API between Hadoop
> 2.6/2.7 and Amazon AWS SDK where they changed a signature of
>
Does the spark.deploy.zookeeper.url configuration work correctly when I
point it to a single virtual IP address with more hosts behind it (load
balancer or round robin)?
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
ZooKeeper FAQ also discusses this topic:
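For reference, the documented setup lists the ZooKeeper ensemble explicitly rather than a single address (the host names below are placeholders):
spark.deploy.recoveryMode    ZOOKEEPER
spark.deploy.zookeeper.url   zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir   /spark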
Yes, seems unnecessary. I actually tried patching the
com.databricks.spark.avro reader to only broadcast once per dataset,
instead of every single file/partition. It seems to work just as well, and
there are significantly fewer broadcasts, and I'm not seeing out-of-memory issues
any more. Strange that
On 22 Oct 2015, at 02:47, Ajay Chander
> wrote:
Thanks for your time. I have followed your inputs and downloaded
"spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I did a Pi
test everything seems to be working fine, except that
Hi all,
I am trying to launch a Spark job using yarn-client mode on a cluster. I
have already tried spark-shell with yarn and I can launch the application.
But I would also like to be able to run the driver program from, say, Eclipse,
while using the cluster to run the tasks. I have also added
I don't think it is possible in the way you are trying to do it. It is important
to remember that the statements you mention only set up the stream stages,
before the stream is actually running. Once it's running, you cannot
change, remove or add stages.
I am not sure how you determine your condition
Found the problem. There was an rdd.distinct hiding in my code that I had
overlooked, and that caused this behavior (because instead of iterating over
the raw RDD, I was instead iterating over the RDD which had been derived
from it).
Thank you everyone!
- Chris
Hi all,
Is there a way to run 2 spark applications in parallel under Yarn in the same
cluster?
Currently, if I submit 2 applications, one of them waits till the other one is
completed.
I want both of them to start and run at the same time.
Thanks,
Suman.
I need to take the value from a RDD and update the state of the other RDD.
Is this possible?
On 22 October 2015 at 16:06, Uthayan Suthakar
wrote:
> Hello guys,
>
> I have a stream job that will carryout computations and update the state
> (SUM the value). At some
Thanks Deenar for your response. I am able to get version 1.5.0 and other
lower versions; they are all fine, just not 1.5.1. It's hard to believe it's
the proxy settings.
What is interesting is that the Intellij does a few things when downloading
this jar: putting into .m2
The result of updatestatebykey is a dstream that emits the entire state every
batch - as an RDD - nothing special about it.
It is easy to join / cogroup with another RDD if you have the correct keys in both.
You could load this one when the job starts and/or have it update with
updatestatebykey as
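A rough sketch of that join (the names are hypothetical and assume both sides are keyed the same way):
val stateStream = events.updateStateByKey(updateFunc)  // DStream[(K, Long)] of running sums
val joined = stateStream.transform { stateRdd =>
  stateRdd.join(referenceRdd)  // referenceRdd: RDD[(K, V)] loaded when the job starts
}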
That way, you will eventually end up bloating up that list. Instead, you
could push the stream to a noSQL database (like hbase or cassandra etc) and
then read it back and join it with your current stream if that's what you
are looking for.
Thanks
Best Regards
On Thu, Oct 15, 2015 at 6:11 PM,
Hi Sampo,
You could try zipWithIndex followed by a self join with shifted index
values like this:
val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true)
val zipped = sorted.zipWithIndex.map(x => (x._2, x._1))
val pairs = zipped.join(zipped.map { case (i, v) => (i - 1, v) }).values  // consecutive (current, next) pairs
Did you try passing the mysql connector jar through --driver-class-path?
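For example (the jar path is a placeholder):
spark-submit --driver-class-path /path/to/mysql-connector-java-5.1.36.jar --class com.example.MyApp my-app.jar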
Thanks
Best Regards
On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel wrote:
> Hi all,
>
> I've been struggling with a particularly puzzling issue after upgrading to
> Spark 1.5.1 from Spark 1.4.1.
>
>
Yes, using log4j you can log everything. Here's a thread with an example:
http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs
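A minimal sketch of getting a logger inside a job (the logger name is arbitrary):
import org.apache.log4j.LogManager
val log = LogManager.getLogger("MyJob")
log.info("starting batch")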
Thanks
Best Regards
On Sun, Oct 18, 2015 at 12:10 AM, kali.tumm...@gmail.com <
Hi,
Sorry, I'm not very familiar with those methods and cannot find the 'drop'
method anywhere.
As an example:
val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true)
// ... then what?
Thanks.
Best regards,
*Sampo
Have a look at
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
if you haven't seen that already.
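A minimal sketch of the reflection approach described there (the case class and input parsing are placeholders):
case class Person(name: String, age: Int)
import sqlContext.implicits._
val df = rdd.map { line =>
  val p = line.split(",")
  Person(p(0), p(1).trim.toInt)
}.toDF()
df.registerTempTable("people")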
Thanks
Best Regards
On Thu, Oct 15, 2015 at 10:56 PM, java8964 wrote:
> Hi, Sparkers:
>
> I wonder if I can convert a RDD
Can you try fixing spark.blockManager.port to specific port and see if the
issue exists?
Thanks
Best Regards
On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi wrote:
> Hi,
>
> I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
> The job is reading data from
Did you read
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Thanks
Best Regards
On Thu, Oct 15, 2015 at 11:31 PM, jeff.sadow...@gmail.com <
jeff.sadow...@gmail.com> wrote:
> I am having issues trying to setup spark to run jobs simultaneously.
>
> I
Hi All
I am trying to access a SQLServer that uses Kerberos for authentication
from Spark. I can successfully connect to the SQLServer from the driver
node, but any connections to SQLServer from executors fails with "Failed to
find any Kerberos tgt".
Convert your data to parquet, it saves space and time.
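For example (the paths are placeholders for wherever your data lives):
val df = sqlContext.read.json("hdfs:///data/in")  // or however you currently load it
df.write.parquet("hdfs:///data/out_parquet")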
Thanks
Best Regards
On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 wrote:
> Hi,
> A lot of the data I have in HDFS is compressed. I noticed when I load this
> data into spark and cache it, Spark unrolls the data like
check spark.rdd.compress
On 19 October 2015 at 21:13, ahaider3 wrote:
> Hi,
> A lot of the data I have in HDFS is compressed. I noticed when I load this
> data into spark and cache it, Spark unrolls the data like normal but stores
> the data uncompressed in memory. For
I don't think the one that comes with spark would listen to specific user
feeds, but yes you can filter out the public tweets by passing the filters
argument. Here's an example for you to start
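A minimal sketch of passing filters (the keywords are hypothetical; a sketch, not the original example):
import org.apache.spark.streaming.twitter.TwitterUtils
val filters = Seq("spark", "hadoop")
val tweets = TwitterUtils.createStream(ssc, None, filters)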
Maven, in general, does some local caching to avoid hitting the repo
every time. It's possible this is why you're not seeing 1.5.1. On the
command line you can for example add "mvn -U ..." Not sure of the
equivalent in IntelliJ, but it will be updating the same repo IJ sees.
Try that. The repo
Hi Sandip,
Thanks for your response..
I am not sure if this is the same thing.
I am looking for a way to connect to external network as shown in the
example.
@All - Can anyone else let me know if they have a better solution?
Thanks
Nipun
On Wed, Oct 21, 2015 at 2:07 PM, Sandip Mehta
Hi,
I am getting following exception when submitting a job to Spark 1.5.x from
Scala. The same code works with Spark 1.4.1. Any clues as to what might be
causing the exception?
Code: App.scala
import org.apache.spark.SparkContext
object App {
  def main(args: Array[String]) = {
    val l =
Hi Sebastian,
You can save models to disk and load them back up. In the snippet below
(copied out of a working Databricks notebook), I train a model, then save
it to disk, then retrieve it back into model2 from disk.
import org.apache.spark.mllib.tree.RandomForest
> import
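The snippet got truncated above; a minimal save/load sketch along those lines (the parameters and paths are placeholders, not the notebook's exact code):
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
val model = RandomForest.trainClassifier(trainingData, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)
model.save(sc, "/tmp/myRandomForestModel")
val model2 = RandomForestModel.load(sc, "/tmp/myRandomForestModel")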
Hi, I’d like to know if there is a guarantee that Spark YARN shuffle service
has wire compatibility between 1.x versions.
I could run Spark 1.5 job with YARN nodemanagers having shuffle service 1.4,
but it might’ve been just a coincidence.
Now we're upgrading CDH from 5.3 to 5.4, whose
Hi,
In general in spark stream one can do transformations ( filter, map etc.)
or output operations (collect, forEach) etc. in an event-driven paradigm...
i.e. the action happens only if a message is received.
Is it possible to do actions every few seconds in a polling based fashion,
regardless if
RemoteActorRefProvider is in akka-remote_2.10-2.3.11.jar
jar tvf
~/.m2/repository/com/typesafe/akka/akka-remote_2.10/2.3.11/akka-remote_2.10-2.3.11.jar
| grep RemoteActorRefProvi
1761 Fri May 08 16:13:02 PDT 2015
akka/remote/RemoteActorRefProvider$$anonfun$5.class
1416 Fri May 08 16:13:02 PDT
Hi,
I have a Spark driver process that I have built into a single ‘fat jar’. This
runs fine, in Cygwin, on my development machine;
I can run:
scala -cp my-fat-jar-1.0.0.jar com.foo.MyMainClass
this works fine, it will submit Spark job, they process, all good.
However, on Linux (all Jars
Thanks Sean. I did that but it didn't seem to help. However, I manually
downloaded both the pom and jar files from the site, and then ran
mvn dependency:purge-local-repository to clean up the local repo (+ re-download
them all). All are good and then the error went away.
Thanks a lot for
Hi,
We have several udf's written in Scala that we use within jobs submitted into
Spark. They work perfectly with the sqlContext after being registered. We also
allow access to saved tables via the Hive Thrift server bundled with Spark.
However, we would like to allow Hive connections to use
Hi,
So I am working on a use case where clients are walking in and out of
geofences and sending messages based on that.
I currently have some in-memory broadcast vars to do certain lookups for
client and geofence info; some of this is also coming from Cassandra.
My current quandary is that I need
Another thing to check is to make sure each one of your executor nodes has the
JCE jars installed.
try { javax.crypto.Cipher.getMaxAllowedKeyLength("AES") > 128 }
catch { case e: java.security.NoSuchAlgorithmException => false }
Setting "-Dsun.security.krb5.debug=true" and
That sounds like a networking issue to me. Stuff to try
- make sure every executor node can talk to every kafka broker on relevant
ports
- look at firewalls / network config. Even if you can make the initial
connection, something may be happening after a while (we've seen ...
"interesting"...
I believe spark.rdd.compress requires the data to be serialized. In my
case, I have data already compressed which becomes decompressed as I try to
cache it. I believe even when I set spark.rdd.compress to *true*, Spark
will still decompress the data and then serialize it and then compress the
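(For context, spark.rdd.compress only applies to serialized storage levels; a minimal sketch, with a placeholder path:)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val conf = new SparkConf().set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)
val cached = sc.textFile("hdfs:///data/in").persist(StorageLevel.MEMORY_ONLY_SER)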
Can't say what is happening, and I have a similar problem here.
While for you the source is:
org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been
initialized
For me is:
org.apache.spark.SparkException:
org.apache.spark.streaming.dstream.MapPartitionedDStream@7a2d07cc has
A few months ago, I used the DB2 jdbc drivers. I hit a couple of issues
when using --driver-class-path. At the end, I used the following command to
bypass most of the issues:
./bin/spark-submit --jars
/Users/smile/db2driver/db2jcc.jar,/Users/smile/db2driver/db2jcc_license_cisuz.jar
--master local[*]
I understand that there is some incompatibility with the API between Hadoop
2.6/2.7 and Amazon AWS SDK where they changed a signature of
com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold.
The JIRA indicates that this would be fixed in Hadoop 2.8.
I have Avro records stored in Parquet files in HDFS. I want to read these
out as an RDD and save that RDD in Tachyon for any spark job that wants the
data.
How do I save the RDD in Tachyon? What format do I use? Which RDD
'saveAs...' method do I want?
Thanks
Can you increase the number of partitions and try... Also, I don't think you
need to cache the DataFrames before saving them... You can do away with that as well...
Raghav
On Oct 23, 2015 7:45 AM, "Ram VISWANADHA"
wrote:
> Hi ,
> I am trying to load 931MB file into an RDD, then
I'm managing Spark Streaming applications which run on Cloud Dataproc
(https://cloud.google.com/dataproc/). Spark Streaming applications running
on a Cloud Dataproc cluster seem to run in client mode on YARN.
Some of my applications sometimes stop due to the application failure.
I'd like YARN to
Looks like currently there's no way for Spark Streaming to restart
automatically in yarn-client mode, because in yarn-client mode the AM and
driver are two separate processes; YARN only controls the restart of the AM, not the
driver, so it is not supported in yarn-client mode.
You can write some scripts to monitor
Hi ,
I am trying to load 931MB file into an RDD, then create a DataFrame and store
the data in a Parquet file. The save method of Parquet file is hanging. I have
set the timeout to 1800 but still the system fails to respond and hangs. I
can’t spot any errors in my code. Can someone help me?
Hi Deenar,
The version of Spark you have may not be compiled with YARN support. If
you inspect the contents of the assembly jar, does
org.apache.spark.deploy.yarn.ExecutorLauncher exist? If not, you'll need
to find a version that does have the YARN classes. You can also build your
own using
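A quick way to check (the assembly jar name is a placeholder for yours):
jar tf spark-assembly-1.5.1-hadoop2.6.0.jar | grep 'org/apache/spark/deploy/yarn/ExecutorLauncher'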
Hi,
I have written a Spark Streaming application using a Kafka stream, and it's
writing to HDFS every hour (the batch time). I would like to know how to get
the offset or commit the offset of the Kafka stream while writing to HDFS, so that if
there is any issue or redeployment, I'll start from the point where I did a
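One common pattern with the direct stream is sketched below (the stream name is hypothetical, and how you persist the offsets is up to you):
import org.apache.spark.streaming.kafka.HasOffsetRanges
directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // write the batch to HDFS first, then store offsetRanges (e.g. in ZK or HDFS)
  // so a redeploy can resume via the fromOffsets argument of createDirectStream
}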
I was able to solve this by myself. What I did is changing the way spark
computes the partitioning for binary files.
1. Will Spark use disk when the memory is not enough with the MEMORY_ONLY
storage level?
2. If not, how can I set the storage level when I use Hive on Spark?
3. Does Spark intend to dynamically choose between Hive on MapReduce
and Hive on Spark, based on SQL features?
Thanks in advance
Best
Did you try coalesce? It doesn't shuffle the data around.
Thanks
Best Regards
On Wed, Oct 21, 2015 at 10:27 AM, shahid wrote:
> Hi
>
> I have a large partition(data skewed) i need to split it to no. of
> partitions, repartitioning causes lot of shuffle. Can we do that..?
>
>
Drop is a method on Scala's collections (Array, List, etc.), not on Spark's
RDDs. You can still use it as long as you work inside mapPartitions or something like
reduceByKey, but it totally depends on the use cases you have for analytics.
The others have suggested better solutions using only spark’s
Actually, I found a design issue in self joins. When we have multiple-layer
projections above an alias, the information about the relation between the alias
and the actual columns is lost. Thus, when resolving the alias in self joins,
the rules treat the alias (e.g., in Projection) as normal columns. This
Hey,
I am trying to figure out the best practice for saving and loading models which have
been fitted with the ML package - i.e. with the RandomForest classifier.
There is PMML support in the MLlib package afaik, but not in ML - is that correct?
How do you approach this, so that you do not have to fit
Hi,
Firstly want to say a big thanks to Cody for contributing the kafka
direct stream.
I have been using the receiver based approach for months but the
direct stream is a much better solution for my use case.
The job in question is now ported over to the direct stream doing
idempotent outputs
Hi Sampo,
There is a sliding method you could try inside the
org.apache.spark.mllib.rdd.RDDFunctions package, though it’s DeveloperApi stuff
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions)
import org.apache.spark.{SparkConf, SparkContext}
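The example was cut off at that import; a rough sketch of the idea (not the original snippet) is:
import org.apache.spark.mllib.rdd.RDDFunctions._
val sorted = rdd.sortByKey(true).values               // sorted RDD of elements
val pairs = sorted.sliding(2).map(a => (a(0), a(1)))  // (A,B), (B,C), ...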
Hello everyone,
I am doing some analytics experiments under a 4 server stand-alone cluster in a
spark shell, mostly involving a huge database with groupBy and aggregations.
I am picking 6 groupBy columns and returning various aggregated results in a
dataframe. GroupBy fields are of two types,
Ankit shared an issue with you
---
> Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4
> ---
>
> Key: SPARK-11213
> URL:
Hi, Saif,
Could you post your code here? It might help others reproduce the errors
and give you a correct answer.
Thanks,
Xiao Li
2015-10-22 8:27 GMT-07:00 :
> Hello everyone,
>
> I am doing some analytics experiments under a 4 server stand-alone cluster
> in a
Hi,
Excellent, the sliding method seems to be just what I'm looking for. Hope
it becomes part of the stable API, I'd assume there to be lots of uses with
time-related data.
Dylan's suggestion seems reasonable as well, if DeveloperApi is not an
option.
Thanks!
Best regards,
*Sampo
Hi Ankit,
Here is my solution for this:-
1) Download the latest Spark 1.5.1 (I just copied the following link from
spark.apache.org; if it doesn't work then grab a new one from the website.)
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz
2) Unzip the folder and rename/move
Hello guys,
I have a stream job that will carry out computations and update the state
(SUM the value). At some point, I would like to reset the state. I could
drop the state by setting 'None' but I don't want to drop it. I would like
to keep the state but update the state.
For example:
many use it.
How do you add the AWS SDK to the classpath?
Check in the Environment UI what is on the classpath.
You should make sure that the version on your classpath is compatible with the one
that Spark was compiled with.
I think 1.7.4 is compatible (at least we use it).
Make sure that you don't get other versions from other
Huh, indeed this worked, thanks. Do you know why this happens? Is that some
known issue?
Thanks,
Eugen
2015-10-22 19:08 GMT+07:00 Akhil Das :
> Can you try fixing spark.blockManager.port to specific port and see if the
> issue exists?
>
> Thanks
> Best Regards
>
> On
Hi
I am running a 10-node standalone cluster on AWS
and loading 100G of data into HDFS, doing a first groupBy operation,
and then generating pairs from the groupedrdd (key,[a1,b1],key,[a,b,c])
generating the pairs like
(a1,b1),(a,b),(a,c) ... n
PairRDD will get large in size.
some stats from ui when
Thanks. Sorry, I cannot share the data, and I'm not sure how significant it would
be for you.
I am reproducing the issue on a smaller piece of the content to see whether I
can find a reason for the inconsistency.
val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band
", "aget",
Hi Spark users and developers,
I read the ticket [SPARK-8578] (Should ignore user defined output committer
when appending data) which ignore DirectParquetOutputCommitter if append
mode is selected. The logic was that it is unsafe to use because it is not
possible to revert a failed job in append
Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 (
http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz)
working with CDH 5.4.0 in local mode on a cluster with Kerberos. It works
well including connecting to the Hive metastore. I am facing an issue
Solved!
The problem has nothing to do with the class and object refactoring. But in the
process of this refactoring I made a change that is similar to your code.
Before this refactoring, I processed the DStream inside the function that I
sent to StreamingContext.getOrCreate. After, I started processing
Never mind my last email. res2 is filtered so my test does not make sense. The
issue is not reproduced there. I have the problem somewhere else.
From: Ellafi, Saif A.
Sent: Thursday, October 22, 2015 12:57 PM
To: 'Xiao Li'
Cc: user
Subject: RE: Spark groupby and agg inconsistent and missing data
Hi - I tried to download the Spark SQL 2.10 and version 1.5.1 from Intellij
using the maven library:
-Project Structure
-Global Library, click on the + to select Maven Repository
-Type in org.apache.spark to see the list.
-The list result only shows version up to spark-sql_2.10-1.1.1
-I tried
On Thu, Oct 22, 2015 at 5:40 AM, Akhil Das
wrote:
> Did you read
> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
>
I did.
I had set the option
spark.scheduler.mode FAIR
in conf/spark-defaults.conf
and
created
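(For reference, the pool file the docs describe looks roughly like this, pointed to via spark.scheduler.allocation.file; the pool name is a placeholder.)
<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>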
I am seeing the same thing. It's gone completely crazy creating broadcasts
for the last 15 mins or so. Killing it...
On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote:
> Hi,
>
> Running spark 1.5.0 in yarn-client mode, and am curios in why there are so
> many broadcast