Hi all, any help will be much appreciated. My Spark job runs fine, but in the
middle it starts losing executors because of a MetadataFetchFailed exception
saying the shuffle was not found at that location since the executor was lost.
On Jul 31, 2015 11:41 PM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi thanks for the
I have user logs that I have taken from a csv and converted into a DataFrame
in order to leverage the SparkSQL querying features. A single user will
create numerous entries per hour, and I would like to gather some basic
statistical information for each user; really just the count of the user
Hi,
Recently, I ran into some problems with scheduler delay in PySpark. I worked
on this problem for several days without success, so I am coming here to
ask for help.
I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to
merge the values by adding the two lists
if I do reduceByKey as
Hi, there was this issue for Scala 2.11.
https://issues.apache.org/jira/browse/SPARK-7944
It should be fixed on master branch. You may be hitting that.
Best,
Burak
On Sun, Aug 2, 2015 at 9:06 PM, Ted Yu yuzhih...@gmail.com wrote:
I tried the following command on master branch:
bin/spark-shell
In addition, you do not need to use --jars with --packages. --packages will
get the jar for you.
Best,
Burak
On Mon, Aug 3, 2015 at 9:01 AM, Burak Yavuz brk...@gmail.com wrote:
Hi, there was this issue for Scala 2.11.
https://issues.apache.org/jira/browse/SPARK-7944
It should be fixed on
Did anybody try to convert HiveQL queries to SparkSQL? If so, would you
share your experience, pros and cons, please? Thank you.
On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy bigdatatech...@gmail.com
wrote:
Thanks Jorn for the response and for the pointer questions to Hive
optimization tips.
Can you show related code in DriverAccumulator.java ?
Which Spark release do you use ?
Cheers
On Mon, Aug 3, 2015 at 3:13 PM, Anubhav Agarwal anubha...@gmail.com wrote:
Hi,
I am trying to modify my code to use HDFS and multiple nodes. The code
works fine when I run it locally in a single
Once you submit a pull request for some JIRA, the JIRA would be assigned to
you.
Cheers
On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya katariya.na...@gmail.com
wrote:
My username on the Apache JIRA is katariya.namit. Could one of the admins
please add me to the contributors group so that I
I think I just answered my own question. The privatization of the RDD API
might have resulted in my error, because this worked:
randomMatBr <- SparkR:::broadcast(sc, randomMat)
On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel deborah.sie...@gmail.com
wrote:
Hello,
In looking at the SparkR
I think this question applies regardless of whether I have two completely
separate Spark jobs or tasks on different machines, or two cores that are
part of the same task on the same machine.
If two jobs/tasks/cores/stages both save to the same parquet directory in
parallel like this:
We are using a local Hive context in order to run unit tests. Our unit
tests run perfectly fine if we run them one by one using sbt, as in the next
example:
sbt test-only com.company.pipeline.scalers.ScalerSuite.scala
sbt test-only com.company.pipeline.labels.ActiveUsersLabelsSuite.scala
However, if we
Hi Namit,
There's no need to assign a bug to yourself to say you're working on it.
The recommended way is to just post a PR on github - the bot will update
the bug saying that you have a patch open to fix the issue.
On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya katariya.na...@gmail.com
wrote:
Hello,
In looking at the SparkR codebase, it seems as if broadcast variables ought
to be working based on the tests.
I have tried the following in sparkR shell, and similar code in RStudio,
but in both cases got the same message
randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
Hi,
I am trying to modify my code to use HDFS and multiple nodes. The code
works fine when I run it locally in a single machine with a single worker.
I have been trying to modify it and I get the following error. Any hint
would be helpful.
java.lang.NullPointerException
at
Hello,
I'm planning to use DF1.except(DF2) to get difference between two
dataframes. I'd like to know how exactly this API works.
Both explain() and spark UI show except as an operation on its own.
Internally, does it do a hash partition of both dataframes?
If so will it do auto broadcast if
TestHive takes care of creating a temporary directory for each invocation
so that multiple test runs won't conflict.
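For reference, a minimal sketch of what such a suite can look like (assuming ScalaTest and the spark-hive test artifact on the test classpath; the suite and table names are made up):

import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.FunSuite

class IsolatedHiveSuite extends FunSuite {
  test("runs against an isolated warehouse") {
    // TestHive creates its metastore and warehouse in fresh temporary
    // directories for each invocation, so repeated sbt runs do not collide.
    TestHive.sql("CREATE TABLE IF NOT EXISTS tmp_isolated_test (x INT)")
    assert(TestHive.sql("SELECT * FROM tmp_isolated_test").count() == 0)
  }
}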
On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores ces...@gmail.com wrote:
We are using a local hive context in order to run unit tests. Our unit
tests runs perfectly fine if we run
Hi,
Can I use multiple updateStateByKey functions in the same streaming job? Suppose
I need to maintain the state of the user session in the form of a JSON and
counts of various other metrics which have different keys. Can I use
multiple updateStateByKey functions to maintain the state for different
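As far as I know, nothing stops you from calling updateStateByKey more than once on differently keyed streams; a rough Scala sketch follows, with made-up stream names, key types, and update logic (checkpointing must be enabled for stateful operations):

// sessionEvents: DStream[(String, String)] of (userId, sessionJson)
// metricEvents:  DStream[(String, Long)]   of (metricName, increment)
val updateSession: (Seq[String], Option[String]) => Option[String] =
  (values, state) => Some(values.lastOption.getOrElse(state.getOrElse("{}")))
val updateCount: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(state.getOrElse(0L) + values.sum)

val sessions = sessionEvents.updateStateByKey(updateSession) // latest session JSON per user
val counts   = metricEvents.updateStateByKey(updateCount)    // running total per metric

Each call maintains its own independent state; they only need to live in the same StreamingContext.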
Hi,
I recently installed Cloudera CDH 5.4.4. Spark comes shipped with this
version. I created Spark gateways, but I get the following error when I run the
Spark shell from the gateway. Does anyone have a similar experience? If
so, please share the solution. Google suggests copying the conf files from
Is your data skewed? What happens if you do rdd.count()?
On 4 Aug 2015 05:49, Jasleen Kaur jasleenkaur1...@gmail.com wrote:
I am executing a spark job on a cluster as a yarn-client(Yarn cluster not
an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
Hello,
I am running Spark 1.4.0 on Mesos 0.22.1, and usually I run my jobs in
coarse-grained mode.
I have written some single-threaded standalone Scala applications for a
problem
that I am working on, and I am unable to get a Spark solution that comes
close
to the performance of this
The code was written in 1.4 but I am compiling it and running it with 1.3.
import it.unimi.dsi.fastutil.objects.Object2ObjectOpenHashMap;
import org.apache.spark.AccumulableParam;
import scala.Tuple4;
import thomsonreuters.trailblazer.operation.DriverCalc;
import
That should not be a fatal error, it's just a noisy exception.
Anyway, it should go away if you add YARN gateways to those nodes (aside
from Spark gateways).
On Mon, Aug 3, 2015 at 7:10 PM, Upen N ukn...@gmail.com wrote:
Hi,
I recently installed Cloudera CDH 5.4.4. Spark comes shipped with
Hi Upen,
Did you deploy the client configs after assigning the gateway roles? You should
be able to do this from Cloudera Manager.
Can you try this and let us know what you see when you run spark-shell?
Guru Medasani
gdm...@gmail.com
On Aug 3, 2015, at 9:10 PM, Upen N ukn...@gmail.com
Putting your code in a file, I find the following on line 17:
stepAcc = new StepAccumulator();
However, I don't think that was where the NPE was thrown.
Another thing I don't understand is that there were two addAccumulator()
calls at the top of the stack trace, while in your code I
My username on the Apache JIRA is katariya.namit. Could one of the admins
please add me to the contributors group so that I can have a starter task
assigned to myself?
Thanks,
Namit
Hi Guru,
I am executing this on a DataStax Enterprise Spark node and the ~/.dserc file
exists, which contains the Cassandra credentials, but I am still getting the error.
Below is the command:
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars
Hi Satish,
Can you add more error or log info to the email?
Guru Medasani
gdm...@gmail.com
On Jul 31, 2015, at 1:06 AM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi,
I have submitted a Spark Job with the options jars, class, and master as local,
but I am getting the error below
dse
Here is the solution; this looks perfect for me.
Thanks for all your help.
http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
On 28 July 2015 at 23:27, Jörn Franke jornfra...@gmail.com wrote:
Can you put some transparent cache in front of the database? Or
Thanks Satish. I only see the INFO messages and don’t see any error messages in
the output you pasted.
Can you paste the log with the error messages?
Guru Medasani
gdm...@gmail.com
On Aug 3, 2015, at 11:12 PM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi Guru,
I am executing
Hi All,
I am running the Wikipedia parsing example from the Advanced
Analytics with Spark book.
https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112
The partitions of the RDD returned by
Which database is your table in - default or result? By default, Spark will
look for the table in the default database.
If the table exists in the result database, try prefixing the table name
with the database name, like select * from result.salarytest, or set the
database first by executing use result.
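For example, a sketch assuming a HiveContext bound to sqlContext, with the table and database names from the question:

// Either fully qualify the table...
val df1 = sqlContext.sql("SELECT * FROM result.salarytest")
// ...or switch the current database first and use the bare name.
sqlContext.sql("USE result")
val df2 = sqlContext.sql("SELECT * FROM salarytest")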
1. In Spark 1.3 (non-receiver / direct) - if my batch interval is 1 sec and I don't
set spark.streaming.kafka.maxRatePerPartition, is the default behaviour to
bring all messages from Kafka from the last offset to the current offset?
Say the number of messages was large and it took 5 sec to process them, so will
all jobs
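As an aside, a minimal way to cap the per-partition ingest rate if the default turns out to be too aggressive (the app name and value here are purely illustrative):

import org.apache.spark.SparkConf

// Limit the direct Kafka stream to at most 10,000 records per second
// per Kafka partition in each batch.
val conf = new SparkConf()
  .setAppName("kafka-direct-example")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")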
Is there any setting to allow --files to copy jars from the driver to the executor
nodes?
When I pass some jar files to the executors using --files and add them to the
executor classpath, it throws a File not found exception:
15/08/03 07:59:50 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8,
Are you sitting behind a firewall and accessing a remote master machine? In
that case, have a look at
http://spark.apache.org/docs/latest/configuration.html#networking; you
might want to set a few properties like spark.driver.host, spark.driver.port,
etc.
Thanks
Best Regards
On Mon, Aug 3,
Hi,
It's an application that maintains some state from the DStream using the
updateStateByKey() operation. It then selects some of the records from the
current batch using some criteria over the current values and the state, and
carries the remaining values over to the next batch.
Following is the pseudo code:
Sea, it exists, trust me. We have Spark in production under YARN.
If you want more control, use YARN if you can. At least it kills the
executor if it hogs memory. I am explicitly setting
spark.yarn.executor.memoryOverhead to the same size as the heap for one of our
processes.
For example:
Hello *,
We are trying to build some batch jobs using Spark on Mesos. Mesos offers
two main modes of deployment for a Spark job:
1. Fine-grained
2. Coarse-grained
When we run the Spark jobs in fine-grained mode, Spark uses the maximum
amount of offers from Mesos to run the job.
Your master log files will be in the Spark home folder/logs on the master
machine. Do they show an error?
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Check out Reifier at Spark Summit 2015
Reading from the input stream and the error stream (in separate threads) indeed
unblocked the launcher and it exited properly. Thanks for your responses!
Best regards,
Tomasz
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, July 31, 2015 19:20
To: Elkhan Dadashov
Cc: Tomasz Guziałek;
Can you tell us more about the streaming app? Which DStream operations are you
using?
On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote:
Hi,
I'm writing a Streaming application in Spark 1.3. After running for some
time, I'm getting the following exception. I'm sure that no other
In YARN client mode, the Spark driver URL is redirected to the YARN web proxy
server, but I don't want to use this dynamic name; is it still possible to
use host:port as in standalone mode?
All examples of Spark Stream programming that I see assume streams of lines
that are then tokenised and acted upon (like the WordCount example).
How do I process Streams that span multiple lines? Are there examples that I
can use?
Hi,
I am having a problem serializing a custom partitioner that I have written
that extends Externalizable. The partitioner wraps a java TreeSet which
stores table splits. There are thousands of splits.
I noticed earlier that my spark job was taking over 30 seconds just to
transmit a task to
Looks like related work is in progress. e.g.
SPARK-5158
Cheers
On Mon, Aug 3, 2015 at 10:05 AM, MrJew kouz...@gmail.com wrote:
Hello,
Similar to other cluster systems, e.g. Zookeeper and Hazelcast, Spark has the
problem that, while it is protected from the outside world, anyone having
access to
Are you looking for RDD.wholeTextFiles?
On 3 August 2015 at 10:57, Spark Enthusiast sparkenthusi...@yahoo.in
wrote:
All examples of Spark Stream programming that I see assume streams of
lines that are then tokenised and acted upon (like the WordCount example).
How do I process Streams that
Sorry.
SparkContext.wholeTextFiles
Not sure about streams.
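A rough sketch of how that could be used for multi-line records (the path and record delimiter are assumptions):

// Each element is (filePath, entire file contents), so records spanning
// multiple lines stay together; split on whatever delimiter the format uses.
val files = sc.wholeTextFiles("hdfs:///data/multiline/")
val records = files.flatMap { case (_, content) => content.split("\n\n") }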
On 3 August 2015 at 14:50, Michal Čizmazia mici...@gmail.com wrote:
Are you looking for RDD.wholeTextFiles?
On 3 August 2015 at 10:57, Spark Enthusiast sparkenthusi...@yahoo.in
wrote:
All examples of Spark Stream programming
Hello,
Similar to other cluster systems, e.g. Zookeeper and Hazelcast, Spark has the
problem that, while it is protected from the outside world, anyone having
access to the host can run a Spark node without the need for authentication.
Currently we are using Spark 1.3.1. Is there a way to enable
The redirect is there for security reasons: in a Kerberos-enabled
cluster the RM proxy does the authentication, then forwards the
requests to the running application. There's no obvious way to disable it in
the Spark application master, and I wouldn't recommend doing this anyway,
When I tried to compile against hbase 1.1.1, I got:
[ERROR]
/home/hbase/ssoh/src/main/scala/org/apache/spark/sql/hbase/SparkSqlRegionObserver.scala:124:
overloaded method next needs result type
[ERROR] override def next(result: java.util.List[Cell], limit: Int) =
next(result)
Is there plan to
Does RDD.cartesian involve shuffling?
Thanks!
@Silvio: the mapPartitions instantiates a HttpSolrServer, then for each
query string in the partition, sends the query to Solr using SolrJ, and
gets back the top N results. It then reformats the result data into one
long string and returns the key value pair as (query string, result string).
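Roughly what that pattern looks like, for anyone curious (a sketch only; the Solr URL, row count, and result formatting are placeholders, and the calls assume the SolrJ 4.x HttpSolrServer API):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

// queries: RDD[String]; one SolrJ client is created per partition and
// reused for every query string in that partition.
val results = queries.mapPartitions { iter =>
  val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1")
  iter.map { q =>
    val docs = solr.query(new SolrQuery(q).setRows(10)).getResults
    (q, docs.toString) // flatten the top-N docs into one result string
  }
}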
On 3 Aug 2015, at 10:05, MrJew kouz...@gmail.com wrote:
Hello,
Similar to other cluster systems e.g Zookeeper,
Actually, Zookeeper supports SASL authentication of your Kerberos tokens.
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zookeeper+and+SASL
Hazelcast. Spark has the
Hello!
I am developing a Spark program that uses both batch and streaming
(separately). They are both pretty much the exact same programs, except the
inputs come from different sources. Unfortunately, RDDs and DStreams
define all of their transformations in their own files, and so I have two
Hi,
I've run into some poor RF behavior, although not as pronounced as yours;
it would be great to get more insight into this one.
Thanks!
On Mon, Aug 3, 2015 at 8:21 AM pkphlam pkph...@gmail.com wrote:
Hi,
This might be a long shot, but has anybody run into very poor predictive
performance
In general, what is your configuration? Use --conf spark.logConf=true.
We have 1.4.1 in a production standalone cluster and haven't experienced what
you are describing.
Can you verify in the web UI that Spark indeed got your 50g per-executor limit?
I mean on the configuration page.. might be you are using
Hi Everyone,
I have been using Apache Spark for 2 weeks and as of now I am querying Hive
tables using the Spark Java API. It works fine in Hadoop single-node mode,
but when I tried the same code on a Hadoop multi-node cluster it throws
org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't
Just to be clear, did you rebuild your job against spark 1.4.1 as well as
upgrading the cluster?
On Mon, Aug 3, 2015 at 8:36 AM, Netwaver wanglong_...@163.com wrote:
Hi All,
I have a Spark Streaming + Kafka program written in Scala; it
works well on Spark 1.3.1, but after I migrate
Hi All,
I have a Spark Streaming + Kafka program written in Scala; it works
well on Spark 1.3.1, but after I migrated my Spark cluster to 1.4.1 and reran
this program, I get the exception below:
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream
0: Error starting
Hi Sujit,
From experimenting with Spark (and other documentation), my understanding
is as follows:
1. Each application consists of one or more Jobs
2. Each Job has one or more Stages
3. Each Stage creates one or more Tasks (normally, one Task per
Partition)
4. Master
This sounds like a bug. What version of Spark? And can you provide the
stack trace?
On Sun, Aug 2, 2015 at 11:27 AM, fuellee lee lifuyu198...@gmail.com wrote:
I'm trying to process a bunch of large json log files with spark, but it
fails every time with `scala.MatchError`, whether I give it
DStream's transform function helped me solve this issue elegantly. Thanks!
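For anyone else with the same problem, a minimal sketch of that approach (the function and input names are made up for illustration):

import org.apache.spark.rdd.RDD

// Shared RDD-level logic, written once.
def enrich(rdd: RDD[String]): RDD[(String, Int)] =
  rdd.map(line => (line, line.length))

// Batch path.
val batchResult = enrich(sc.textFile("hdfs:///logs/batch"))

// Streaming path: transform applies the same function to every micro-batch RDD.
val streamResult = lines.transform(rdd => enrich(rdd))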
On Mon, Aug 3, 2015 at 1:42 PM, Sidd S ssinga...@gmail.com wrote:
Hello!
I am developing a Spark program that uses both batch and streaming
(separately). They are both pretty much the exact same programs, except the
In general it needs to be a Seq of Tuples for the implicit toDF to work
(which is a little tricky when there is only one column).
scala> Seq(Tuple1(new java.sql.Timestamp(System.currentTimeMillis))).toDF("a")
res3: org.apache.spark.sql.DataFrame = [a: timestamp]
or with multiple columns
scala>
I am executing a spark job on a cluster as a yarn-client(Yarn cluster not
an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G.
- My input size is around 1.5TB.
My problem
Hi Sujit,
Can you spin it up with 4 (servers) * 4 (cores) = 16 cores, i.e. there should be 16
cores in your cluster, and try to use the same number of partitions? Also look at
http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-td23824.html
On Tue, Aug 4, 2015 at 1:46 AM, Ajay
I wanted to confirm whether this is now supported, such as in Spark v1.3.0.
I've read varying info online and just thought I'd verify.
Thanks