That sounds good to me. Shall I open a JIRA / PR about updating the site
community page?
On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell patr...@databricks.com
wrote:
Hey Nick,
So I think what we can do is encourage people to participate on the
Stack Overflow topic, and this I think we can do
Do I need to manually install and configure Hadoop before doing anything with
spark standalone?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21339.html
Sent from the Apache Spark User List mailing list archive
If you're running the test via sbt, you can examine the classpath that sbt
uses for the test (show runtime:full-classpath or last run) -- I find this
helps once too many includes and excludes start to interact.
On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com
wrote:
I use spark
+1
On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That sounds good to me. Shall I open a JIRA / PR about updating the site
community page?
On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell patr...@databricks.com
wrote:
Hey Nick,
So I think what we can
I need someone to send me a snapshot of his /conf/spark-env.sh file because I
don't understand how to set some vars like SPARK_MASTER, etc.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21341.html
Sent from the
Hey all,
I am running into a problem where YARN kills containers for being over
their memory allocation (which is about 8G for executors plus 6G for
overhead), and I noticed that in those containers there are tons of
pyspark.daemon processes hogging memory. Here's a snippet from a container
with
It looks to me like your executor actually crashed and didn't just finish
properly.
Can you check the executor log?
It is available in the UI, or on the worker machine, under
$SPARK_HOME/work/app-20150123155114-/6/stderr (unless you manually changed the work
directory location, but in that
Not needed. You can directly follow the installation.
However, you might need sbt to package your files into a jar.
On Fri, Jan 23, 2015 at 11:54 AM, riginos samarasrigi...@gmail.com wrote:
Do I need to manually install and configure Hadoop before doing anything
with
spark standalone?
--
View
Hello Mingyu,
That is a reasonable way of doing this. Spark Streaming does not natively
support sticky placement because Spark launches tasks based on data
locality. If there is no locality (for example, reduce tasks can run
anywhere), the location is randomly assigned. So the cogroup or join
introduces a locality
Thanks. I looked at the dependency tree. I did not see any dependent jar of
hadoop-core from hadoop2. However, the jar built by Maven has the class:
org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl.class
Do you know why?
Date: Fri, 23 Jan 2015 17:01:48 +
Subject: RE: spark
Using Spark 1.2
Read a CSV file, apply a schema to convert to a SchemaRDD, and then
schemaRdd.saveAsParquetFile.
If the schema includes TimestampType, it gives the following trace when doing
the save:
Exception in thread main java.lang.RuntimeException: Unsupported datatype
TimestampType
at
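For illustration only, a minimal sketch of one possible workaround, assuming a hypothetical two-column schema (the real schema isn't shown here): store the timestamp as epoch milliseconds in a LongType column instead of TimestampType.

import org.apache.spark.sql.SQLContext

// Hypothetical schema; epoch millis (Long) instead of java.sql.Timestamp
case class Event(id: String, tsMillis: Long)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD (Spark 1.2)

val events = sc.textFile("events.csv")
  .map(_.split(","))
  .map(f => Event(f(0), java.sql.Timestamp.valueOf(f(1)).getTime))

events.saveAsParquetFile("events.parquet")  // LongType avoids the TimestampType error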
Hello,
I'm trying to kick off a Spark Streaming job to a standalone master using
spark-submit inside of init.d. This is what I have:
DAEMON="spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar"
start() {
$DAEMON -p
I am running into some problems with Spark Streaming when reading from
Kafka. I used Spark 1.2.0 built on CDH5.
The example is based on:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
* It works with default
I've tried various ideas, but I'm really just shooting in the dark.
I have an 8 node cluster of r3.8xlarge machines. The RDD (with 1024 partitions)
I'm trying to save off to S3 is approximately 1TB in size (with the partitions
pretty evenly distributed in size).
I just tried a test to dial
https://issues.apache.org/jira/browse/SPARK-5390
On Fri Jan 23 2015 at 12:05:00 PM Gerard Maas gerard.m...@gmail.com wrote:
+1
On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That sounds good to me. Shall I open a JIRA / PR about updating the site
Hey Darin,
Are you running this over EMR or as a standalone cluster? I've had
occasional success in similar cases by digging through all executor logs
and trying to find exceptions that are not caused by the application
shutdown (but the logs remain my main pain point with Spark).
That aside,
I'm trying to understand the basics of Spark internals and Spark
documentation for submitting applications in local mode says for
spark-submit --master setting:
local[K] Run Spark locally with K worker threads (ideally, set this to the
number of cores on your machine).
local[*] Run Spark locally
Do you mind filing a JIRA issue for this which includes the actual error
message string that you saw? https://issues.apache.org/jira/browse/SPARK
On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
I am not sure if you get the same exception as I do --
Local mode still parallelizes calculations, and it is useful for
debugging as it goes through the same serialization/deserialization steps as
a cluster would.
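For illustration, a minimal sketch of the same idea (the app name and numbers are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] uses one worker thread per core; tasks and closures are still
// serialized, so serialization bugs show up much as they would on a cluster.
val conf = new SparkConf().setAppName("local-debug").setMaster("local[*]")
val sc = new SparkContext(conf)

val counts = sc.parallelize(1 to 1000, numSlices = 8).map(_ % 10).countByValue()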
On Fri, Jan 23, 2015 at 5:44 PM, olegshirokikh o...@solver.com wrote:
I'm trying to understand the basics of Spark internals and Spark
Spark Committers: Please advise the way forward for this issue.
Thanks for your support.
Regards,
Venkat
From: Venkat, Ankam
Sent: Thursday, January 22, 2015 9:34 AM
To: 'Frank Austin Nothaft'; 'user@spark.apache.org'
Cc: 'Nick Allen'
Subject: RE: How to 'Pipe' Binary Data in Apache Spark
How
https://issues.apache.org/jira/browse/SPARK-5389
I marked it as minor since I also just discovered that I can run it under
PowerShell just fine. Vladimir, feel free to change the bug if you're
getting a different message or a more serious issue.
On Fri, Jan 23, 2015 at 4:44 PM, Josh Rosen
Thanks for the ideas Sven.
I'm using a stand-alone cluster (Spark 1.2).
FWIW, I was able to get this running (just now). This is the first time it's
worked in probably my last 10 attempts.
In addition to limiting the executors to only 50% of the cluster, in the
settings below I additionally
Thanks. But I think I already marked all the Spark and Hadoop deps as provided.
Why is the cluster's version not used?
Anyway, as I mentioned in the previous message, after changing
hadoop-client to version 1.2.1 in my Maven deps, I already get past that exception
and hit another one as
These are all definitely symptoms of mixing incompatible versions of libraries.
I'm not suggesting you haven't excluded Spark / Hadoop, but, this is
not the only way Hadoop deps get into your app. See my suggestion
about investigating the dependency tree.
On Fri, Jan 23, 2015 at 1:53 PM, ey-chih
Sorry, I still did not quite get your resolution. In my jar, there are the
following three related classes:
org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl.class
org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl$DummyReporter.class
org/apache/hadoop/mapreduce/TaskAttemptContext.class
I
Hi Sven,
What version of Spark are you running? Recent versions have a change that
allows PySpark to share a pool of processes instead of starting a new one
for each task.
-Sandy
On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser kras...@gmail.com wrote:
Hey all,
I am running into a problem
Hi,
I have written a Spark Streaming Kafka receiver using the Kafka simple consumer
API:
https://github.com/mykidong/spark-kafka-simple-consumer-receiver
This Kafka receiver can be used as an alternative to the current Spark Streaming
Kafka receiver, which is written with the high-level Kafka consumer
API.
Hi,
In that case, you can try the following.
val joinRDD = kafkaStream.transform(streamRDD => {
  val ids = streamRDD.map(_._2).collect();
  ids.map(userId => ctable.select("user_name").where("userid = ?",
    userId).toArray(0).get[String](0))
// better create a query which checks for all those ids at
Can you tell us the problem you're facing?
Please see this thread:
http://search-hadoop.com/m/JW1q5SsB5m
Cheers
On Fri, Jan 23, 2015 at 9:02 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Is there a better programming construct than a while loop in Spark?
Thank You
Hi,
Is there a better programming construct than a while loop in Spark?
Thank You
YARN only has the ability to kill, not to checkpoint or signal-suspend. If you
use too much memory, it will simply kill tasks based upon the YARN config.
https://issues.apache.org/jira/browse/YARN-2172
On Friday, January 23, 2015, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Sven,
What version of
I'm not sure about this, but I suspect the answer is: Spark doesn't
guarantee a stable sort, nor does it plan to in the future, so the
implementation has more flexibility.
But you might be interested in the work being done on secondary sort, which
could give you the guarantees you want:
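For illustration, a sketch of one secondary-sort style pattern (this is not the referenced work, just an example; names are made up) using repartitionAndSortWithinPartitions, available as of Spark 1.2:

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.2

// Partition on the primary key (user) only, but sort by the full
// (user, timestamp) key within each partition.
class UserPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = {
    val user = key.asInstanceOf[(String, Long)]._1
    math.abs(user.hashCode) % parts
  }
}

val pairs = sc.parallelize(Seq(
  (("alice", 3L), "c"), (("alice", 1L), "a"), (("bob", 2L), "b")))

val sorted = pairs.repartitionAndSortWithinPartitions(new UserPartitioner(4))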
I'd do the same, but add an extra condition to check whether the job has
successfully started by checking the application UI (port 4040
availability would do; if you want a more complex check, write a
parser for it), after putting the main script to sleep for some time
(say 2 minutes).
Hey Adam,
I'm not sure I understand just yet what you have in mind. My takeaway from
the logs is that the container actually was above its allotment of about
14G. Since 6G of that are for overhead, I assumed there to be plenty of
space for Python workers, but there seem to be more of those than
It sounds like a bug: the Python worker did not exit normally. Could you
file a JIRA for this?
Also, could you show how to reproduce this behavior?
On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser kras...@gmail.com wrote:
Hey Adam,
I'm not sure I understand just yet what you have in mind. My
Can you also try increasing the Akka frame size?
.set("spark.akka.frameSize", "50") // Set it to a higher number
Thanks
Best Regards
On Sat, Jan 24, 2015 at 3:58 AM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
Thanks for the ideas Sven.
I'm using stand-alone cluster (Spark 1.2).
FWIW, I
Hey Sandy,
I'm using Spark 1.2.0. I assume you're referring to worker reuse? In this
case I've already set spark.python.worker.reuse to false (but I also saw
this behavior when keeping it enabled).
Best,
-Sven
On Fri, Jan 23, 2015 at 4:51 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi
ssc.union will return a DStream; you should do something like:
val lines = ssc.union(parallelInputs)
lines.print()
Thanks
Best Regards
On Sat, Jan 24, 2015 at 12:55 AM, Chen Song chen.song...@gmail.com wrote:
I am running into some problems with Spark Streaming when reading from
Kafka. I
Which variable is it that you don't understand?
Here's a minimalistic spark-env.sh of mine.
export SPARK_MASTER_IP=192.168.10.28
export HADOOP_CONF_DIR=/home/akhld/sigmoid/localcluster/hadoop/conf
export HADOOP_HOME=/home/akhld/sigmoid/localcluster/hadoop/
Thanks
Best Regards
On Fri, Jan 23,
After I changed the dependency to the following:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.2.1</version>
  <exclusions>
This is kind of a how-long-is-a-piece-of-string question. There is no
one tuning for 'terabytes of data'. You can easily run a Spark job
that processes hundreds of terabytes with no problem with defaults --
something trivial like counting. You can create Spark jobs that will
never complete -- trying
Hi,
The histogram method returns normal Scala types, not an RDD, so you will not
have saveAsTextFile.
You can use the makeRDD method to make an RDD out of the data and save it with saveAsObjectFile:
val hist = a.histogram(10)
val histRDD = sc.makeRDD(hist)
histRDD.saveAsObjectFile(path)
On Fri, Jan 23, 2015 at 5:37 AM, SK
So, you should not depend on Hadoop artifacts unless you use them
directly. You should mark Hadoop and Spark deps as provided. Then the
cluster's version is used at runtime with spark-submit. That's the
usual way to do it, which works.
If you need to embed Spark in your app and are running it
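For example, a minimal build.sbt along those lines (versions here are illustrative):

name := "my-spark-app"

scalaVersion := "2.10.4"

// Provided at runtime by spark-submit and the cluster, so not bundled in the jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0" % "provided"
)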
Hi,
I have exactly the same issue. I've tried to mark the libraries as 'provided',
but then the IntelliJ IDE seems to have deleted the libraries locally; that is,
I am not able to build/run the stuff in the IDE.
Is the issue resolved? I'm not very experienced in sbt. I've tried to
exclude the
Use mvn dependency:tree or sbt dependency-tree to print all of the
dependencies. You are probably bringing in more servlet API libs from
some other source?
On Fri, Jan 23, 2015 at 10:57 AM, Marco marco@gmail.com wrote:
Hi,
I've exactly the same issue. I've tried to mark the libraries as
Below is code I have written. I am getting NotSerializableException. How
can I handle this scenario?
kafkaStream.foreachRDD(rdd => {
  println()
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach(
      record => {
        // Write for CSV.
As you can see, the result of histogram() is a pair of arrays, since
of course it's small. It's not necessary and in fact is huge overkill
to make it back into an RDD so you can save it across a bunch of
partitions.
This isn't a job for Spark, but simple Scala code. Off the top of my
head (maybe
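For example, a hedged sketch of writing the histogram out with plain Scala on the driver (the RDD name and output path are placeholders):

import java.io.PrintWriter

val (buckets, counts) = a.histogram(10)  // buckets has one extra element (the last upper bound)

val out = new PrintWriter("histogram.txt")
try {
  buckets.zip(counts).foreach { case (bucketStart, count) =>
    out.println(s"$bucketStart\t$count")
  }
} finally {
  out.close()
}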
It does not necessarily shuffle, yes. I believe it will not if you are
strictly reducing the number of partitions, and do not force a
shuffle. So I think the answer is 'yes'.
If you have a huge number of small files, you can also consider
wholeTextFiles, which gives you entire files of content in
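For illustration (the RDD, path, and partition count are placeholders):

// Strictly reducing partitions without a shuffle: each output partition
// simply reads several input partitions.
val fewer = manySmallPartitions.coalesce(100, shuffle = false)

// Or read whole files up front as (path, content) pairs.
val files = sc.wholeTextFiles("hdfs:///data/small-files")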
Heh, this question keeps coming up. You can't use a context or RDD
inside a distributed operation, only from the driver. Here you're
trying to call textFile from within foreachPartition.
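For illustration, the failing pattern versus a driver-side alternative (names and paths are placeholders):

// Fails: the SparkContext (and any RDD) can only be used on the driver.
rdd.foreachPartition { _ =>
  val other = sc.textFile("hdfs:///lookup.txt")  // not usable inside a task
}

// Works: read on the driver, then ship the (small) data to tasks via a broadcast.
val lookup = sc.broadcast(sc.textFile("hdfs:///lookup.txt").collect().toSet)
rdd.foreachPartition { records =>
  records.filter(lookup.value.contains).foreach(println)
}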
On Fri, Jan 23, 2015 at 10:59 AM, Nishant Patel
nishant.k.pa...@gmail.com wrote:
Below is code I have
I looked into the source code of SparkHadoopMapReduceUtil.scala. I think it is
broken in the following code:
def newTaskAttemptContext(conf: Configuration, attemptId: TaskAttemptID): TaskAttemptContext = {
  val klass = firstAvailableClass(
I am trying to work with nested Parquet data. Reading and writing the Parquet file
is actually working now, but when I try to query a nested field with SQLContext
I get an exception:
RuntimeException: Can't access nested field in type
ArrayType(StructType(List(StructField(...
I generate the parquet file
Hello,
I am trying to do a groupBy on 5 attributes to get results in a form like a
pivot table in Microsoft Excel. The keys are the attribute tuples and the
values are double arrays (maybe very large). Based on the code below, I am
getting back correct results, but would like to optimize it further (I
I also think the code is not robust enough. First, Spark works with hadoop1,
so why does the code try hadoop2 first? Also, the following code only handles
ClassNotFoundException; it should handle all the exceptions.
private def firstAvailableClass(first: String, second: String): Class[_] = {
  try
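For illustration, a hedged sketch of a more defensive lookup (this is not the actual Spark code):

private def firstAvailableClass(first: String, second: String): Class[_] =
  try {
    Class.forName(first)
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError =>
      Class.forName(second)  // fall back to the other variant
  }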
I found a workaround.
I can make my auxiliary data an RDD, partition it, and cache it.
Later, I can cogroup it with other RDDs and Spark will try to keep the
cached RDD partitions where they are and not shuffle them.
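For illustration, a minimal sketch of that workaround (the RDD names and partition count are placeholders):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.2

val partitioner = new HashPartitioner(32)

// Auxiliary pair RDD: partition it once and cache it so its blocks stay put.
val aux = auxiliaryData.partitionBy(partitioner).cache()
aux.count()  // materialize the cached partitions

// Cogrouping with the same partitioner avoids re-shuffling the cached side,
// and tasks tend to be scheduled where those cached partitions live.
val grouped = otherPairRdd.cogroup(aux, partitioner)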