Thank you, but that doesn't answer my general question.
I might need to enrich my records using different data sources (or DBs).
So the general use case I need to support is to have some kind of Function
that has init() logic for creating a connection to the DB, queries the DB
for each record, and enriches it.
Look at mapPartitions. Whereas map turns one value V1 into one value
V2, mapPartitions lets you turn one entire Iterator[V1] into one whole
Iterator[V2]. The function that does so can perform some
initialization at its start, then process all of the values, and
clean up at its end. This is how you can, for example, open one database
connection per partition instead of one per record.
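As a sketch of that shape (init, process all values, clean up), here is the kind of function you would pass to mapPartitions, written in plain Scala with a hypothetical in-memory map standing in for a real datasource client:

```scala
// Sketch of the Iterator-in/Iterator-out function mapPartitions expects.
// The HashMap stands in for a real connection opened in init logic.
def enrichPartition(records: Iterator[String]): Iterator[String] = {
  val conn = new java.util.HashMap[String, String]()      // stand-in for openConnection()
  conn.put("a", "alpha")                                  // pretend lookup table
  // Materialize before cleanup: a lazy iterator would outlive the connection.
  val out = records.map(r => r + "->" + Option(conn.get(r)).getOrElse("?")).toList
  conn.clear()                                            // stand-in for conn.close()
  out.iterator
}
// On an RDD this would be applied as: rdd.mapPartitions(enrichPartition)
```

Note the materialization step: if you return a lazy iterator and close the connection immediately, the lookups would run after the connection is gone.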
Normally any setup that has an inferior mode for the Scala REPL will also
support the Spark REPL (with little or no modification).
Apart from that, I personally use the Spark REPL by invoking
spark-shell in a shell inside Emacs, and I keep the Scala tags (etags) for
the Spark sources loaded. With this setup it is
I want to use the Spark cluster through a Scala function, so I can integrate
Spark into my program directly.
For example:
When I call a count function in my own program, the program will deploy the
function to the cluster, so I can get the result directly:
def count()=
{
val master =
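A minimal sketch of that idea, assuming the Spark artifacts are on the program's classpath; the master URL and input path below are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "spark://host:7077" and the HDFS path are placeholders.
def count(): Long = {
  val conf = new SparkConf()
    .setMaster("spark://host:7077")
    .setAppName("embedded-count")
  val sc = new SparkContext(conf)
  try sc.textFile("hdfs://host/path/input.txt").count()  // runs on the cluster
  finally sc.stop()                                      // always release the context
}
```

Any code that calls count() will then drive the cluster directly, with no separate spark-submit step.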
Hi,
This is my first code in Shark 0.9.1. I am new to Spark and Shark, so I
don't know where I went wrong. It would be really helpful if someone out
there could troubleshoot the problem.
First of all, I will give a glimpse of my code, which was developed in
IntelliJ IDEA. This code is
Hi,
I'm starting spark-shell like this:
SPARK_MEM=1g SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=3600
/spark/bin/spark-shell -c 3
but when I try to create a streaming context
val scc = new StreamingContext(sc, Seconds(10))
I get:
org.apache.spark.SparkException: Spark Streaming cannot be used
Hello,
I am executing the SparkPageRank example. It uses the cache() API for
persistence of RDDs, which, if I am not wrong, in turn uses the MEMORY_ONLY
storage level. However, the oprofile report shows a lot of activity in the
writeObject0 function.
There is not even a single Spilling in-memory...
Hello,
I am running the SparkPageRank example, which uses the cache() API for
persistence. This, AFAIK, uses the MEMORY_ONLY storage level. But even in
this setup, I see a lot of "WARN ExternalAppendOnlyMap: Spilling in-memory
map" messages in the log. Why is that? I thought that MEMORY_ONLY means kick
I have GraphX queries running inside a service where I collect the results
to the driver and do not hold any references to the RDDs involved in the
queries. My assumption was that with the references gone, Spark would
remove the cached RDDs from memory (note, I did not cache them; GraphX
These messages are actually not about spilling the RDD; they're about spilling
intermediate state in a reduceByKey, groupBy, or other operation whose state
doesn't fit in memory. We have to do that in these cases to avoid running out
of memory. You can minimize spilling by having more reduce tasks
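That last point can be done by passing an explicit task count to the shuffle operation; a sketch, where `pairs` stands for a hypothetical RDD of key/value pairs:

```scala
// More reduce tasks => smaller per-task state => less spilling.
val counts = pairs.reduceByKey(_ + _, 200)  // 200 reduce tasks instead of the default
// groupByKey, join, etc. accept the same optional numPartitions argument.
```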
Thanks for the reply; I understand this now.
But in another situation, when I use a large heap size to avoid any spilling
(I have confirmed there are no spilling messages in the log), I still see a
lot of time being spent in the writeObject0() function. Can you please tell
me why there would be any serialization
Even in local mode, Spark serializes data that would be sent across the
network, e.g. in a reduce operation, so that you can catch errors that would
happen in distributed mode. You can make serialization much faster by using the
Kryo serializer; see
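Switching to Kryo is a configuration change; a sketch, where `com.example.MyRegistrator` is a hypothetical registrator class you would write yourself:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register your classes for more compact output:
  .set("spark.kryo.registrator", "com.example.MyRegistrator")
```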
Yeah, maybe I should bump the issue to major. Now that I have thought about
my previous answer, this should be easy to fix just by doing a
foreachRDD on all the input streams within the system (rather than
doing it explicitly, as I asked you to do).
Thanks, Alan, for testing this out and
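For reference, forcing evaluation of an input stream with foreachRDD looks like this; a sketch, where `ssc` is an existing StreamingContext and the socket source is hypothetical:

```scala
// Touch every batch so the stream is materialized even with no other output op.
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd => rdd.foreach(_ => ()) }
```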
I am bumping into this problem as well. I am trying to move from Akka 2.2.x
to 2.3.x in order to port to Scala 2.11; only Akka 2.3.x is available for
Scala 2.11. All 2.2.x Akka versions work fine, and all 2.3.x versions give
the following exception in new SparkContext. Still investigating why...
Responses inline.
On Wed, Jul 23, 2014 at 4:13 AM, lalit1303 la...@sigmoidanalytics.com wrote:
Hi,
Thanks TD for your reply. I am still not able to resolve the problem for my
use case.
I have, let's say, 1000 different RDDs, and I am applying a transformation
function on each RDD, and I want
BIDMach is a CPU- and GPU-accelerated machine learning library, also from
Berkeley.
https://github.com/BIDData/BIDMach/wiki/Benchmarks
They benchmarked against Spark 0.9 and claimed that it is
significantly faster than Spark MLlib. In Spark 1.0, a lot of
performance optimization has been done,