Re: Spark Function setup and cleanup

2014-07-26 Thread Yosi Botzer
Thank you, but that doesn't answer my general question. I might need to enrich my records using different data sources (or DBs), so the general use case I need to support is to have some kind of Function that has init() logic for creating a connection to the DB, query the DB for each record, and enrich

Re: Spark Function setup and cleanup

2014-07-26 Thread Sean Owen
Look at mapPartitions. Whereas map turns one value V1 into one value V2, mapPartitions lets you turn one entire Iterator[V1] into one whole Iterator[V2]. The function that does so can perform some initialization at its start, then process all of the values, and clean up at its end. This is how
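A minimal sketch of that setup/process/cleanup pattern, assuming a hypothetical enrich() helper, an input RDD called records, and a plain JDBC connection (none of these names are from the thread):

    import java.sql.DriverManager

    val enriched = records.mapPartitions { iter =>
      // setup: open one connection per partition, not per record
      val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb")
      // process: enrich every record in the partition with a DB lookup
      val out = iter.map(record => enrich(record, conn)).toList
      // cleanup: close the connection once the partition has been fully consumed
      conn.close()
      out.iterator
    }

Forcing the iterator with toList before closing the connection avoids reading from a closed connection through the lazy map, at the cost of materializing the partition in memory.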

Re: Emacs Setup Anyone?

2014-07-26 Thread Prashant Sharma
Normally any setup that has an inferior mode for the Scala REPL will also support the Spark REPL (with little or no modification). Apart from that, I personally use the Spark REPL by invoking spark-shell in a shell inside Emacs, and I keep the Scala tags (etags) for Spark loaded. With this setup it is

How can I integrate spark cluster into my own program without using spark-submit?

2014-07-26 Thread Lizhengbing (bing, BIPA)
I want to use a Spark cluster through a Scala function, so I can integrate Spark into my program directly. For example: when I call a count function in my own program, my program will deploy the function to the cluster, so I can get the result directly: def count() = { val master =
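A rough sketch of what such an embedded count function can look like without spark-submit, building the SparkContext directly (the master URL, jar path, and input path are placeholders only):

    import org.apache.spark.{SparkConf, SparkContext}

    def count(): Long = {
      val conf = new SparkConf()
        .setMaster("spark://master-host:7077")      // placeholder standalone cluster URL
        .setAppName("embedded-count")
        .setJars(Seq("target/my-app-assembly.jar")) // ship the program's jar to the executors
      val sc = new SparkContext(conf)
      try sc.textFile("hdfs:///data/input").count()
      finally sc.stop()
    }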

Exception : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table

2014-07-26 Thread Bilna Govind
Hi, this is my first code in Shark 0.9.1. I am new to Spark and Shark, so I don't know where I went wrong. It would be really helpful if someone out there could troubleshoot the problem. First of all, I will give a glimpse of my code, which is developed in IntelliJ IDEA. This code is

Help using streaming from Spark Shell

2014-07-26 Thread Yana Kadiyska
Hi, I'm starting spark-shell like this: SPARK_MEM=1g SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=3600 /spark/bin/spark-shell -c 3, but when I try to create a streaming context with val scc = new StreamingContext(sc, Seconds(10)) I get: org.apache.spark.SparkException: Spark Streaming cannot be used
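For reference, a minimal sketch of the setup the post describes: reusing the shell's sc to build a StreamingContext (the socket source below is only an illustration, not from the thread):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val scc = new StreamingContext(sc, Seconds(10))   // sc is the shell's existing SparkContext
    val lines = scc.socketTextStream("localhost", 9999)
    lines.print()
    scc.start()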

Lot of object serialization even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Hello, I am executing the SparkPageRank example. It uses the cache() API for persistence of RDDs, and if I am not wrong, that in turn uses the MEMORY_ONLY storage level. However, the oprofile report shows a lot of activity in the writeObject0 function. There is not even a single Spilling in-memory...

Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Hello, I am running the SparkPageRank example, which uses the cache() API for persistence. This, AFAIK, uses the MEMORY_ONLY storage level. But even in this setup, I see a lot of WARN ExternalAppendOnlyMap: Spilling in-memory map of messages in the log. Why is it so? I thought that MEMORY_ONLY means kick
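As a side note on that storage level, cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY); a small sketch (the input path and parsing are illustrative, not from the example):

    import org.apache.spark.storage.StorageLevel

    val links = sc.textFile("hdfs:///pagerank/links.txt")
      .map(line => line.split("\\s+"))
    links.cache()
    // equivalent to: links.persist(StorageLevel.MEMORY_ONLY)
    // (calling both on the same RDD fails, since the level can only be set once)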

graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
I have GraphX queries running inside a service where I collect the results to the driver and do not hold any references to the RDDs involved in the queries. My assumption was that with the references gone, Spark would go and remove the cached RDDs from memory (note, I did not cache them, GraphX

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
These messages are actually not about spilling the RDD; they're about spilling intermediate state in a reduceByKey, groupBy, or other operation whose state doesn't fit in memory. We have to do that in these cases to avoid running out of memory. You can minimize spilling by having more reduce tasks
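A small sketch of that suggestion about more reduce tasks, using an arbitrary partition count of 200 (the data and parsing are placeholders):

    val pairs = sc.textFile("hdfs:///data/edges")
      .map(line => (line.split("\t")(0), 1))

    // default partitioning: fewer, larger reduce-side maps that may spill
    val coarse = pairs.reduceByKey(_ + _)

    // explicit numPartitions: the same shuffle state spread over more, smaller tasks
    val finer = pairs.reduceByKey(_ + _, 200)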

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Thanks for the reply. I understand this now. But in another situation, when I use a large heap size to avoid any spilling (I confirm there are no spilling messages in the log), I still see a lot of time being spent in the writeObject0() function. Can you please tell me why there would be any serialization

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
Even in local mode, Spark serializes data that would be sent across the network, e.g. in a reduce operation, so that you can catch errors that would happen in distributed mode. You can make serialization much faster by using the Kryo serializer; see
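A minimal sketch of enabling Kryo as suggested (the registrator class is a hypothetical example; registration is optional but lets Kryo write compact class ids instead of full class names):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("PageRank")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "myapp.MyKryoRegistrator") // optional, hypothetical class
    val sc = new SparkContext(conf)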

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-26 Thread Tathagata Das
Yeah, maybe I should bump the issue to major. Now that I've thought about it since giving my previous answer, this should be easy to fix just by doing a foreachRDD on all the input streams within the system (rather than explicitly doing it as I asked you to). Thanks, Alan, for testing this out and
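A hedged sketch of the workaround referred to above: attaching a no-op foreachRDD to the input stream so it is registered as an output and computed every batch (the stream source and intervals are illustrative, not from the thread):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val input = ssc.socketTextStream("localhost", 9999)
    input.foreachRDD(rdd => ())   // no-op output forces the input stream to run each batch
    input.window(Seconds(30), Seconds(10)).count().print()
    ssc.start()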

Re: SparkContext startup time out

2014-07-26 Thread Anand Avati
I am bumping into this problem as well. I am trying to move to Akka 2.3.x from 2.2.x in order to port to Scala 2.11 - only Akka 2.3.x is available for Scala 2.11. All 2.2.x Akka versions work fine, and all 2.3.x Akka versions give the following exception in new SparkContext. Still investigating why...

Re: java.lang.StackOverflowError when calling count()

2014-07-26 Thread Tathagata Das
Responses inline. On Wed, Jul 23, 2014 at 4:13 AM, lalit1303 la...@sigmoidanalytics.com wrote: Hi, Thanks TD for your reply. I am still not able to resolve the problem for my use case. I have, let's say, 1000 different RDDs, and I am applying a transformation function on each RDD and I want

Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
BIDMach is a CPU- and GPU-accelerated machine learning library, also from Berkeley. https://github.com/BIDData/BIDMach/wiki/Benchmarks They did benchmark against Spark 0.9, and they claimed that it's significantly faster than Spark MLlib. In Spark 1.0, a lot of performance optimization has been done,