Re: OOM writing out sorted RDD

2014-08-09 Thread Bharath Ravi Kumar
Update: as expected, switching to Kryo merely delays the inevitable. Does anyone have experience controlling memory consumption while processing (e.g. writing out) imbalanced partitions? On 09-Aug-2014 10:41 am, "Bharath Ravi Kumar" wrote: > Our prototype application reads a 20GB dataset from HDF…
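A common mitigation for this kind of skew is to ask the sort itself for many more output partitions, so that even a hot key range yields partitions small enough to write without exhausting the heap. A minimal sketch, with an illustrative path and partition count:

    import org.apache.spark.SparkContext._

    // Sketch: request many output partitions from sortByKey so each
    // write task only handles a small slice of the skewed data.
    val pairs = sc.textFile("hdfs:///data/input")   // hypothetical path
      .map(line => (line.split('\t')(0), line))

    // numPartitions = 400 is illustrative; raise it until tasks fit in memory.
    pairs.sortByKey(ascending = true, numPartitions = 400)
      .saveAsTextFile("hdfs:///data/output")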

Spark SQL JSON dataset query nested datastructures

2014-08-09 Thread Sathish Kumaran Vairavelu
I have a simple JSON dataset as below. How do I query all parts.lock for id=1?

JSON:

    {
      "id": 1,
      "name": "A green door",
      "price": 12.50,
      "tags": ["home", "green"],
      "parts": [
        { "lock": "One lock", "key": "single key" },
        { "lock": "2 lock", "key": "2 key" }
      ]
    }

Query: select id, name, price, …
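One hedged way to reach inside the parts array is HiveQL's LATERAL VIEW explode, assuming a Spark 1.1-era HiveContext where jsonFile and registerTempTable are available; the path and table name are illustrative:

    import org.apache.spark.sql.hive.HiveContext

    // Sketch: explode the parts array into one row per element, then
    // project part.lock and filter on id. Names are assumptions.
    val hiveCtx = new HiveContext(sc)
    hiveCtx.jsonFile("hdfs:///data/products.json").registerTempTable("products")

    val locks = hiveCtx.sql("""
      SELECT id, name, price, part.lock
      FROM products
      LATERAL VIEW explode(parts) p AS part
      WHERE id = 1
    """)
    locks.collect().foreach(println)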

Overriding dstream window definition

2014-08-09 Thread Ruchir Jha
Hi, I intend to use the same Spark Streaming program for both real-time and batch processing of my time-stamped data. However, with batch processing, all window-based operations would be meaningless because (I assume) the window is defined by the arrival times of the data, and it is not possible to def…
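For reference, this is the arrival-time windowing in question; a minimal sketch with an illustrative source and durations:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: windows are cut by when records arrive, not by any
    // timestamp carried inside the records themselves.
    val ssc = new StreamingContext(sc, Seconds(10))
    val events = ssc.socketTextStream("localhost", 9999)  // illustrative source

    // Every 10 seconds, count the records that arrived in the last 60 seconds.
    events.window(Seconds(60), Seconds(10)).count().print()

    ssc.start()
    ssc.awaitTermination()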

Re: No space left on device

2014-08-09 Thread Jim Donahue
Root partitions on AWS instances tend to be small (for example, an m1.large instance has two 420 GB instance-store drives but only a 10 GB root partition). Matei's probably right about this - you just need to be careful about where things like the logs get stored. From: Matei Zaharia …

feature space search

2014-08-09 Thread filipus
I am wondering if I can use Spark to search for interesting features/attributes for modelling. In fact, I have just come from some introductory sites about Vowpal Wabbit. I somehow like the idea of out-of-core modelling. Well, I have transactional data where customers purchased products…

Re: No space left on device

2014-08-09 Thread Matei Zaharia
Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://<driver>:4040. In fact the job should not even write stuff to disk except inasmuch as the Hadoop S3 library might build up blocks locally before sending them on. My guess is that it's not /mnt…
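As a hedged illustration of the map-only shape being described (bucket paths are assumptions), a pure map followed by a save involves only narrow transformations, so no shuffle files land on local disk:

    // Sketch: map + save triggers no shuffle; any local disk activity
    // comes from the S3 output path buffering blocks before upload.
    val cleaned = sc.textFile("s3n://my-bucket/input/*")
      .map(_.toLowerCase)
    cleaned.saveAsTextFile("s3n://my-bucket/output")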

How to read zip files from HDFS into spark-shell using scala

2014-08-09 Thread Alton Alexander
I've tried uploading a zip file that contains a CSV to HDFS and then reading it into Spark using spark-shell, and the first line is all messed up. However, when I upload a gzip to HDFS and then read it into Spark, it works just fine. See output below. Is there a way to read a zip file as-is from HDFS in…
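Hadoop's TextInputFormat decompresses gzip transparently through its codec support, but zip is an archive format with no codec, which is why the bytes come back garbled. A hedged sketch of unpacking zips by hand (sc.binaryFiles arrived in Spark 1.2, after this thread; the path is illustrative):

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // Sketch: read each zip as one binary blob, then walk its entries
    // manually with ZipInputStream.
    val lines = sc.binaryFiles("hdfs:///data/*.zip").flatMap { case (path, stream) =>
      val zis = new ZipInputStream(stream.open())
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap(_ => Source.fromInputStream(zis).getLines().toList) // force full read per entry
    }
    lines.take(5).foreach(println)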

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-09 Thread Kevin James Matzen
I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup(). I also had a use case where it was so expensive to initialize the non-serializable object that I would make it a static member of the mapper, turn on JVM reuse across tasks, and then preven…
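For context, the usual Spark analogue of Hadoop's setup() is per-partition initialization; a minimal sketch with a hypothetical parser class standing in for the expensive object:

    // Stub standing in for an expensive, non-serializable resource.
    class ExpensiveParser {
      def parse(line: String): String = line.trim
    }

    // Sketch: construct the object once per task inside mapPartitions,
    // the Spark counterpart of initializing in Hadoop's setup().
    val parsed = sc.textFile("hdfs:///logs").mapPartitions { lines =>
      val parser = new ExpensiveParser()
      lines.map(parser.parse)
    }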

Re: KMeans Input Format

2014-08-09 Thread AlexanderRiggers
Thank you for your help. After restructuring my code per Sean's input, it worked without changing the Spark context. I now took the same file format, just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried it w…
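For reference, a minimal sketch of the MLlib KMeans flow under discussion; the S3 path, k, and iteration count are illustrative:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: parse a CSV of numeric features, cache it, and train KMeans.
    val points = sc.textFile("s3n://my-bucket/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    val model = KMeans.train(points, k = 10, maxIterations = 20)
    println(s"Within-set sum of squared errors: ${model.computeCost(points)}")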

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-09 Thread Fengyun RAO
Although nobody has answered the two questions, in my practice it seems both are yes. 2014-08-04 19:50 GMT+08:00 Fengyun RAO:

    > object LogParserWrapper {
    >   private val logParser = {
    >     val settings = new ...
    >     val builders = new ...
    >     new LogParser(builders, settings)
    >   }
    > }
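A self-contained sketch of that per-JVM singleton pattern, with a stub type standing in for the thread's LogParser and illustrative names throughout:

    // Stub for the thread's non-serializable, expensive-to-build parser.
    class LogParser(settings: String) {
      def parse(line: String): String = line.split(' ').head // placeholder logic
    }

    // A Scala object is initialized at most once per executor JVM, so all
    // tasks on the same worker share one parser and nothing is shipped
    // from the driver.
    object LogParserWrapper {
      lazy val parser = new LogParser("settings")
    }

    val parsed = sc.textFile("hdfs:///logs").map(line => LogParserWrapper.parser.parse(line))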

set SPARK_LOCAL_DIRS issue

2014-08-09 Thread Baoqiang Cao
Hi, I'm trying to use a specific dir as the Spark working directory since I have limited space at /tmp. I tried: 1) export SPARK_LOCAL_DIRS=“/mnt/data/tmp” or 2) SPARK_LOCAL_DIRS=“/mnt/data/tmp” in spark-env.sh. But neither worked, since Spark's output still says ERROR DiskBlockObjectWrit…
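An alternative sketch that sidesteps spark-env.sh by setting the equivalent spark.local.dir property in code; the app name and path are illustrative, and note the plain ASCII quotes (curly quotes pasted into a shell file would be taken literally):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: spark.local.dir is the programmatic counterpart of
    // SPARK_LOCAL_DIRS and redirects shuffle/scratch files away from /tmp.
    val conf = new SparkConf()
      .setAppName("LocalDirExample")
      .set("spark.local.dir", "/mnt/data/tmp")
    val sc = new SparkContext(conf)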