Update: as expected, switching to Kryo merely delays the inevitable. Does
anyone have experience controlling memory consumption while processing
(e.g. writing out) imbalanced partitions?
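For what it's worth, a minimal sketch of one approach (my own, not from the thread, assuming the data can be shuffled): repartition the skewed RDD before writing, so no single task has to buffer an oversized partition. The paths and the target partition count below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object RebalanceBeforeWrite {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RebalanceBeforeWrite"))
        // Hypothetical skewed input; path and partition count are placeholders.
        val records = sc.textFile("hdfs:///input/skewed")
        // repartition() shuffles the data into roughly equal-sized partitions,
        // so no single write task holds a disproportionately large chunk in memory.
        records.repartition(200).saveAsTextFile("hdfs:///output/rebalanced")
        sc.stop()
      }
    }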
On 09-Aug-2014 10:41 am, "Bharath Ravi Kumar" wrote:
> Our prototype application reads a 20GB dataset from HDF
I have a simple JSON dataset as below. How do I query all parts.lock values
where id = 1?
JSON: { "id": 1, "name": "A green door", "price": 12.50, "tags": ["home",
"green"], "parts" : [ { "lock" : "One lock", "key" : "single key" }, {
"lock" : "2 lock", "key" : "2 key" } ] }
Query: select id,name,price,
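A sketch of one way to do this (my own, not from the thread), assuming Spark 1.1+ where jsonFile is available and a HiveContext so that LATERAL VIEW explode() can flatten the parts array; the path and table name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object NestedJsonQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("NestedJsonQuery"))
        val hive = new HiveContext(sc)

        // One JSON object per line; the path is a placeholder.
        hive.jsonFile("hdfs:///data/products.json").registerTempTable("products")

        // explode(parts) turns each element of the "parts" array into its own row,
        // so every lock belonging to id = 1 comes back as a separate result row.
        hive.sql(
          """SELECT id, name, price, p.lock
            |FROM products
            |LATERAL VIEW explode(parts) t AS p
            |WHERE id = 1""".stripMargin).collect().foreach(println)

        sc.stop()
      }
    }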
Hi
I intend to use the same Spark Streaming program for both real-time and
batch processing of my time-stamped data. However, with batch processing all
window-based operations would be meaningless because (I assume) the window
is defined by the arrival times of data and it is not possible to def
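To illustrate the arrival-time point (my own sketch, not from the thread; host and port are placeholders): window() groups records by when they arrive within each batch interval, not by any timestamp carried inside the record, so replaying historical data through the same code would collapse the windows.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ArrivalTimeWindows {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("ArrivalTimeWindows"), Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)
        // Counts records per 60-second window of *arrival*, sliding every 10 seconds.
        lines.window(Seconds(60), Seconds(10)).count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }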
Root partitions on AWS instances tend to be small (for example, an m1.large
instance has two 420 GB drives, but only a 10 GB root partition). Matei's
probably right on about this - just need to be careful where things like the
logs get stored.
From: Matei Zaharia <matei.zaha...@gmail.com>
I am wondering if I can use Spark to search for interesting
features/attributes for modelling. In fact, I just came from some
introductory sites about Vowpal Wabbit. I somehow like the idea of
out-of-core modelling.
Well, I have transactional data where customers purchased products
Your map-only job should not be shuffling, but if you want to see what's
running, look at the web UI at http://<driver>:4040. In fact the job should not
even write stuff to disk except inasmuch as the Hadoop S3 library might build
up blocks locally before sending them on.
My guess is that it's not /mnt
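For reference, a minimal sketch of the kind of map-only save being discussed (my own illustration; bucket names and the transformation are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object MapOnlyToS3 {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MapOnlyToS3"))
        // No *ByKey/join/etc., so there is no shuffle stage and no shuffle spill;
        // the Hadoop S3 layer may still buffer output blocks locally before uploading.
        sc.textFile("s3n://my-bucket/input")
          .map(_.toUpperCase)
          .saveAsTextFile("s3n://my-bucket/output")
        sc.stop()
      }
    }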
I've tried uploading a zip file that contains a CSV to HDFS and then
reading it into Spark using spark-shell, and the first line is all messed
up. However, when I upload a gzip file to HDFS and then read it into Spark
it does just fine. See output below:
Is there a way to read a zip file as is from HDFS in
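One possible workaround (my own sketch, assuming a Spark version that provides SparkContext.binaryFiles, and a single CSV entry per archive; paths are placeholders): read the archive as a binary stream and unzip it manually. The built-in text loading only understands compression codecs such as gzip, not zip containers, which is why the first line comes back garbled.

    import java.util.zip.ZipInputStream
    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadZipFromHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReadZipFromHdfs"))
        // binaryFiles yields (fileName, PortableDataStream) pairs, one per whole archive.
        val lines = sc.binaryFiles("hdfs:///data/archive.zip").flatMap {
          case (_, stream) =>
            val zis = new ZipInputStream(stream.open())
            zis.getNextEntry                      // position at the first (assumed only) entry
            Source.fromInputStream(zis).getLines().toList
        }
        lines.take(5).foreach(println)
        sc.stop()
      }
    }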
I have a related question. With Hadoop, I would do the same thing for
non-serializable objects and setup(). I also had a use case where it
was so expensive to initialize the non-serializable object that I
would make it a static member of the mapper, turn on JVM reuse across
tasks, and then preven
Thank you for your help. After restructuring my code according to Sean's input, it
worked without changing the Spark context. I then took the same file format, just
a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances and
Spark 1.0.2. Unluckily my task freezes again after a short time. I tried it
w
Although nobody has answered the two questions, in my practice it seems the
answer to both is yes.
2014-08-04 19:50 GMT+08:00 Fengyun RAO:
> object LogParserWrapper {
>   private val logParser = {
>     val settings = new ...   // construction details elided in the original mail
>     val builders = new ...   // construction details elided in the original mail
>     new LogParser(builders, settings)
>   }
> }
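To round out the discussion, a self-contained sketch of the pattern (my own illustration, not the quoted code): a singleton object's fields are initialised lazily, once per executor JVM, so an expensive non-serialisable resource is built once and reused by every task running in that JVM, much like the static-member-plus-JVM-reuse trick in Hadoop. The parser stand-in and paths are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object ExpensiveResourceHolder {
      // Built at most once per JVM, on first use from an executor; never serialized.
      lazy val splitter = java.util.regex.Pattern.compile("\\s+")  // stand-in for a costly LogParser
    }

    object LogJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LogJob"))
        val parsed = sc.textFile("hdfs:///logs/raw").mapPartitions { lines =>
          val splitter = ExpensiveResourceHolder.splitter  // resolved on the executor
          lines.map(line => splitter.split(line).mkString(","))
        }
        parsed.saveAsTextFile("hdfs:///logs/parsed")
        sc.stop()
      }
    }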
Hi
I'm trying to use a specific dir for the Spark working directory since I have
limited space at /tmp. I tried:
1)
export SPARK_LOCAL_DIRS="/mnt/data/tmp"
or 2)
SPARK_LOCAL_DIRS="/mnt/data/tmp" in spark-env.sh
But neither worked, since Spark's output still says
ERROR DiskBlockObjectWrit
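In case it helps, a minimal sketch (my own, not a confirmed fix for the error above) of setting the scratch directory programmatically through the spark.local.dir property; note that the directory has to exist on every node, and a SPARK_LOCAL_DIRS set by the cluster manager on the workers overrides this setting.

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalDirExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LocalDirExample")
          // Scratch space for shuffle files and spilled data; must exist on every node.
          .set("spark.local.dir", "/mnt/data/tmp")
        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }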