Hi All,
I am interested to collect() a large RDD so that I can run a learning
algorithm on it. I've noticed that when I don't increase
SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it
looks like the same fraction of memory is reserved for storage on the
driver as on the
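To make the setup concrete, here is a minimal sketch of the pattern I mean (the input path, app name, and the 8g figure are placeholders, not my actual values):

    # driver memory can be raised via the environment, e.g.
    #   export SPARK_DRIVER_MEMORY=8g
    # or with: spark-submit --driver-memory 8g collect_job.py
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("collect-large-rdd"))

    rdd = sc.textFile("hdfs:///data/features/*")   # hypothetical input
    records = rdd.collect()   # this is the step that needs driver memory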
I've had multiple jobs crash due to java.io.IOException: unexpected
exception type; I've been running the 1.1 branch for some time and am now
running the 1.1 release binaries. Note that I only use PySpark. I haven't
kept detailed notes or the tracebacks around since there are other problems
that
26, 2014 at 1:32 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
I've had multiple jobs crash due to java.io.IOException: unexpected
exception type; I've been running the 1.1 branch for some time and am now
running the 1.1 release binaries. Note that I only use PySpark. I haven't
kept
dav...@databricks.com wrote:
On Thu, Sep 25, 2014 at 11:25 AM, Brad Miller
bmill...@eecs.berkeley.edu wrote:
Hi Davies,
Thanks for your help.
I ultimately re-wrote the code to use broadcast variables, and then
received
an error when trying to broadcast self.all_models that the size did
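For illustration, a minimal sketch of the broadcast-variable version (self.all_models stands in for the large shared object; data_rdd and the predict call are placeholders, not the actual code):

    # inside a method of the class: broadcast the models once
    # rather than capturing self in each closure
    models_bc = sc.broadcast(self.all_models)

    def score(record):
        models = models_bc.value   # read on the workers
        return [m.predict(record) for m in models]

    predictions = data_rdd.map(score)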
, Brad Miller
bmill...@eecs.berkeley.edu wrote:
Hi Davies,
That's interesting to know. Here's more details about my code. The
object
(self) contains pointers to the spark_context (which seems to generate
errors during serialization) so I strip off the extra state using the
outer
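Concretely, the stripping looks something like this (a sketch; the attribute and method names are invented):

    import copy

    def strip_context(obj):
        # shallow copy, then drop the fields that will not pickle,
        # in particular the SparkContext handle
        slim = copy.copy(obj)
        slim.spark_context = None
        return slim

    slim_self = strip_context(self)
    results = data_rdd.map(lambda rec: slim_self.process(rec))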
Hi All,
I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
java.util.NoSuchElementException: key not found: xyz, where xyz
denotes the ID of some particular task. I've excerpted the log from
one job that died in this way below
Hi Andrew,
I agree with Nicholas. That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors. I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently represented
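As a concrete example of the knobs in question (a sketch; the values shown are just the documented defaults in milliseconds, not a recommendation):

    # spark-defaults.conf
    spark.locality.wait          3000
    spark.locality.wait.node     3000
    spark.locality.wait.rack     3000

The indicators referred to above are the per-task locality levels reported in the UI (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY).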
= SchemaRDD(srdd._schema_rdd.coalesce(N, False, None), sqlCtx)
On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I'm having some trouble with the coalesce and repartition functions for
SchemaRDD objects in pyspark. When I run:
sqlCtx.jsonRDD
Hi All,
I'm having some trouble with the coalesce and repartition functions for
SchemaRDD objects in pyspark. When I run:
sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}',
'{"foo": "baz"}'])).coalesce(1)
I get this error:
Py4JError: An error occurred while calling o94.coalesce. Trace:
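The workaround quoted a few lines up, written out a bit more fully (a sketch; N is the desired partition count, and the JSON records are the same toy examples):

    from pyspark.sql import SchemaRDD

    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}']))

    # go through the underlying Java SchemaRDD and re-wrap the result
    N = 1
    coalesced = SchemaRDD(srdd._schema_rdd.coalesce(N, False, None), sqlCtx)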
My approach may be partly influenced by my limited experience with SQL and
Hive, but I just converted all my dates to seconds-since-epoch and then
selected samples from specific time ranges using integer comparisons.
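A sketch of what that looks like in practice (the table and column names are invented for illustration):

    import calendar, datetime

    def to_epoch(dt):
        # seconds since the epoch for a UTC datetime
        return calendar.timegm(dt.timetuple())

    start = to_epoch(datetime.datetime(2014, 9, 1))
    end = to_epoch(datetime.datetime(2014, 9, 8))
    week = sqlCtx.sql(
        "SELECT * FROM events WHERE ts >= %d AND ts < %d" % (start, end))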
On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao hao.ch...@intel.com wrote:
There
to back a table in a SQLContext.
On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen quasi...@gmail.com wrote:
Hi Brad,
When you do the conversion is this a Hive/Spark job or is it a
pre-processing step before loading into HDFS?
---Ben
On Fri, Sep 5, 2014 at 10:29 AM, Brad Miller bmill
How did you specify the HDFS path? When I put
spark.eventLog.dir hdfs://
crosby.research.intel-research.net:54310/tmp/spark-events
in my spark-defaults.conf file, I receive the following error:
An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
:
Hi All,
Yesterday I restarted my cluster, which had the effect of clearing /tmp.
When I brought Spark back up and ran my first job, /tmp/spark-events was
re-created and the job ran fine. I later learned that other users were
receiving errors when trying to create a spark context. It turned out
Hi All,
@Andrew
Thanks for the tips. I just built the master branch of Spark last
night, but am still having problems viewing history through the
standalone UI. I dug into the Spark job events directories as you
suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and
'EVENT_LOG_1'; for
Hi Andrew,
I'm running something close to the present master (I compiled several days
ago) but am having some trouble viewing history.
I set spark.eventLog.dir to true, but continually receive the error
message (via the web UI) Application history not found...No event logs
found for application
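For reference, the two event-log settings involved, as I understand them (a sketch; the directory is a placeholder and has to exist before the application starts):

    # spark-defaults.conf
    spark.eventLog.enabled   true
    spark.eventLog.dir       hdfs:///tmp/spark-events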
Hi All,
I have a Spark job for which I need to increase the amount of memory
allocated to the driver to collect a large-ish (200M) data structure.
Formerly, I accomplished this by setting SPARK_MEM before invoking my
job (which effectively set memory on the driver) and then setting
Hi All,
I'm having some trouble setting the disk spill directory for spark. The
following approaches set spark.local.dir (according to the Environment
tab of the web UI) but produce the indicated warnings:
*In spark-env.sh:*
export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill
*Associated
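For comparison, the other places this setting can go (a sketch; /spark/spill is just the path used above):

    # spark-defaults.conf
    spark.local.dir    /spark/spill

    # or per application, before the context is created
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().set("spark.local.dir", "/spark/spill")
    sc = SparkContext(conf=conf)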
Hi All,
I'm having a bit of trouble with nested data structures in pyspark with
saveAsParquetFile. I'm running master (as of yesterday) with this pull
request added: https://github.com/apache/spark/pull/1802.
*# these all work*
sqlCtx.jsonRDD(sc.parallelize(['{record:
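A complete, if invented, example of the general pattern (the record contents and output path are made up; these are not the exact cases from the original message):

    srdd = sqlCtx.jsonRDD(sc.parallelize(
        ['{"record": {"name": "a", "vals": [1, 2]}}',
         '{"record": {"name": "b", "vals": [3]}}']))
    srdd.saveAsParquetFile("/tmp/nested.parquet")   # hypothetical output path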
Thanks Yin!
best,
-Brad
On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai yh...@databricks.com wrote:
Hi Brad,
It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908
to track it. It will be fixed soon.
Thanks,
Yin
On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill
Hi All,
I have a data set where each record is serialized using JSON, and I'm
interested to use SchemaRDDs to work with the data. Unfortunately I've hit
a snag since some fields in the data are maps and list, and are not
guaranteed to be populated for each record. This seems to cause
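To make the shape of the data concrete, a small invented example in which only some records carry the map and list fields:

    records = sc.parallelize(
        ['{"id": 1, "tags": ["a", "b"], "meta": {"src": "web"}}',
         '{"id": 2}',
         '{"id": 3, "tags": []}'])
    srdd = sqlCtx.jsonRDD(records)
    srdd.printSchema()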
.
Nick
On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I have a data set where each record is serialized using JSON, and I'm
interested to use SchemaRDDs to work with the data. Unfortunately I've hit
a snag since some fields in the data are maps
Hi All,
I am interested to use jsonRDD and jsonFile to create a SchemaRDD out of
some JSON data I have, but I've run into some instability involving the
following java exception:
An error occurred while calling o1326.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Got it. Thanks!
On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Notice the difference in the schema. Are you running the 1.0.1 release,
or a more bleeding-edge version from the repository?
Yep, my bad. I’m running off master at commit
running this on master or the 1.1-RC which
should be coming out this week. Pyspark did not have good support for
nested data previously. If you still encounter issues using a more recent
version, please file a JIRA. Thanks!
On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller bmill...@eecs.berkeley.edu
the
complete schema.
Does this work for you?
Nick
On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I have a data set where each record is serialized using JSON, and I'm
interested to use SchemaRDDs to work with the data. Unfortunately I've
hit
-2376
best,
-brad
On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu dav...@databricks.com wrote:
This sample argument of inferSchema is still not in master; I will
try to add it if it makes
sense.
On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi Davies,
Thanks
anybody else verify that the second example still crashes (and is meant
to work)? If so, would it be best to modify JIRA-2376 or start a new bug?
https://issues.apache.org/jira/browse/SPARK-2376
best,
-Brad
On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Nick
]])]
Can’t answer your question about branch stability, though. Spark is a
very active project, so stuff is happening all the time.
Nick
On Tue, Aug 5, 2014 at 7:20 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi Nick,
Can you check that the call to collect() works as well
the proper schema. We will take a look at it.
Thanks,
Yin
On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Assuming updating to master fixes the bug I was experiencing with jsonRDD
and jsonFile, then pushing sample to master will probably not be
necessary
take the data back to the Python side, SchemaRDD#javaToPython
failed on your cases. I have created
https://issues.apache.org/jira/browse/SPARK-2875 to track it.
Thanks,
Yin
On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I checked out and built master
Hi All,
Congrats to the entire Spark team on the 1.0.1 release.
In checking out the new features, I noticed that it looks like the
python API docs have been updated, but the title and the header at
the top of the page still say Spark 1.0.0. Clearly not a big
deal... I just wouldn't want anyone
-- Forwarded message --
From: Brad Miller bmill...@eecs.berkeley.edu
Date: Mon, Jun 30, 2014 at 10:20 AM
Subject: odd caching behavior or accounting
To: user@spark.apache.org
Hi All,
I've recently noticed some caching behavior which I did not understand
and may or may not have indicated
Hi All,
I am attempting to develop some unit tests for a program using pyspark and
scikit-learn and I've come across some weird behavior. I receive the
following warning during some tests: python/pyspark/serializers.py:327:
DeprecationWarning: integer argument expected, got float.
Although it's
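The rough shape of the tests, for context (a sketch; the real assertions and scikit-learn models are omitted):

    import unittest
    from pyspark import SparkContext

    class ModelTest(unittest.TestCase):
        def setUp(self):
            # a small local context is enough for unit tests
            self.sc = SparkContext("local[2]", "unit-tests")

        def tearDown(self):
            self.sc.stop()

        def test_counts(self):
            rdd = self.sc.parallelize([(1, 2.0), (2, 3.0)])
            self.assertEqual(rdd.count(), 2)

    if __name__ == "__main__":
        unittest.main()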
Hi All,
I have experienced some crashing behavior with join in pyspark. When I
attempt a join with 2000 partitions in the result, the join succeeds, but
when I use only 200 partitions in the result, the join fails with the
message Job aborted due to stage failure: Master removed our application:
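The call itself is nothing exotic; roughly (a sketch, with rdd_a and rdd_b standing in for the real keyed RDDs):

    # succeeds with 2000 output partitions, fails with 200
    joined = rdd_a.join(rdd_b, numPartitions=2000)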
://issues.apache.org/jira/browse/SPARK-2021 to track
this — it’s something we’ve been meaning to look at soon.
Matei
On Jun 4, 2014, at 8:23 AM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I have experienced some crashing behavior with join in pyspark. When I
attempt a join with 2000
I would echo much of what Andrew has said.
I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk
space dedicated to spark, data storage in separate HDFS shares). I've
been using spark since 0.7, and as with Andrew I've observed
significant and consistent improvements in stability
4. Shuffle on disk
Is it true - I couldn't find it in official docs, but did see this mentioned
in various threads - that shuffle _always_ hits disk? (Disregarding OS
caches.) Why is this the case? Are you planning to add a function to do
shuffle in memory or are there some intrinsic reasons
On Tue, Apr 8, 2014 at 2:56 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I poked around a bit more to (1) confirm my suspicions that the crash
was related to memory consumption and (2) figure out why there is no
error shown in 12_stderr, the spark executor log file from
Hi All,
When I run the program shown below, I receive the error shown below.
I am running the current version of branch-0.9 from github. Note that
I do not receive the error when I replace 2 ** 29 with 2 ** X,
where X < 29. More interestingly, I do not receive the error when X =
30, and when X