driver memory management

2014-09-28 Thread Brad Miller
Hi All, I am interested to collect() a large RDD so that I can run a learning algorithm on it. I've noticed that when I don't increase SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it looks like the same fraction of memory is reserved for storage on the driver as on the
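
A minimal sketch of the settings under discussion, assuming a Spark 1.x standalone launch; the sizes are placeholders, and note that driver memory must be fixed before the driver JVM starts, so spark-defaults.conf or the launch command is usually the right place:

    from pyspark import SparkConf, SparkContext

    # Placeholder values; spark.driver.memory only takes effect if set
    # before the driver JVM is created (e.g. via spark-defaults.conf).
    conf = (SparkConf()
            .set("spark.driver.memory", "4g")
            .set("spark.storage.memoryFraction", "0.3"))  # fraction reserved for caching
    sc = SparkContext(conf=conf)

    result = sc.parallelize(range(1000)).collect()  # collect() lands in driver memory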

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
I've had multiple jobs crash due to java.io.IOException: unexpected exception type; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept detailed notes or the tracebacks around since there are other problems that

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
26, 2014 at 1:32 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I've had multiple jobs crash due to java.io.IOException: unexpected exception type; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-26 Thread Brad Miller
dav...@databricks.com wrote: On Thu, Sep 25, 2014 at 11:25 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks for your help. I ultimately re-wrote the code to use broadcast variables, and then received an error when trying to broadcast self.all_models that the size did

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-25 Thread Brad Miller
, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, That's interesting to know. Here's more details about my code. The object (self) contains pointers to the spark_context (which seems to generate errors during serialization) so I strip off the extra state using the outer
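
The pattern described, in minimal form; the class and attribute names here are hypothetical, but the idea is that closures shipped to workers must not capture the object holding the SparkContext:

    class ModelTrainer(object):  # hypothetical stand-in for the object in question
        def __init__(self, sc):
            self.sc = sc                  # SparkContext: fails pickling
            self.params = {"alpha": 0.1}  # plain state: pickles fine

        def train(self, rdd):
            params = self.params  # copy picklable state into a local variable
            # the lambda captures only `params`, never `self` or `self.sc`
            return rdd.map(lambda x: (x, params))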

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All, I suspect I am experiencing a bug. I've noticed that while running larger jobs, they occasionally die with the exception java.util.NoSuchElementException: key not found: xyz, where xyz denotes the ID of some particular task. I've excerpted the log from one job that died in this way below

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew, I agree with Nicholas. That was a nice, concise summary of the meaning of the locality customization options, indicators and default Spark behaviors. I haven't combed through the documentation end-to-end in a while, but I'm also not sure that information is presently represented
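
For reference, the main knob behind those behaviors is spark.locality.wait; a minimal sketch, using the Spark 1.x millisecond form and a placeholder value:

    from pyspark import SparkConf, SparkContext

    # how long the scheduler holds out for a more local slot (PROCESS_LOCAL,
    # then NODE_LOCAL, then RACK_LOCAL) before falling back a level
    conf = SparkConf().set("spark.locality.wait", "3000")
    sc = SparkContext(conf=conf)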

Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
= SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx) On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD
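
The workaround quoted above, written out as a helper; it relies on the private _schema_rdd attribute of the Python SchemaRDD wrapper, so it is version-specific and unsupported:

    from pyspark.sql import SchemaRDD

    def coalesce_schema_rdd(srdd, n, sqlCtx):
        # call coalesce on the underlying Scala SchemaRDD (shuffle=False,
        # no ordering), then re-wrap the result for the Python side
        return SchemaRDD(srdd._schema_rdd.coalesce(n, False, None), sqlCtx)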

coalesce on SchemaRDD in pyspark

2014-09-11 Thread Brad Miller
Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1) I get this error: Py4JError: An error occurred while calling o94.coalesce. Trace:

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
My approach may be partly influenced by my limited experience with SQL and Hive, but I just converted all my dates to seconds-since-epoch and then selected samples from specific time ranges using integer comparisons. On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao hao.ch...@intel.com wrote: There
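
A minimal sketch of that workaround; the table and column names are made up, and sqlCtx is assumed to be a SQLContext with an events table already registered:

    import calendar, datetime

    def to_epoch(dt):
        # seconds since 1970-01-01 UTC, stored as a plain integer column
        return calendar.timegm(dt.utctimetuple())

    start = to_epoch(datetime.datetime(2014, 9, 1))
    end = to_epoch(datetime.datetime(2014, 9, 5))
    rows = sqlCtx.sql("SELECT * FROM events WHERE ts >= %d AND ts < %d" % (start, end))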

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
to back a table in a SQLContext. On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen quasi...@gmail.com wrote: Hi Brad, When you do the conversion is this a Hive/Spark job or is it a pre-processing step before loading into HDFS? ---Ben On Fri, Sep 5, 2014 at 10:29 AM, Brad Miller bmill

Re: Spark webUI - application details page

2014-08-29 Thread Brad Miller
How did you specify the HDFS path? When i put spark.eventLog.dir hdfs://crosby.research.intel-research.net:54310/tmp/spark-events in my spark-defaults.conf file, I receive the following error: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. :

/tmp/spark-events permissions problem

2014-08-29 Thread Brad Miller
Hi All, Yesterday I restarted my cluster, which had the effect of clearing /tmp. When I brought Spark back up and ran my first job, /tmp/spark-events was re-created and the job ran fine. I later learned that other users were receiving errors when trying to create a spark context. It turned out
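
One way to avoid that trap, sketched under the assumption that an admin pre-creates the directory: give /tmp/spark-events the same world-writable-with-sticky-bit mode as /tmp itself, so every user's driver can write logs there:

    import os

    path = "/tmp/spark-events"
    if not os.path.isdir(path):
        os.makedirs(path)
    os.chmod(path, 0o1777)  # rwxrwxrwt: anyone may create, only owners may delete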

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for

Re: Spark webUI - application details page

2014-08-15 Thread Brad Miller
Hi Andrew, I'm running something close to the present master (I compiled several days ago) but am having some trouble viewing history. I set spark.eventLog.dir to true, but continually receive the error message (via the web UI) Application history not found...No event logs found for application
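
The two event-log settings are easy to mix up; a sketch of the pairing as documented for Spark 1.x, with a placeholder HDFS URL:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.eventLog.enabled", "true")  # the boolean switch
            .set("spark.eventLog.dir",              # a directory, not a boolean
                 "hdfs://namenode:54310/tmp/spark-events"))
    sc = SparkContext(conf=conf)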

SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
Hi All, I have a Spark job for which I need to increase the amount of memory allocated to the driver to collect a large-ish (200M) data structure. Formerly, I accomplished this by setting SPARK_MEM before invoking my job (which effectively set memory on the driver) and then setting

SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
Hi All, I'm having some trouble setting the disk spill directory for spark. The following approaches set spark.local.dir (according to the Environment tab of the web UI) but produce the indicated warnings: *In spark-env.sh:* export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill *Associated
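
The property being targeted, in its in-code form; a sketch only, since in Spark 1.x the executors generally pick up spark.local.dir from spark-env.sh or spark-defaults.conf before the workers start, and setting it after launch may affect the driver alone:

    from pyspark import SparkConf, SparkContext

    # scratch space for shuffle output and disk spills
    conf = SparkConf().set("spark.local.dir", "/spark/spill")
    sc = SparkContext(conf=conf)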

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Hi All, I'm having a bit of trouble with nested data structures in pyspark with saveAsParquetFile. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802. *# these all work* sqlCtx.jsonRDD(sc.parallelize(['{record:
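
The failing pattern, reconstructed from the snippet; the JSON payload and output path are placeholders:

    # infer a nested schema from JSON, then write it out as Parquet
    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"record": {"name": "a", "vals": [1, 2]}}']))
    srdd.saveAsParquetFile("/tmp/records.parquet")  # nested schemas tripped this at the time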

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai yh...@databricks.com wrote: Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill

pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi All, I have a data set where each record is serialized using JSON, and I'm interested to use SchemaRDDs to work with the data. Unfortunately I've hit a snag since some fields in the data are maps and lists, and are not guaranteed to be populated for each record. This seems to cause
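
A minimal repro of the kind of data described, assuming a SQLContext named sqlCtx; the second record omits the map field the first one carries:

    rows = sc.parallelize([
        {"id": 1, "tags": {"a": 1}},  # map field present
        {"id": 2},                    # map field absent
    ])
    # inferSchema samples the data, so fields missing from some records
    # can leave the inferred schema incomplete or inconsistent
    srdd = sqlCtx.inferSchema(rows)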

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
. Nick On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I have a data set where each record is serialized using JSON, and I'm interested to use SchemaRDDs to work with the data. Unfortunately I've hit a snag since some fields in the data are maps

trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I am interested to use jsonRDD and jsonFile to create a SchemaRDD out of some JSON data I have, but I've run into some instability involving the following java exception: An error occurred while calling o1326.collect. : org.apache.spark.SparkException: Job aborted due to stage failure:

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Got it. Thanks! On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Notice the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? Yep, my bad. I’m running off master at commit

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
running this on master or the 1.1-RC which should be coming out this week. Pyspark did not have good support for nested data previously. If you still encounter issues using a more recent version, please file a JIRA. Thanks! On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller bmill...@eecs.berkeley.edu

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
the complete schema. Does this work for you? Nick On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I have a data set where each record is serialized using JSON, and I'm interested to use SchemaRDDs to work with the data. Unfortunately I've hit

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
-2376 best, -brad On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu dav...@databricks.com wrote: This sample argument of inferSchema is still not in master; I will try to add it if it makes sense. On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
anybody else verify that the second example still crashes (and is meant to work)? If so, would it be best to modify JIRA-2376 or start a new bug? https://issues.apache.org/jira/browse/SPARK-2376 best, -Brad On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Nick

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
]])] Can’t answer your question about branch stability, though. Spark is a very active project, so stuff is happening all the time. Nick On Tue, Aug 5, 2014 at 7:20 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Nick, Can you check that the call to collect() works as well

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
the proper schema. We will take a look it. Thanks, Yin On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing sample to master will probably not be necessary

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
take the data back to the Python side, SchemaRDD#javaToPython failed on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I checked out and built master

Re: Announcing Spark 1.0.1

2014-07-12 Thread Brad Miller
Hi All, Congrats to the entire Spark team on the 1.0.1 release. In checking out the new features, I noticed that it looks like the python API docs have been updated, but the title and the header at the top of the page still say Spark 1.0.0. Clearly not a big deal... I just wouldn't want anyone

odd caching behavior or accounting

2014-06-30 Thread Brad Miller
-- Forwarded message -- From: Brad Miller bmill...@eecs.berkeley.edu Date: Mon, Jun 30, 2014 at 10:20 AM Subject: odd caching behavior or accounting To: user@spark.apache.org Hi All, I've recently noticed some caching behavior which I did not understand and may or may not have indicated

pyspark bug with unittest and scikit-learn

2014-06-19 Thread Brad Miller
Hi All, I am attempting to develop some unit tests for a program using pyspark and scikit-learn and I've come across some weird behavior. I receive the following warning during some tests python/pyspark/serializers.py:327: DeprecationWarning: integer argument expected, got float. Although it's
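
A bare-bones skeleton of that setup with hypothetical test names; the main constraint is one SparkContext per test class, stopped on teardown:

    import unittest
    from pyspark import SparkContext

    class PySparkTestCase(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.sc = SparkContext("local[2]", "tests")  # one context for the whole class

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()  # two live SparkContexts in one process will fail

        def test_count(self):
            self.assertEqual(self.sc.parallelize(range(10)).count(), 10)

    if __name__ == "__main__":
        unittest.main()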

pyspark join crash

2014-06-04 Thread Brad Miller
Hi All, I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions in the result, the join fails with the message Job aborted due to stage failure: Master removed our application:
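
The knob in question, sketched with toy data; numPartitions is the optional second argument of join in PySpark and controls how many reduce tasks the shuffle produces:

    left = sc.parallelize([(i, i) for i in range(10000)])
    right = sc.parallelize([(i, -i) for i in range(10000)])

    joined = left.join(right, numPartitions=2000)  # more, smaller reduce tasks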

Re: pyspark join crash

2014-06-04 Thread Brad Miller
https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something we’ve been meaning to look at soon. Matei On Jun 4, 2014, at 8:23 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk space dedicated to spark, data storage in separate HDFS shares). I've been using spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
4. Shuffle on disk Is it true - I couldn't find it in official docs, but did see this mentioned in various threads - that shuffle _always_ hits disk? (Disregarding OS caches.) Why is this the case? Are you planning to add a function to do shuffle in memory or are there some intrinsic reasons

Re: trouble with join on large RDDs

2014-04-09 Thread Brad Miller
On Tue, Apr 8, 2014 at 2:56 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I poked around a bit more to (1) confirm my suspicions that the crash was related to memory consumption and (2) figure out why there is no error shown in 12_stderr, the spark executor log file from

pyspark broadcast error

2014-03-11 Thread Brad Miller
Hi All, When I run the program shown below, I receive the error shown below. I am running the current version of branch-0.9 from github. Note that I do not receive the error when I replace 2 ** 29 with 2 ** X, where X < 29. More interestingly, I do not receive the error when X = 30, and when X
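
A minimal repro in the spirit of the report, assuming the broadcast payload is a flat byte string of 2 ** X bytes; 2 ** 29 bytes (512 MB) sits near the serialization limits of that era:

    payload = b"\x00" * (2 ** 29)  # smaller exponents reportedly worked fine
    b = sc.broadcast(payload)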