Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
a pull request for this? We'd be happy to accept the change. - Patrick

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor is always run in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:
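
A minimal sketch of the static-initialization idea described above, assuming a hypothetical external binary path; a lazy val in an object initializes at most once per executor JVM:

    // one bridge sub-process per executor JVM via lazy static init
    object ExternalBridge {
      lazy val process: Process = {
        // binary path is hypothetical; replace with your external program
        new ProcessBuilder("/path/to/external/binary").start()
      }
    }

    rdd.mapPartitions { iter =>
      val bridge = ExternalBridge.process // first access on each executor launches the process
      // talk to bridge.getOutputStream / bridge.getInputStream as needed
      iter
    }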

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
. - Patrick On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim wrote: > Hi, > How can I dispose an Accumulator? > It has no method like 'unpersist()' which Broadcast provides. > Thanks.

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the "user@" list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k w

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Patrick Wendell
I've removed my docs from my site to avoid confusion... somehow that link propagated all over the place! On Sat, May 31, 2014 at 1:58 AM, Xiangrui Meng wrote: > The documentation you looked at is not official, though it is from > @pwendell's website. It was for the Spark SQL release. Please find

Re: Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Patrick Wendell
lob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala - Patrick On Fri, May 30, 2014 at 7:09 AM, Daniel Siegmann wrote: > The Spark 1.0.0 release notes state "Internal instrumentation has been added > to allow applications to monitor and instrument Spark jobs."
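
A sketch of hooking into that listener interface (method and field names follow the Spark 1.0 scheduler API; the println bodies are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    class JobMonitor extends SparkListener {
      // called when a stage finishes; stageInfo carries the stage id, name, etc.
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        val info = stageCompleted.stageInfo
        println("Stage " + info.stageId + " (" + info.name + ") completed")
      }
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        println("Task ended in stage " + taskEnd.stageId)
      }
    }

    sc.addSparkListener(new JobMonitor)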

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
ard way to make them compatible with 2.6 we should do that. For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this? - Patrick On May 30, 2014 5:08 AM, "Jeremy Lee" wrote: > > Hi there! I'm relatively new to the list, so s

Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
-- Christopher T. Nguyen, Co-founder & CEO, Adatao <http://adatao.com>, linkedin.com/in/ctnguyen <http://linkedin.com/in/ctnguyen> On Fri, May 30, 2014 at 3:12 AM, Patrick

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
he.org/releases/spark-release-1-0-0.html Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick

Re: Spark Streaming and JMS

2014-05-15 Thread Patrick McGloin
Hi Tathagata, Thanks for your response, just the advice I was looking for. I will try this out with Spark 1.0 when it comes out. Best regards, Patrick On 5 May 2014 22:42, Tathagata Das wrote: > A few high-level suggestions. > > 1. I recommend using the new Receiver API in almost
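
A rough sketch of what a JMS source might look like with the Spark 1.0 Receiver API recommended in the thread (connection details omitted; the class and parameter names are hypothetical):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JmsReceiver(brokerUrl: String, queueName: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart() {
        // connect to the JMS broker here, then push messages to Spark
        new Thread("jms-receiver") {
          override def run() {
            while (!isStopped()) {
              // val msg = consumer.receive()  // real JMS consumption goes here
              // store(msg.getText)            // hand each message to Spark
            }
          }
        }.start()
      }

      def onStop() { /* close the JMS connection */ }
    }

    // usage: val lines = ssc.receiverStream(new JmsReceiver("tcp://broker:61616", "myQueue"))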

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
ew SparkContext(conf) - Patrick On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers wrote: > i have some settings that i think are relevant for my application. they are > spark.akka settings so i assume they are relevant for both executors and my > driver program. > &
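
The programmatic alternative the reply sketches, reconstructed (the specific spark.akka property and value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.akka.frameSize", "64") // applies to both driver and executors
    val sc = new SparkContext(conf)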

pyspark python exceptions / py4j exceptions

2014-05-15 Thread Patrick Donovan
Hello, I'm trying to write a python function that does something like:

    def foo(line):
        try:
            return stuff(line)
        except Exception:
            raise MoreInformativeException(line)

and then use it in a map like so:

    rdd.map(foo)

and have my MoreInformativeException make it back if/when

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
kely to be almost identical to the final release. - Patrick On Tue, May 13, 2014 at 9:40 AM, bhusted wrote: > Can anyone comment on the anticipated date or worse case timeframe for when > Spark 1.0.0 will be released? > > > > -- > View this message in context: > http:/

Re: same log4j slf4j error in spark 9.1

2014-05-13 Thread Patrick Wendell
Hey Adrian, If you are including log4j-over-slf4j.jar in your application, you'll still need to manually exclude slf4j-log4j12.jar from Spark. However, it should work once you do that. Before 0.9.1 you couldn't make it work, even if you added an exclude. - Patrick On Thu, May 8, 2014
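
In sbt, that exclusion might look like this (version illustrative):

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" exclude("org.slf4j", "slf4j-log4j12")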

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
c.registerInputStream(tascQueue) Is this the best way to go? Best regards, Patrick

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
PM, Patrick Wendell wrote: > Hey Jeremy, > > This is actually a big problem - thanks for reporting it, I'm going to > revert this change until we can make sure it is backwards compatible. > > - Patrick > > On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman > wrote

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman wrote: > Hi all, > > A heads up in case others hit this and are con

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma wrote: > I had like to be corrected on this but I am just trying to say small enough >
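
A small illustration of that litmus test (the lookup table must fit comfortably in memory on every node; names are illustrative):

    val smallLookup = Map(1 -> "a", 2 -> "b")   // small: broadcast it
    val bc = sc.broadcast(smallLookup)
    val resolved = bigRdd.map(id => bc.value.getOrElse(id, "unknown"))
    // if the table were large, turn it into an RDD and join instead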

Re: Setting the Scala version in the EC2 script?

2014-05-03 Thread Patrick Wendell
your spark-ec2.py script to checkout spark-ec2 from forked version. - Patrick On Thu, May 1, 2014 at 2:14 PM, Ian Ferreira wrote: > Is this possible, it is very annoying to have such a great script, but still > have to manually update stuff afterwards.

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Patrick Wendell
e datasets with many partitions, since often there are bottlenecks at the granularity of a file. Is there a reason you need this to be exactly one file? - Patrick On Sat, May 3, 2014 at 4:14 PM, Chris Fregly wrote: > not sure if this directly addresses your issue, peter, but it's worth >

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file because it will always produce the same filename. (heavy use of pseudo code)

    dir = "/some/dir"
    rdd.coalesce(1).saveAsTextFile(dir)
    f = new File(dir + "part-0")
    f.move
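
A runnable version of that pseudo code, assuming the Hadoop FileSystem API and the usual "part-00000" output file name:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val dir = "/some/dir"
    rdd.coalesce(1).saveAsTextFile(dir)
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.rename(new Path(dir + "/part-00000"), new Path("/some/result.txt"))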

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Patrick Wendell
could be like this, it wouldn't violate the contract of union. AFAIK the only guarantee is that the resulting RDD will contain all elements. - Patrick On Tue, Apr 29, 2014 at 11:26 PM, Mingyu Kim wrote: > Yes, that’s what I meant. Sure, the numbers might not be actually sorted, > but t

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
ions returned by RDD.getPartitions() > and the row orders within the partitions determine the row order, I’m not > sure why union doesn’t respect the order because union operation simply > concatenates the two lists of partitions from the two RDDs. > > Mingyu > > > > > On 4/

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
You are right, once you sort() the RDD, then yes it has a well-defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote: > Hi Patrick, > > I'm a little confused about you
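
Concretely (a sketch; union's contract only guarantees the elements, not their order):

    val sorted = pairs.sortByKey()      // ordering is well-defined here
    val combined = sorted.union(other)  // ordering is no longer guaranteed
    // rely on order only after a fresh sort on the combined RDD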

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be "java friendly" so that we wouldn't have to use two versions. The class itself is simple. But I agree adding java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth wrote: > There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf > i

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
ut I'm no expert. On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond wrote: > For all the tasks, say 32 task on total > > Best Regards, > Raymond Liu > > > -Original Message- > From: Patrick Wendell [mailto:pwend...@gmail.com] > > Is this the serialization throug

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread Patrick Wendell
The signature of this function was changed in spark 1.0... is there any chance that somehow you are actually running against a newer version of Spark? On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote: > i met with the same question when update to spark 0.9.1 > (svn checkout https://github.com/apache

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote: > Hi > > I am running a WordCount program which count words from HDFS, and I > noticed that the serializer part of code takes a lot of CPU

Re: Shuffle Spill Issue

2014-04-28 Thread Patrick Wendell
Could you explain more what your job is doing and what data types you are using? These numbers alone don't necessarily indicate something is wrong. The relationship between the in-memory and on-disk shuffle amount is definitely a bit strange, the data gets compressed when written to disk, but unles

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
erver You can also accomplish this by just having a separate service that submits multiple jobs to a cluster where those jobs e.g. use different jars. - Patrick On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash wrote: > For the second question, you can submit multiple jobs through the same > S

Re: NullPointerException when run SparkPI using YARN env

2014-04-28 Thread Patrick Wendell
This was fixed in master. I think this happens if you don't set HADOOP_CONF_DIR to the location where your hadoop configs are (e.g. yarn-site.xml). On Sun, Apr 27, 2014 at 7:40 PM, martin.ou wrote: > 1.my hadoop 2.3.0 > 2.SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > 3.SPARK_YARN

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
This can only be a local filesystem though, it can't refer to an HDFS location. This is because it gets passed directly to the JVM. On Mon, Apr 28, 2014 at 9:55 PM, Patrick Wendell wrote: > Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set > spark.executor.extra

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set spark.executor.extraLibraryPath. On Mon, Apr 28, 2014 at 9:16 AM, Shubham Chopra wrote: > I am trying to use Spark/MLLib on Yarn and do not have libgfortran > installed on my cluster. Is there any way I can set LD_LIBRARY_PATH s
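
For Spark 1.0 that setting would look like (the library path is illustrative):

    val conf = new SparkConf()
      .set("spark.executor.extraLibraryPath", "/usr/lib64") // wherever libgfortran lives
    // on 0.9.x, set SPARK_LIBRARY_PATH in the environment instead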

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the Repl. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover wrote: > A couple of issues: > 1) the

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
sees the error first before the reader knows what is going on. Anyways maybe if you have a simpler solution you could sketch it out in the JIRA and we could talk over there. The current proposal in the JIRA is somewhat complicated... - Patrick On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo

Re: is it okay to reuse objects across RDD's?

2014-04-26 Thread Patrick Wendell
what you do inside of the function. But I'd be careful using this approach... - Patrick On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd wrote: > For example, > > val originalRDD: RDD[SomeCaseClass] = ... > > // Option 1: objects are copied, setting prop1 in the process > val tra

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou wrote: > > > occure exception when compile spark 0.9.1 using sbt,env: hadoop 2.3 > > 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > > > > 2.found Exception: > > found : org.apache

Re: Task splitting among workers

2014-04-20 Thread Patrick Wendell
For a HadoopRDD, first the spark scheduler calculates the number of tasks based on input splits. Usually people use this with HDFS data so in that case it's based on HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster then it will try to run the tasks on the data node that cont

Re: running tests selectively

2014-04-20 Thread Patrick Wendell
I put some notes in this doc: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan < sinchronized.a...@gmail.com> wrote: > I would like to run some of the tests selectively. I am in branch-1.0 > > Tried the following two comm

Re: Spark-ec2 asks for password

2014-04-18 Thread Patrick Wendell
Unfortunately - I think a lot of this is due to generally increased latency on ec2 itself. I've noticed that it's way more common than it used to be for instances to come online past the "wait" timeout in the ec2 script. On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT wrote: > Aureliano,

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences ar

Re: Hybrid GPU CPU computation

2014-04-11 Thread Patrick Grinaway
I've actually done it using PySpark and python libraries which call cuda code, though I've never done it from scala directly. The only major challenge I've hit is assigning tasks to gpus on multiple gpu machines. Sent from my iPhone > On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa wrote: > >

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
Pierre - I'm not sure that would work. I just opened a Spark shell and did this:

    scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
    res4: String = 1.7.0_25

It looks like this is the JVM version. - Patrick On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans < pi

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
I think this was solved in a recent merge: https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779 Is that what you are looking for? If so, mind marking the JIRA as resolved? On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas wrote: > Hey Patrick, >
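
If so, the method added by that merge can be called from the shell (output shown for illustration):

    scala> sc.version
    res0: String = 1.0.0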

Re: hbase scan performance

2014-04-10 Thread Patrick Wendell
This job might still be faster... in MapReduce there will be other overheads in addition to the fact that doing sequential reads from HBase is slow. But it's possible the bottleneck is the HBase scan performance. - Patrick On Wed, Apr 9, 2014 at 10:10 AM, Jerry Lam wrote: > Hi Dave,

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-10 Thread Patrick Wendell
Okay so I think the issue here is just a conflict between your application code and the Hadoop code. Hadoop 2.0.0 depends on protobuf 2.4.0a: https://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.0-alpha/hadoop-project/pom.xml Your code is depending on protobuf 2.5.X The protobuf libra

Re: trouble with "join" on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote: > I am running the latest version of PySpark branch-0.9 and having some > trouble with join. > > One RDD is about 100G (25GB compressed and serialized in memory) with > 130K records, the other RDD is about 10G (2.5G compressed and > serialized in

Re: Heartbeat exceeds

2014-04-04 Thread Patrick Wendell
If you look in the Spark UI, do you see any garbage collection happening? My best guess is that some of the executors are going into GC and they are timing out. You can manually increase the timeout by setting the Spark conf: spark.storage.blockManagerSlaveTimeoutMs to a higher value. In your cas
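
Setting that timeout might look like this (the 120-second value is illustrative):

    val conf = new SparkConf()
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000") // 120s instead of the default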

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
in the community has feedback from trying this. - Patrick On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal wrote: > Hi Christophe, > > Thanks for your reply and the spec file. I have solved my issue for now. > I didn't want to rely building spark using the spec file (%buil

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
and on jobs that crunch hundreds of terabytes (uncompressed) of data. - Patrick On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim wrote: > Spark community, > > > What's the size of the largest Spark cluster ever deployed? I've heard > Yahoo is running Spark on several hun

Re: Spark 1.0.0 release plan

2014-04-03 Thread Patrick Wendell
Btw - after that initial thread I proposed a slightly more detailed set of dates: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - Patrick On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia wrote: > Hey Bhaskar, this is still the plan, though QAing might take longer than > 1

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
l piece of functionality and something we might, e.g. want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren wrote: > What I'd like is a way to capture the information provided on the stages > page (i.e. cluster:4040/stages via IndexPage). Look

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David Thoma

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra wrote
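
With that overload, compressed text output is one extra argument (codec choice illustrative):

    import org.apache.hadoop.io.compress.GzipCodec
    rdd.saveAsTextFile("hdfs:///out/compressed", classOf[GzipCodec])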

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies: > No assembly descriptors found. -> [Help 1] > upon running > mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly > > > On Apr 1, 2014, at 4:13 PM, Patrick Wendell wrote: > > D

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote: > SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly > > That's all I do. > > On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote: > Vipul - could you show e

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
dependency but it still failed whenever I use the > jar with ScalaBuf dependency. > Spark version is 0.9.0 > > > ~Vipul > > On Mar 31, 2014, at 4:51 PM, Patrick Wendell wrote: > > Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be > getting pull

Re: batching the output

2014-03-31 Thread Patrick Wendell
Ya this is a good way to do it. On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey wrote: > Hi, > > I need to batch the values in my final RDD before writing out to hdfs. The > idea is to batch multiple "rows" in a protobuf and write those batches out > - mostly to save some space as a lot of metad
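
One way to batch within partitions before writing, sketched (makeBatch is a hypothetical function that packs a group of rows into one protobuf blob):

    val batched = rdd.mapPartitions { rows =>
      rows.grouped(1000).map(batch => makeBatch(batch)) // makeBatch: Seq[Row] => Array[Byte], hypothetical
    }
    batched.saveAsObjectFile("hdfs:///out/batches")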

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
your dependencies including the exact Spark version and other libraries. - Patrick On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey wrote: > I'm using ScalaBuff (which depends on protobuf2.5) and facing the same > issue. any word on this one? > On Mar 27, 2014, at 6:41 PM, Kanwaldeep

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup! Sent from my iPhone > On Mar 31, 2014, at 3:07 PM, Jeremy Freeman wrote: > > Happy to help with an NYC meet up (just emailed Andy). I recently moved to > VA, but am back in NYC quite often, and have been turning several > computational peo

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas wrote: > Is there a way to see 'Application Detail UI'

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Patrick Wendell
If you call repartition() on the original stream you can set the level of parallelism after it's ingested from Kafka. I'm not sure how it maps kafka topic partitions to tasks for the ingest though. On Thu, Mar 27, 2014 at 11:09 AM, Scott Clasen wrote: > I have a simple streaming job that create
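
A sketch of that repartition step (zkQuorum, groupId, and topicMap are assumed defined; the 16 is illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val raw = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    val spread = raw.repartition(16) // redistribute ingested records before heavy processing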

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
f fields to the respective cassandra columns. I think all of this would be fairly easy to implement on SchemaRDD and likely will make it into Spark 1.1 - Patrick On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai wrote: > Great work guys! Have been looking forward to this . . . > > In the blog it ment

Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want "2.3.0-mr1-cd

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building spark? - Patrick On Fri, Mar 21,

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman wrote: > There is no direct way to get this in pyspark, but you can get it from the > underlying java rdd. For ex

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a "No space left on device" error? Is that correct? If so, that's good to know because it's definitely counter intuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski wrote: > I would love to work

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
le if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset of a year. - Patrick On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra wrote: > It's much simpler: rdd.partitions.size > > > On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chamm
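
Illustrating that one-day filter example (all numbers made up):

    val year = sc.textFile("hdfs:///logs/2013/*")       // say, thousands of partitions
    val day = year.filter(_.startsWith("2013-06-01"))    // most partitions now near-empty
                  .coalesce(16)                          // shrink to something sensible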

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
ngle pass automatically... but that's not quite released yet :) - Patrick On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers wrote: > i currently typically do something like this: > > scala> val rdd = sc.parallelize(1 to 10) > scala> import com.twitter.algebird.Operators._

Re: Log analyzer and other Spark tools

2014-03-18 Thread Patrick Wendell
Hey Roman, Ya definitely checkout pull request 42 - one cool thing is this patch now includes information about in-memory storage in the listener interface, so you can see directly which blocks are cached/on-disk etc. - Patrick On Mon, Mar 17, 2014 at 5:34 PM, Matei Zaharia wrote: > Tak

Re: slf4j and log4j loop

2014-03-16 Thread Patrick Wendell
This is not released yet but we're planning to cut a 0.9.1 release very soon (e.g. most likely this week). In the mean time you'll have checkout branch-0.9 of Spark and publish it locally then depend on the snapshot version. Or just wait it out... On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu wr

Re: Maximum memory limits

2014-03-16 Thread Patrick Wendell
Sean - was this merged into the 0.9 branch as well (it seems so based on the message from rxin). If so it might make sense to try out the head of branch-0.9 as well. Unless there are *also* other changes relevant to this in master. - Patrick On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen wrote

Help vote for Spark talks at the Hadoop Summit

2014-03-13 Thread Patrick Wendell
(Data Science Track) Recent Developments in Spark MLlib and Beyond bit.ly/1hgZW5D (The Future of Apache Hadoop Track) Cheers, - Patrick

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Patrick Wendell
/core/index.html#org.apache.spark.rdd.JdbcRDD https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73 - Patrick On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas wrote: > My fellow welders, > > (Can we make that a thing? Let's mak
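
JdbcRDD covers reads; for pushing an RDD into a database, a common hand-rolled pattern (not an official API) is one connection per partition. URL, table, and row handling below are hypothetical:

    rdd.foreachPartition { rows =>
      // one connection per partition rather than per element
      val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://host/db")
      try {
        val stmt = conn.prepareStatement("INSERT INTO events (value) VALUES (?)")
        rows.foreach { r =>
          stmt.setString(1, r.toString)
          stmt.executeUpdate()
        }
      } finally {
        conn.close()
      }
    }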

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
ss the RDD itself and override getPreferredLocations. Keep in mind this is tricky because the set of executors might change during the lifetime of a Spark job. - Patrick On Thu, Mar 13, 2014 at 11:50 AM, David Thomas wrote: > Is it possible to parition the RDD elements in a round robin fashion?
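
Without subclassing, one approximation of round-robin placement keys each element by its index (assuming RDD.zipWithIndex, available in Spark 1.0; n is illustrative). Note this controls partition assignment, not which executor hosts each partition:

    import org.apache.spark.HashPartitioner

    val n = 8
    val roundRobin = rdd.zipWithIndex()
      .map { case (elem, idx) => (idx % n, elem) }  // keys cycle 0, 1, ..., n-1
      .partitionBy(new HashPartitioner(n))
      .values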

Re: Changing number of workers for benchmarking purposes

2014-03-12 Thread Patrick Wendell
is:

    for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do

to this

    for slave in `cat "$HOSTLIST"| head -n $NUM_SLAVES | sed "s/#.*$//;/^$/d"`; do

Then you could just set NUM_SLAVES before you stop/start. Not sure if this helps much but maybe it'

Re: building Spark docs

2014-03-12 Thread Patrick Wendell
Diana - I'm forwarding this to the dev list since it might be useful there as well. On Wed, Mar 12, 2014 at 11:39 AM, Diana Carroll wrote: > Hi all. I needed to build the Spark docs. The basic instructions to do > this are in spark/docs/README.md but it took me quite a bit of playing > around to

Re: Block

2014-03-11 Thread Patrick Wendell
A block is an internal construct that isn't directly exposed to users. Internally though, each partition of an RDD is mapped to one block. - Patrick On Mon, Mar 10, 2014 at 11:06 PM, David Thomas wrote: > What is the concept of Block and BlockManager in Spark? How is a Block > r

Re: "Too many open files" exception on reduceByKey

2014-03-10 Thread Patrick Wendell
't change so it won't help the ulimit problem. This means you'll have to use fewer reducers (e.g. pass reduceByKey a number of reducers) or use fewer cores on each machine. - Patrick On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah wrote: > Hi everyone, > > My team (cc'
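
For example, capping the reduce side (the 64 is illustrative):

    // the second argument to reduceByKey sets the number of reducers,
    // which bounds the number of shuffle files open at once
    val counts = pairs.reduceByKey(_ + _, 64)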

Re: [External] Re: no stdout output from worker

2014-03-10 Thread Patrick Wendell
Hey Sen, Suarav is right, and I think all of your print statements are inside of the driver program rather than inside of a closure. How are you running your program (i.e. what do you run that starts this job)? Where you run the driver you should expect to see the output. - Patrick On Mon, Mar

Re: no stdout output from worker

2014-03-09 Thread Patrick Wendell
hines. If you see stderr but not stdout that's a bit of a puzzler since they both go through the same mechanism. - Patrick On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] wrote: > Hi > I have some System.out.println in my Java code that is working ok in a local > environment. But

Re: Kryo serialization does not compress

2014-03-06 Thread Patrick Wendell
Hey There, This is interesting... thanks for sharing this. If you are storing in MEMORY_ONLY then you are just directly storing Java objects in the JVM. So they can't be compressed because they aren't really stored in a known format; it's just left up to the JVM. To answer your other question, it's
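
To actually get serialized (and optionally compressed) in-memory storage, the storage level has to say so; a sketch:

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // store serialized bytes, not raw objects
    // with spark.rdd.compress=true those bytes can additionally be compressed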

Re: Python 2.7 + numpy break sortByKey()

2014-03-06 Thread Patrick Wendell
The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey requires using many machines. It seems like maybe python didn't get upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder (f

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
ssic/1.1.1 - Patrick On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko wrote: > Hi Patrick, > > Thanks for the patch. I tried building a patched version of > spark-core_2.10-0.9.0-incubating.jar but the Maven build fails: > [ERROR] > /home/das/Work/thx/incubator-spark/core/src

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
Spark with this patch and seeing if it works that would be great. Thanks, Patrick On Wed, Mar 5, 2014 at 10:26 AM, Paul Brown wrote: > > Hi, Sergey -- > > Here's my recipe, implemented via Maven; YMMV if you need to do it via sbt, > etc., but it should

Re: spark-ec2 login expects at least 1 slave

2014-03-01 Thread Patrick Wendell
Yep, currently it only supports running at least 1 slave. On Sat, Mar 1, 2014 at 4:47 PM, nicholas.chammas wrote: > I successfully launched a Spark EC2 "cluster" with 0 slaves using spark-ec2. > When trying to login to the master node with spark-ec2 login, I get the > following: > > Searching for
