Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
Xt*X should mathematically always be positive semi-definite, so the only way this might be bad is if it’s not invertible due to linearly dependent rows. This might happen due to the initialization or possibly due to numerical issues, though it seems unlikely. Maybe it also happens if some users
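The point about Xt*X being positive semi-definite but possibly singular can be illustrated with a small numpy sketch (this is an illustration of the math, not Spark's actual ALS code): with linearly dependent rows, the Gram matrix is singular, and an ALS-style regularization term (lambda * I, with an illustrative lambda) restores positive definiteness so the solve succeeds.

```python
import numpy as np

# X has linearly dependent rows, so X^T X is positive semi-definite
# but singular and a plain solve of the normal equations would fail.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],   # multiple of the first row
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

XtX = X.T @ X
assert np.linalg.matrix_rank(XtX) < XtX.shape[0]  # singular

# ALS-style regularization makes the system positive definite.
lam = 0.1  # illustrative regularization strength
w = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)
```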

Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
definite. -- Sean Owen | Director, Data Science | London On Thu, Mar 6, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Xt*X should mathematically always be positive semi-definite, so the only way this might be bad is if it’s not invertible due to linearly dependent rows

Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
, If the data has linearly dependent rows ALS should have a fallback mechanism. Either remove the rows and then call BLAS posv, or call BLAS gesv or Breeze QR decomposition. I can share the analysis over email. Thanks. Deb On Thu, Mar 6, 2014 at 9:39 AM, Matei Zaharia matei.zaha
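The fallback idea suggested above (try a Cholesky solve of the normal equations, the analogue of LAPACK posv, and fall back to a QR/least-squares route when the Gram matrix is singular) can be sketched in numpy — the function name and structure are illustrative, not Spark's actual code path:

```python
import numpy as np

def solve_with_fallback(X, y):
    """Solve min ||Xw - y||: Cholesky on the normal equations first,
    falling back to a rank-tolerant least-squares solve on failure."""
    XtX, Xty = X.T @ X, X.T @ y
    try:
        L = np.linalg.cholesky(XtX)     # raises if XtX is not positive definite
        z = np.linalg.solve(L, Xty)
        return np.linalg.solve(L.T, z)
    except np.linalg.LinAlgError:
        # QR / least-squares route tolerates linearly dependent rows
        return np.linalg.lstsq(X, y, rcond=None)[0]

X = np.array([[1.0, 2.0], [2.0, 4.0]])  # rank-deficient
w = solve_with_fallback(X, np.array([1.0, 2.0]))
```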

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-14 Thread Matei Zaharia
I like the pom-reader approach as well — in particular, that it lets you add extra stuff in your SBT build after loading the dependencies from the POM. Profiles would be the one thing missing to be able to pass options through. Matei On Mar 14, 2014, at 10:03 AM, Patrick Wendell

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-18 Thread Matei Zaharia
Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if it’s easy to read in Python) or some kind of

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Matei Zaharia
at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro

Re: new Catalyst/SQL component merged into master

2014-03-21 Thread Matei Zaharia
Congrats Michael and all for getting this so far. Spark SQL and Catalyst will make it much easier to use structured data in Spark, and open the door for some very cool extensions later. Matei On Mar 20, 2014, at 11:15 PM, Heiko Braun ike.br...@googlemail.com wrote: Congrats! That's a really

Re: [VOTE] Release Apache Spark 0.9.1 (rc1)

2014-03-25 Thread Matei Zaharia
+1 looks good to me. I tried both the source and CDH4 versions and looked at the new streaming docs. The release notes seem slightly incomplete, but I guess you’re still working on them? Anyway those don’t go into the release tarball so it’s okay. Matei On Mar 24, 2014, at 2:01 PM, Tathagata

Re: [VOTE] Release Apache Spark 0.9.1 (rc1)

2014-03-25 Thread Matei Zaharia
it on a Linux cluster. I opened https://spark-project.atlassian.net/browse/SPARK-1326 to track it. We can put it in another RC if we find bigger issues. Matei On Mar 25, 2014, at 10:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 looks good to me. I tried both the source and CDH4 versions

Re: Scala 2.10.4

2014-03-27 Thread Matei Zaharia
Sounds good. Feel free to send a PR even though it’s a small change (it leads to better Git history and such). Matei On Mar 27, 2014, at 4:15 PM, Mark Hamstra m...@clearstorydata.com wrote: FYI, Spark master does build cleanly and the tests do run successfully with Scala version set to

Re: Scala 2.10.4

2014-03-28 Thread Matei Zaharia
worth the annoyance of everyone needing to download a new version of Scala, making yet another version of the AMIs, etc. -Kay On Thu, Mar 27, 2014 at 4:33 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Sounds good. Feel free to send a PR even though it's a small change (it leads

Re: ephemeral storage level in spark ?

2014-04-06 Thread Matei Zaharia
The off-heap storage level is currently tied to Tachyon, but it might support other forms of off-heap storage later. However it’s not really designed to be mixed with the other ones. For this use case you may want to rely on memory locality and have some custom code to push the data to the

Re: Contributing to Spark

2014-04-07 Thread Matei Zaharia
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them here: https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) Matei On Apr 7, 2014, at 9:45 PM,

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Matei Zaharia
The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I’ve given you permissions now. Matei On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

Re: Jekyll documentation generation error

2014-04-23 Thread Matei Zaharia
Try doing “gem install kramdown”. The maruku gem for Markdown throws these errors, but Kramdown doesn’t. Matei On Apr 22, 2014, at 11:31 PM, DB Tsai dbt...@dbtsai.com wrote: This is the trace. Conversion error: There was an error converting 'docs/cluster-overview.md '.

Re: Mailing list

2014-05-03 Thread Matei Zaharia
Hi Nicolas, Good catches on these things. Your website seems a little bit incomplete. I have found this page [1] which lists the two main mailing lists, users and dev. But I see a reference to a mailing list about issues, which tracked the Spark issues when they were hosted at Atlassian. I

Re: Spark on Scala 2.11

2014-05-11 Thread Matei Zaharia
We do want to support it eventually, possibly as early as Spark 1.1 (which we’d cross-build on Scala 2.10 and 2.11). If someone wants to look at it before, feel free to do so! Scala 2.11 is very close to 2.10 so I think things will mostly work, except for possibly the REPL (which has require

Re: Bug is KryoSerializer under Mesos [work-around included]

2014-05-12 Thread Matei Zaharia
Hey Soren, are you sure that the JAR you used on the executors is for the right version of Spark? Maybe they’re running an older version. The Kryo serializer should be initialized the same way on both. Matei On May 12, 2014, at 10:39 AM, Soren Macbeth so...@yieldbot.com wrote: I finally

Re: Kryo not default?

2014-05-12 Thread Matei Zaharia
It was just because it might not work with some user data types that are Serializable. But we should investigate it, as it’s the easiest thing one can enable to improve performance. Matei On May 12, 2014, at 2:47 PM, Anand Avati av...@gluster.org wrote: Hi, Can someone share the reason why

Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link only to a company’s builds; but we can perhaps include them in addition to instructions for building Mesos from Apache. Matei On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote: Andrew,

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Matei Zaharia
SHA-1 is being end-of-lived so I’d actually say switch to 512 for all of them instead. On May 13, 2014, at 6:49 AM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be
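The switch from SHA-1 to SHA-512 digests for release artifacts boils down to verifying a downloaded file against a published hex digest. A minimal stdlib sketch (file contents and names are stand-ins, not real Spark artifacts), reading in chunks so large tarballs don't need to fit in memory:

```python
import hashlib
import os
import tempfile

def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Throwaway file standing in for a downloaded release tarball:
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"release contents")
digest = sha512_hex(path)
os.remove(path)
assert len(digest) == 128  # SHA-512 = 512 bits = 128 hex characters
```

To verify a release, this digest would be compared against the published `.sha512` file for the artifact.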

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
As others have said, the 1.0 milestone is about API stability, not about saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
BTW for what it’s worth I agree this is a good option to add, the only tricky thing will be making sure the checkpoint blocks are not garbage-collected by the block store. I don’t think they will be though. Matei On May 17, 2014, at 2:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: We do

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor. BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users. Regards Mridul On 18-May-2014 2:19 am, Matei Zaharia matei.zaha...@gmail.com wrote: As others have said, the 1.0 milestone is about API

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
I took the always fun task of testing it on Windows, and unfortunately, I found some small problems with the prebuilt packages due to recent changes to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn’t quite match

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
if it's a common approach to have discussions in JIRA not here. I don't think it's the ASF way. Regards, Jacek Laskowski http://blog.japila.pl On 17 May 2014 at 23:55, Matei Zaharia matei.zaha...@gmail.com wrote: We do actually have replicated StorageLevels in Spark. You can use

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
it was the easiest way for people to continue. Matei On May 18, 2014, at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, maybe it’s just different in other Apache projects. All the ones I’ve participated in have had their design discussions on JIRA. For example take a look at https

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
on a different version of org.apache.commons than Hadoop 2, but it needs investigation. Tom, any thoughts on this? Matei On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I took the always fun task of testing it on Windows, and unfortunately, I found some small problems

Re: question about Spark repositories in GitHub

2014-05-19 Thread Matei Zaharia
“master” is where development happens, while branch-1.0, branch-0.9, etc are for maintenance releases in those versions. Most likely if you want to contribute you should use master. Some of the other named branches were for big features in the past, but none are actively used now. Matei On

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Matei Zaharia
+1 Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed that the issues in the previous RC were fixed. Matei On May 20, 2014, at 5:28 PM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) I have: - checked signatures and checksums of the files - built

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Matei Zaharia
+1 Tested on Mac OS X and Windows. Matei On May 26, 2014, at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918:

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Matei Zaharia
I think the question for me would be does this only happen when you call partitionBy, or always? And how common do you expect calls to partitionBy to be? If we can wait for 1.0.1 then I’d wait on this one. Matei On May 26, 2014, at 10:47 PM, Patrick Wendell pwend...@gmail.com wrote: Hey

Re: About JIRA SPARK-1825

2014-05-27 Thread Matei Zaharia
Hey Taeyun, have you sent a pull request for this fix? We can review it there. It’s too late to merge anything but blockers for 1.0.0 but we can do it for 1.0.1 or 1.1, depending on how big the patch is. Matei On May 27, 2014, at 5:25 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr

Re: Suggestion: RDD cache depth

2014-05-29 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it.
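The "reference counting" idea suggested in the reply — multiple libraries can each persist() the same RDD, and the data is only evicted when every caller has unpersisted it — can be sketched as a toy class (names and structure are illustrative, not Spark's API):

```python
class RefCountedCache:
    """Toy sketch: persist() increments a per-RDD counter, unpersist()
    decrements it; eviction happens only at zero, so composed libraries
    don't evict each other's cached data."""

    def __init__(self):
        self.counts = {}

    def persist(self, rdd_id):
        self.counts[rdd_id] = self.counts.get(rdd_id, 0) + 1

    def unpersist(self, rdd_id):
        self.counts[rdd_id] -= 1
        if self.counts[rdd_id] == 0:
            del self.counts[rdd_id]   # actually evict the data
            return True
        return False

cache = RefCountedCache()
cache.persist("rdd-42")   # library A caches
cache.persist("rdd-42")   # library B caches the same RDD
assert cache.unpersist("rdd-42") is False  # A releases; B still needs it
assert cache.unpersist("rdd-42") is True   # B releases; now evicted
```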

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Matei Zaharia
are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) Please hold off

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
Why do you need to call Serializer from your own program? It’s an internal developer API so ideally it would only be called to extend Spark. Are you looking to implement a custom Serializer? Matei On Jun 1, 2014, at 3:40 PM, Soren Macbeth so...@yieldbot.com wrote:

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
BTW passing a ClassTag tells the Serializer what the type of object being serialized is when you compile your program, which will allow for more efficient serializers (especially on streams). Matei On Jun 1, 2014, at 4:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Why do you need

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
it by making ClassTag object in clojure, but it's less than ideal. On Sun, Jun 1, 2014 at 4:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW passing a ClassTag tells the Serializer what the type of object being serialized is when you compile your program, which will allow for more

Re: ClassTag in Serializer in 1.0.0 makes non-scala callers sad panda

2014-06-01 Thread Matei Zaharia
, 2014 at 5:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, got it. In general it will always be safe to pass the ClassTag for java.lang.Object here — this is what our Java API does to say that type info is not known. So you can always pass that. Look at the Java code for how to get

Re: Eclipse Scala IDE/Scala test and Wiki

2014-06-02 Thread Matei Zaharia
Madhu, can you send me your Wiki username? (Sending it just to me is fine.) I can add you to the list to edit it. Matei On Jun 2, 2014, at 6:27 PM, Reynold Xin r...@databricks.com wrote: I tried but didn't find where I could add you. You probably need Matei to help out with this. On

Re: Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Matei Zaharia
Done. Looks like this was lost in the JIRA import. Matei On Jun 3, 2014, at 11:33 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi, Could someone with right karma kindly add my username (hsaputra) to Spark's contributor list? I was added before but somehow now I can no longer assign

Re: collectAsMap doesn't return a multiMap?

2014-06-03 Thread Matei Zaharia
Yup, it’s meant to be just a Map. You should probably use collect() and build a multimap instead if you’d like that. Matei On Jun 3, 2014, at 2:08 PM, Doris Xin doris.s@gmail.com wrote: Hey guys, Just wanted to check real quick if collectAsMap was by design not to return a multimap

Re: Building Spark against Scala 2.10.1 virtualized

2014-06-05 Thread Matei Zaharia
You can modify project/SparkBuild.scala and build Spark with sbt instead of Maven. On Jun 5, 2014, at 12:36 PM, Meisam Fathi meisam.fa...@gmail.com wrote: Hi community, How should I change sbt to compile spark core with a different version of Scala? I see maven pom files define

Re: Compression with DISK_ONLY persistence

2014-06-11 Thread Matei Zaharia
Yes, actually even if you don’t set it to true, on-disk data is compressed. (This setting only affects serialized data in memory). Matei On Jun 11, 2014, at 2:56 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, Will spark.rdd.compress=true enable compression when using

Fwd: ApacheCon CFP closes June 25

2014-06-12 Thread Matei Zaharia
(I’m forwarding this message on behalf of the ApacheCon organizers, who’d like to see involvement from every Apache project!) As you may be aware, ApacheCon will be held this year in Budapest, on November 17-23. (See http://apachecon.eu for more info.) The Call For Papers for that conference

Re: RFC: [SPARK-529] Create constants for known config variables.

2014-06-23 Thread Matei Zaharia
Hey Marcelo, When we did the configuration pull request, we actually avoided having a big list of defaults in one class file, because this creates a file that all the components in the project depend on. For example, since we have some settings specific to streaming and the REPL, do we want

Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Matei Zaharia
+1 Tested it out on Mac OS X and Windows, looked through docs. Matei On Jun 26, 2014, at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-06 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei On Jul 6, 2014, at 1:54 AM, Andrew Or and...@databricks.com wrote: +1, verified that the UI bug is in fact fixed in https://github.com/apache/spark/pull/1255. 2014-07-05 20:01 GMT-07:00 Soren Macbeth so...@yieldbot.com: +1 On Sat, Jul 5, 2014 at 7:41

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Matei Zaharia
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead with this release as is. This release didn't touch the Mesos support as far as I know, so the problem might be a nondeterministic issue with your application. But on the other hand the release does fix some critical

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Matei Zaharia
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such), so maybe there's a problem in YourKit or in your release of the JVM. Otherwise I'd suggest increasing the heap size of the unit tests a bit (you can do this in the SBT build file). Maybe they are very close to full

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Matei Zaharia
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will still be some buffers for the various OutputStreams, but they should be smaller. Matei On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Just a

Re: Catalyst dependency on Spark Core

2014-07-15 Thread Matei Zaharia
Yeah, that seems like something we can inline :). On Jul 15, 2014, at 7:30 PM, Baofeng Zhang pelickzh...@qq.com wrote: Is Matei following this? Catalyst uses the Utils to get the ClassLoader which loaded Spark. Can Catalyst directly do getClass.getClassLoader to avoid the dependency on

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Matei Zaharia
Hey Reynold, just to clarify, users will still have to manually broadcast objects that they want to use *across* operations (e.g. in multiple iterations of an algorithm, or multiple map functions, or stuff like that). But they won't have to broadcast something they only use once. Matei On Jul

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-27 Thread Matei Zaharia
+1 Tested this on Mac OS X. Matei On Jul 25, 2014, at 4:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.2. This release fixes a number of bugs in Spark 1.0.1. Some of the notable ones are - SPARK-2452:

Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-27 Thread Matei Zaharia
or somesuch, but testing for A will give an incorrect answer, and the code can't be expected to look for everyone's A+X versions. Actually inspecting the code is more robust if a bit messier. On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote: For this particular issue

Re: JIRA content request

2014-07-29 Thread Matei Zaharia
I agree as well. FWIW sometimes I've seen this happen due to language barriers, i.e. contributors whose primary language is not English, but we need more motivation for each change. On July 29, 2014 at 5:12:01 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: +1 on using JIRA workflows

Re: log overloaded in SparkContext/ Spark 1.0.x

2014-08-04 Thread Matei Zaharia
Hah, weird. log should be protected actually (look at trait Logging). Is your class extending SparkContext or somehow being placed in the org.apache.spark package? Or maybe the Scala compiler looks at it anyway.. in that case we can rename it. Please open a JIRA for it if that's the case. On

Welcoming two new committers

2014-08-08 Thread Matei Zaharia
Hi everyone, The PMC recently voted to add two new committers and PMC members: Joey Gonzalez and Andrew Or. Both have been huge contributors in the past year -- Joey on much of GraphX as well as quite a bit of the initial work in MLlib, and Andrew on Spark Core. Join me in welcoming them as

Re: Unit tests in 5 minutes

2014-08-08 Thread Matei Zaharia
Just as a note, when you're developing stuff, you can use test-only in sbt, or the equivalent feature in Maven, to run just some of the tests. This is what I do, I don't wait for Jenkins to run things. 90% of the time if it passes the tests that I know could break stuff, it will pass all of

Re: Mesos/Spark Deadlock

2014-08-24 Thread Matei Zaharia
, coarse-grained mode would be a challenge as we have to constantly remind people to kill their shells as soon as their queries finish.   Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark Standalone's cpu allocation behavior? On Sat, Aug 23, 2014 at 7:16 PM, Matei

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs. On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
for this one. Matei On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: This is kind of weird then, seems perhaps unrelated to this issue (or at least to the way I understood it). Is the problem maybe that Mesos saw 0 MB being freed and didn't re-offer the machine *even

Re: saveAsTextFile to s3 on spark does not work, just hangs

2014-08-25 Thread Matei Zaharia
Was the original issue with Spark 1.1 (i.e. master branch) or an earlier release? One possibility is that your S3 bucket is in a remote Amazon region, which would make it very slow. In my experience though saveAsTextFile has worked even for pretty large datasets in that situation, so maybe

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
Chen (tnac...@gmail.com) wrote: Hi Matei, I'm going to investigate from both Mesos and Spark side will hopefully have a good long term solution. In the mean time having a work around to start with is going to unblock folks. Tim On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia matei.zaha

Re: saveAsTextFile to s3 on spark does not work, just hangs

2014-08-25 Thread Matei Zaharia
the synthetic operation and see if I get the same results or not. Amnon On Mon, Aug 25, 2014 at 11:26 PM, Matei Zaharia [via Apache Spark Developers List] ml-node+s1001551n8000...@n3.nabble.com wrote: Was the original issue with Spark 1.1 (i.e. master branch) or an earlier release? One

Re: Handling stale PRs

2014-08-25 Thread Matei Zaharia
Hey Nicholas, In general we've been looking at these periodically (at least I have) and asking people to close out of date ones, but it's true that the list has gotten fairly large. We should probably have an expiry time of a few months and close them automatically. I agree that it's daunting

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Matei Zaharia
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI from a known URL. Seems like an issue in publishing this release. On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (shiva...@eecs.berkeley.edu) wrote: This is a chicken and egg problem in some sense. We can't

Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in pig test suite. That means UDF, Joins other

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Matei Zaharia
Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen

Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Matei Zaharia
Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: What do people think of running the Big

Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du, I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables to serializable types (e.g. null and String). Matei On September 12, 2014 at 8:48:36 PM, Du Li
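The advice above — map non-serializable objects to plain serializable values before shipping them — has a pure-Python analogue with pickle (the record class here is an illustrative stand-in, not a Hadoop Writable):

```python
import pickle

class UnserializableRecord:
    """Stand-in for a non-serializable record (e.g. a Writable)."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.handle = lambda: value   # lambdas cannot be pickled

records = [UnserializableRecord(None, "a"), UnserializableRecord(None, "b")]

# Serializing the raw records fails...
try:
    pickle.dumps(records[0])
    raised = False
except Exception:
    raised = True
assert raised

# ...so map them to plain (key, value) pairs first, analogous to
# mapping NullWritable/Text to null and String before collecting.
plain = [(r.key, r.value) for r in records]
data = pickle.dumps(plain)
```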

Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
.count(). As you can see, count() does not need to serialize and ship data while the other three methods do. Do you recall any difference between spark 1.0 and 1.1 that might cause this problem? Thanks, Du From: Matei Zaharia matei.zaha...@gmail.com Date: Friday, September 12, 2014 at 9:10 PM

Re: A couple questions about shared variables

2014-09-20 Thread Matei Zaharia
Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that tasks data

Re: A couple questions about shared variables

2014-09-21 Thread Matei Zaharia
:10 AM, Matei Zaharia wrote: Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast

Re: Impact of input format on timing

2014-10-05 Thread Matei Zaharia
Hi Tom, HDFS and Spark don't actually have a minimum block size -- so in that first dataset, the files won't each be costing you 64 MB. However, the main reason for difference in performance here is probably the number of RDD partitions. In the first case, Spark will create an RDD with 1
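The partition-count effect described above can be sketched with back-of-the-envelope arithmetic (the sizes and the 64 MB block size are illustrative assumptions): many small files yield one partition per file, while one large file is split by block size, so the same total data can produce very different partition counts and per-partition overheads.

```python
def estimate_partitions(file_sizes, block_size=64 * 1024 * 1024):
    """Rough estimate: one partition per block, or per file if the
    file is smaller than a block. Sketch only, not Spark's actual
    split logic."""
    return sum(max(1, -(-size // block_size)) for size in file_sizes)

small_files = [1 * 1024 * 1024] * 1000   # 1000 files of 1 MB each
one_big_file = [1000 * 1024 * 1024]      # one 1000 MB file

assert estimate_partitions(small_files) == 1000
assert estimate_partitions(one_big_file) == 16  # ceil(1000 / 64)
```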

Re: Jython importing pyspark?

2014-10-05 Thread Matei Zaharia
PySpark doesn't attempt to support Jython at present. IMO while it might be a bit faster, it would lose a lot of the benefits of Python, which are the very strong data processing libraries (NumPy, SciPy, Pandas, etc). So I'm not sure it's worth supporting unless someone demonstrates a really

Re: TorrentBroadcast slow performance

2014-10-07 Thread Matei Zaharia
Maybe there is a firewall issue that makes it slow for your nodes to connect through the IP addresses they're configured with. I see there's this 10 second pause between Updated info of block broadcast_84_piece1 and ensureFreeSpace(4194304) called (where it actually receives the block). HTTP

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Thanks for the feedback. For 1, there is an open patch: https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but they're faster to access otherwise. Matei On Oct 9, 2014, at 12:11 PM,

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Oops I forgot to add, for 2, maybe we can add a flag to use DISK_ONLY for TorrentBroadcast, or if the broadcasts are bigger than some size. Matei On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks for the feedback. For 1, there is an open patch: https

Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
Hi Michael, I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but meanwhile you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It

Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
the values as a parquet binary type. Why not write them using the int64 parquet type instead? Cheers, Michael On Oct 12, 2014, at 3:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Michael, I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
of issues. Thanks in advance! On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB

Re: Raise Java dependency from 6 to 7

2014-10-18 Thread Matei Zaharia
I'd also wait a bit until these are gone. Jetty is unfortunately a much hairier topic by the way, because the Hadoop libraries also depend on Jetty. I think it will be hard to update. However, a patch that shades Jetty might be nice to have, if that doesn't require shading a lot of other stuff.

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: After successful

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Matei Zaharia
Hi Stephen, How did you generate your Maven workspace? You need to make sure the Hive profile is enabled for it. For example sbt/sbt -Phive gen-idea. Matei On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com wrote: I have run on the command line via maven and it is fine: mvn

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote: Hey Matt, There's some prior work that compares

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0 so you should try branch-1.1 for it). Matei On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer
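To try the sort-based shuffle mentioned above, the setting can go in spark-defaults.conf (a config fragment; in Spark 1.1 the default is still the hash-based manager):

```
# spark-defaults.conf: opt in to the sort-based shuffle (Spark 1.1+)
spark.shuffle.manager   sort
```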

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
this happen. Updated blog post: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you

[VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
need a maintainer for Mesos, and I wonder if there is someone that can be added to that? Tim On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Matei Zaharia
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in memory on this cluster, which can make shuffle much faster than with intermediate data on SSDs. You can find the specs in

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
, 2014 at 1:31 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
traffic, and be very active in design API discussions. That leads to better consistency and long-term design choices. Cheers, bc On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all full committers. However, I still think that the model proposed here is purely organizational (how the PMC and committers organize themselves), and in no way changes people's ownership

[RESULT] [VOTE] Designating maintainers for some Spark components

2014-11-08 Thread Matei Zaharia
is just to have a better structure for reviewing and minimize the chance of errors. Here is a tally of the votes: Binding votes (from PMC): 17 +1, no 0 or -1 Matei Zaharia Michael Armbrust Reynold Xin Patrick Wendell Andrew Or Prashant Sharma Mark Hamstra Xiangrui Meng Ankur Dave Imran Rashid Jason
