Xt*X should mathematically always be positive semi-definite, so the only way
this might be bad is if it’s not invertible due to linearly dependent rows.
This might happen due to the initialization or possibly due to numerical
issues, though it seems unlikely. Maybe it also happens if some users' data
leaves XtX semi-definite but not strictly positive definite.
--
Sean Owen | Director, Data Science | London
On Thu, Mar 6, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Xt*X should mathematically always be positive semi-definite, so the only way
this might be bad is if it's not invertible due to linearly dependent rows.
If the data has linearly dependent rows, ALS should have a fallback
mechanism: either remove the rows and then call LAPACK posv, or call LAPACK
gesv, or use Breeze's QR decomposition.
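For illustration, a minimal Breeze sketch of that fallback (a hypothetical
helper, not the actual MLlib code; the exact exception cholesky throws varies
by Breeze version):

    import breeze.linalg.{DenseMatrix, DenseVector, cholesky}

    // Solve (XtX) w = XtY, falling back if XtX is singular / not positive definite.
    def solveNormalEquations(xtx: DenseMatrix[Double],
                             xty: DenseVector[Double]): DenseVector[Double] = {
      try {
        val l = cholesky(xtx)     // fails when XtX is not positive definite
        val z = l \ xty           // solve L z = XtY
        l.t \ z                   // then L^T w = z
      } catch {
        case _: Exception =>      // hedge: exception type is version-dependent
          xtx \ xty               // general LAPACK solve (LU for square systems)
      }
    }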
I can share the analysis over email.
Thanks.
Deb
On Thu, Mar 6, 2014 at 9:39 AM, Matei Zaharia matei.zaha
I like the pom-reader approach as well — in particular, that it lets you add
extra stuff in your SBT build after loading the dependencies from the POM.
Profiles would be the one thing missing to be able to pass options through.
Matei
On Mar 14, 2014, at 10:03 AM, Patrick Wendell
Hey Nick, I’m curious, have you been doing any further development on this? It
would be good to get expanded InputFormat support in Spark 1.0. To start with
we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if
it’s easy to read in Python) or some kind of
at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hey Nick, I’m curious, have you been doing any further development on this?
It would be good to get expanded InputFormat support in Spark 1.0. To start
with we don’t have to do SequenceFiles in particular, we can do stuff like
Avro
Congrats Michael and all for getting this so far. Spark SQL and Catalyst will
make it much easier to use structured data in Spark, and open the door for some
very cool extensions later.
Matei
On Mar 20, 2014, at 11:15 PM, Heiko Braun ike.br...@googlemail.com wrote:
Congrats! That's a really
+1 looks good to me. I tried both the source and CDH4 versions and looked at
the new streaming docs.
The release notes seem slightly incomplete, but I guess you’re still working on
them? Anyway those don’t go into the release tarball so it’s okay.
Matei
On Mar 24, 2014, at 2:01 PM, Tathagata
it on a Linux cluster.
I opened https://spark-project.atlassian.net/browse/SPARK-1326 to track it. We
can put it in another RC if we find bigger issues.
Matei
On Mar 25, 2014, at 10:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
+1 looks good to me. I tried both the source and CDH4 versions
Sounds good. Feel free to send a PR even though it’s a small change (it leads
to better Git history and such).
Matei
On Mar 27, 2014, at 4:15 PM, Mark Hamstra m...@clearstorydata.com wrote:
FYI, Spark master does build cleanly and the tests do run successfully with
Scala version set to
worth the
annoyance of everyone needing to download a new version of Scala, making
yet another version of the AMIs, etc.
-Kay
On Thu, Mar 27, 2014 at 4:33 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Sounds good. Feel free to send a PR even though it's a small change (it
leads
The off-heap storage level is currently tied to Tachyon, but it might support
other forms of off-heap storage later. However it’s not really designed to be
mixed with the other ones. For this use case you may want to rely on memory
locality and have some custom code to push the data to the
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them
here:
https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
Matei
On Apr 7, 2014, at 9:45 PM,
The wiki is actually maintained separately in
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted
editing of the wiki because bots would automatically add stuff. I’ve given you
permissions now.
Matei
On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Try doing “gem install kramdown”. The maruku gem for Markdown throws these
errors, but Kramdown doesn’t.
Matei
On Apr 22, 2014, at 11:31 PM, DB Tsai dbt...@dbtsai.com wrote:
This is the trace.
Conversion error: There was an error converting 'docs/cluster-overview.md'.
Hi Nicolas,
Good catches on these things.
Your website seems a little bit incomplete. I have found this page [1], which
lists the two main mailing lists, users and dev. But I see a reference to a
mailing list about issues, which tracked Spark's issues when they were hosted
at Atlassian. I
We do want to support it eventually, possibly as early as Spark 1.1 (which we’d
cross-build on Scala 2.10 and 2.11). If someone wants to look at it before,
feel free to do so! Scala 2.11 is very close to 2.10 so I think things will
mostly work, except for possibly the REPL (which has require
Hey Soren, are you sure that the JAR you used on the executors is for the right
version of Spark? Maybe they’re running an older version. The Kryo serializer
should be initialized the same way on both.
Matei
On May 12, 2014, at 10:39 AM, Soren Macbeth so...@yieldbot.com wrote:
I finally
It was just because it might not work with some user data types that are
Serializable. But we should investigate it, as it’s the easiest thing one can
enable to improve performance.
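For reference, turning it on is a one-line configuration change (standard
Spark setting; the app name here is just for illustration):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")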
Matei
On May 12, 2014, at 2:47 PM, Anand Avati av...@gluster.org wrote:
Hi,
Can someone share the reason why
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link
only to a company’s builds; but we can perhaps include them in addition to
instructions for building Mesos from Apache.
Matei
On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote:
Andrew,
SHA-1 is being end-of-lifed, so I'd actually say switch to SHA-512 for all of
them instead.
On May 13, 2014, at 6:49 AM, Sean Owen so...@cloudera.com wrote:
On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote:
The release files, including signatures, digests, etc. can be
As others have said, the 1.0 milestone is about API stability, not about saying
“we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can
confidently build on Spark, knowing that the application they build today will
still run on Spark 1.9.9 three years from now. This is
BTW for what it’s worth I agree this is a good option to add, the only tricky
thing will be making sure the checkpoint blocks are not garbage-collected by
the block store. I don’t think they will be though.
Matei
On May 17, 2014, at 2:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
We do
We do actually have replicated StorageLevels in Spark. You can use
MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom
replication factor.
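For example (a sketch; assumes an existing rdd and the Spark 1.x
StorageLevel.apply signature of useDisk, useMemory, deserialized, replication):

    import org.apache.spark.storage.StorageLevel

    // Built-in level with 2x replication:
    rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
    // Or a custom level, e.g. 3x replication (deserialized, memory with disk spill):
    // rdd.persist(StorageLevel(true, true, true, 3))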
BTW you guys should probably have this discussion on the JIRA rather than the
dev list; I think the replies somehow ended up on the
stability can be addressed in minor releases if found, but
behavioral change and/or interface changes would be a much more invasive
issue for our users.
Regards
Mridul
On 18-May-2014 2:19 am, Matei Zaharia matei.zaha...@gmail.com wrote:
As others have said, the 1.0 milestone is about API
I took the always fun task of testing it on Windows, and unfortunately, I found
some small problems with the prebuilt packages due to recent changes to the
launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the
assembly JAR, and bin/run-example2.cmd doesn’t quite match
if it's a common approach to have discussions in JIRA not here.
I don't think it's the ASF way.
Regards,
Jacek Laskowski
http://blog.japila.pl
On May 17, 2014 at 23:55, Matei Zaharia matei.zaha...@gmail.com wrote:
We do actually have replicated StorageLevels in Spark. You can use
it was the easiest way
for people to continue.
Matei
On May 18, 2014, at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Ah, maybe it’s just different in other Apache projects. All the ones I’ve
participated in have had their design discussions on JIRA. For example take a
look at https
on a different version of org.apache.commons than Hadoop 2, but it
needs investigation. Tom, any thoughts on this?
Matei
On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
I took the always fun task of testing it on Windows, and unfortunately, I
found some small problems
“master” is where development happens, while branch-1.0, branch-0.9, etc are
for maintenance releases in those versions. Most likely if you want to
contribute you should use master. Some of the other named branches were for big
features in the past, but none are actively used now.
Matei
On
+1
Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed
that the issues in the previous RC were fixed.
Matei
On May 20, 2014, at 5:28 PM, Marcelo Vanzin van...@cloudera.com wrote:
+1 (non-binding)
I have:
- checked signatures and checksums of the files
- built
+1
Tested on Mac OS X and Windows.
Matei
On May 26, 2014, at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.0.0!
This has a few important bug fixes on top of rc10:
SPARK-1900 and SPARK-1918:
I think the question for me would be does this only happen when you call
partitionBy, or always? And how common do you expect calls to partitionBy to
be? If we can wait for 1.0.1 then I’d wait on this one.
Matei
On May 26, 2014, at 10:47 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey
Hi TaeYun, have you sent a pull request for this fix? We can review it there.
It's too late to merge anything but blockers for 1.0.0, but we can do it for
1.0.1 or 1.1, depending on how big the patch is.
Matei
On May 27, 2014, at 5:25 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr
This is a pretty cool idea — instead of cache depth I’d call it something like
reference counting. Would you mind opening a JIRA issue about it?
The issue of really composing together libraries that use RDDs nicely isn’t
fully explored, but this is certainly one thing that would help with it.
are the totals:
+1: (13 votes)
Matei Zaharia*
Mark Hamstra*
Holden Karau
Nick Pentreath*
Will Benton
Henry Saputra
Sean McNamara*
Xiangrui Meng*
Andy Konwinski*
Krishna Sankar
Kevin Markey
Patrick Wendell*
Tathagata Das*
0: (1 vote)
Ankur Dave*
-1: (0 votes)
Please hold off
Why do you need to call Serializer from your own program? It’s an internal
developer API so ideally it would only be called to extend Spark. Are you
looking to implement a custom Serializer?
Matei
On Jun 1, 2014, at 3:40 PM, Soren Macbeth so...@yieldbot.com wrote:
BTW passing a ClassTag tells the Serializer what the type of object being
serialized is when you compile your program, which will allow for more
efficient serializers (especially on streams).
Matei
On Jun 1, 2014, at 4:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Why do you need
it by making ClassTag
object in Clojure, but it's less than ideal.
On Sun, Jun 1, 2014 at 4:25 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
BTW passing a ClassTag tells the Serializer what the type of object being
serialized is when you compile your program, which will allow for more
, 2014 at 5:10 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Ah, got it. In general it will always be safe to pass the ClassTag for
java.lang.Object here — this is what our Java API does to say that type
info is not known. So you can always pass that. Look at the Java code for
how to get
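For illustration, that "type unknown" ClassTag can also be constructed
directly in Scala:

    import scala.reflect.ClassTag

    val anyTag: ClassTag[AnyRef] = ClassTag.AnyRef            // what the Java API passes
    val objTag: ClassTag[Object] = ClassTag(classOf[Object])  // equivalent, from a Class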
Madhu, can you send me your Wiki username? (Sending it just to me is fine.) I
can add you to the list to edit it.
Matei
On Jun 2, 2014, at 6:27 PM, Reynold Xin r...@databricks.com wrote:
I tried but didn't find where I could add you. You probably need Matei to
help out with this.
On
Done. Looks like this was lost in the JIRA import.
Matei
On Jun 3, 2014, at 11:33 AM, Henry Saputra henry.sapu...@gmail.com wrote:
Hi,
Could someone with right karma kindly add my username (hsaputra) to
Spark's contributor list?
I was added before but somehow now I can no longer assign
Yup, it’s meant to be just a Map. You should probably use collect() and build a
multimap instead if you’d like that.
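For example (a quick sketch; assumes a SparkContext sc):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    // collectAsMap() would keep only one value per key; group on the driver instead:
    val multimap: Map[String, Array[Int]] =
      pairs.collect().groupBy(_._1).mapValues(_.map(_._2))
    // Map(a -> Array(1, 2), b -> Array(3))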
Matei
On Jun 3, 2014, at 2:08 PM, Doris Xin doris.s@gmail.com wrote:
Hey guys,
Just wanted to check real quick if collectAsMap was by design not to
return a multimap
You can modify project/SparkBuild.scala and build Spark with sbt instead of
Maven.
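For illustration, since sbt builds are plain Scala, it comes down to
overriding one setting in project/SparkBuild.scala (the exact spot in the
file may differ):

    // e.g. in the shared settings:
    scalaVersion := "2.10.4"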
On Jun 5, 2014, at 12:36 PM, Meisam Fathi meisam.fa...@gmail.com wrote:
Hi community,
How should I change sbt to compile spark core with a different version
of Scala? I see maven pom files define
Yes, actually even if you don’t set it to true, on-disk data is compressed.
(This setting only affects serialized data in memory).
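For example (assumes an existing rdd), the setting only kicks in for
serialized in-memory levels:

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // affected by spark.rdd.compress
    // rdd.persist(StorageLevel.DISK_ONLY)    // compressed on disk regardless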
Matei
On Jun 11, 2014, at 2:56 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Hi,
Will spark.rdd.compress=true enable compression when using
(I’m forwarding this message on behalf of the ApacheCon organizers, who’d like
to see involvement from every Apache project!)
As you may be aware, ApacheCon will be held this year in Budapest, on November
17-23. (See http://apachecon.eu for more info.)
The Call For Papers for that conference
Hey Marcelo,
When we did the configuration pull request, we actually avoided having a big
list of defaults in one class file, because this creates a file that all the
components in the project depend on. For example, since we have some settings
specific to streaming and the REPL, do we want
+1
Tested it out on Mac OS X and Windows, looked through docs.
Matei
On Jun 26, 2014, at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.0.1!
The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
+1
Tested on Mac OS X.
Matei
On Jul 6, 2014, at 1:54 AM, Andrew Or and...@databricks.com wrote:
+1, verified that the UI bug is in fact fixed in
https://github.com/apache/spark/pull/1255.
2014-07-05 20:01 GMT-07:00 Soren Macbeth so...@yieldbot.com:
+1
On Sat, Jul 5, 2014 at 7:41
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead
with this release as is. This release didn't touch the Mesos support as far as
I know, so the problem might be a nondeterministic issue with your application.
But on the other hand the release does fix some critical
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such),
so maybe there's a problem in YourKit or in your release of the JVM. Otherwise
I'd suggest increasing the heap size of the unit tests a bit (you can do this
in the SBT build file). Maybe they are very close to full
Yeah, I'd just add a spark-util that has these things.
Matei
On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote:
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure. However, the dependency should be very small and
thus
You can actually turn off shuffle compression by setting spark.shuffle.compress
to false. Try that out, there will still be some buffers for the various
OutputStreams, but they should be smaller.
Matei
On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com
wrote:
Just a
Yeah, that seems like something we can inline :).
On Jul 15, 2014, at 7:30 PM, Baofeng Zhang pelickzh...@qq.com wrote:
Is Matei following this?
Catalyst uses the Utils to get the ClassLoader which loaded Spark.
Can Catalyst directly do getClass.getClassLoader to avoid the dependency
on
Hey Reynold, just to clarify, users will still have to manually broadcast
objects that they want to use *across* operations (e.g. in multiple iterations
of an algorithm, or multiple map functions, or stuff like that). But they won't
have to broadcast something they only use once.
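A small sketch of the distinction (hypothetical names; assumes a SparkContext
sc):

    val table = Map(1 -> "a", 2 -> "b")       // some large local object
    val bcast = sc.broadcast(table)           // explicit broadcast, reusable across ops
    val nums = sc.parallelize(1 to 100)
    val once = nums.map(x => table.getOrElse(x, "?"))       // shipped with this one closure
    val many = nums.map(x => bcast.value.getOrElse(x, "?")) // fetched once, reused everywhere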
Matei
On Jul
+1
Tested this on Mac OS X.
Matei
On Jul 25, 2014, at 4:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.0.2.
This release fixes a number of bugs in Spark 1.0.1.
Some of the notable ones are
- SPARK-2452:
or somesuch, but
testing for A will give an incorrect answer, and the code can't be
expected to look for everyone's A+X versions. Actually inspecting
the code is more robust if a bit messier.
On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
For this particular issue
I agree as well. FWIW sometimes I've seen this happen due to language barriers,
i.e. contributors whose primary language is not English, but we need more
motivation for each change.
On July 29, 2014 at 5:12:01 PM, Nicholas Chammas (nicholas.cham...@gmail.com)
wrote:
+1 on using JIRA workflows
Hah, weird. log should be protected actually (look at trait Logging). Is your
class extending SparkContext or somehow being placed in the org.apache.spark
package? Or maybe the Scala compiler looks at it anyway; in that case we can
rename it. Please open a JIRA for it if that's the case.
On
Hi everyone,
The PMC recently voted to add two new committers and PMC members: Joey Gonzalez
and Andrew Or. Both have been huge contributors in the past year -- Joey on
much of GraphX as well as quite a bit of the initial work in MLlib, and Andrew
on Spark Core. Join me in welcoming them as
Just as a note, when you're developing stuff, you can use test-only in sbt,
or the equivalent feature in Maven, to run just some of the tests. This is what
I do, I don't wait for Jenkins to run things. 90% of the time if it passes the
tests that I know could break stuff, it will pass all of
, coarse-grained mode would be a challenge as we have to
constantly remind people to kill their shells as soon as their queries finish.
Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark
Standalone's cpu allocation behavior?
On Sat, Aug 23, 2014 at 7:16 PM, Matei
This is on nodes with ~15G of memory, on which we have successfully run 8G jobs.
On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
BTW it seems to me that even without that patch, you should be getting tasks
launched as long as you leave at least 32 MB of memory
for this one.
Matei
On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:
This is kind of weird then, seems perhaps unrelated to this issue (or at least
to the way I understood it). Is the problem maybe that Mesos saw 0 MB being
freed and didn't re-offer the machine *even
Was the original issue with Spark 1.1 (i.e. master branch) or an earlier
release?
One possibility is that your S3 bucket is in a remote Amazon region, which
would make it very slow. In my experience though saveAsTextFile has worked even
for pretty large datasets in that situation, so maybe
Chen (tnac...@gmail.com) wrote:
Hi Matei,
I'm going to investigate from both Mesos and Spark side will hopefully
have a good long term solution. In the mean time having a work around
to start with is going to unblock folks.
Tim
On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia matei.zaha
the synthetic operation and see if I get the same results or not.
Amnon
On Mon, Aug 25, 2014 at 11:26 PM, Matei Zaharia [via Apache Spark
Developers List] ml-node+s1001551n8000...@n3.nabble.com wrote:
Was the original issue with Spark 1.1 (i.e. master branch) or an earlier
release?
One
Hey Nicholas,
In general we've been looking at these periodically (at least I have) and
asking people to close out of date ones, but it's true that the list has gotten
fairly large. We should probably have an expiry time of a few months and close
them automatically. I agree that it's daunting
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI
from a known URL. Seems like an issue in publishing this release.
On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman
(shiva...@eecs.berkeley.edu) wrote:
This is a chicken and egg problem in some sense. We can't
Awesome to hear this, Mayur! Thanks for putting this together.
Matei
On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com)
wrote:
Hi,
We have migrated Pig functionality on top of Spark, passing 100% e2e for
success cases in the Pig test suite. That means UDFs, joins, other
Personally I'd actually consider putting CDH4 back if there are still users on
it. It's always better to be inclusive, and the convenience of a one-click
download is high. Do we have a sense on what % of CDH users still use CDH4?
Matei
On August 28, 2014 at 11:31:13 PM, Sean Owen
Hi Nicholas,
At Databricks we already run https://github.com/databricks/spark-perf for each
release, which is a more comprehensive performance test suite.
Matei
On September 1, 2014 at 8:22:05 PM, Nicholas Chammas
(nicholas.cham...@gmail.com) wrote:
What do people think of running the Big
Hi Du,
I don't think NullWritable has ever been serializable, so you must be doing
something differently from your previous program. In this case though, just use
a map() to turn your Writables to serializable types (e.g. null and String).
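For example (illustrative types and path):

    import org.apache.hadoop.io.{NullWritable, Text}

    val strings = sc.sequenceFile("hdfs://path/to/data", classOf[NullWritable], classOf[Text])
      .map { case (_, v) => v.toString } // Text -> String: serializable, cache-friendly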
Matei
On September 12, 2014 at 8:48:36 PM, Du Li
.count(). As you can
see, count() does not need to serialize and ship data while the other three
methods do.
Do you recall any difference between spark 1.0 and 1.1 that might cause this
problem?
Thanks,
Du
From: Matei Zaharia matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
Hey Sandy,
On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:
Hey All,
A couple questions came up about shared variables recently, and I wanted to
confirm my understanding and update the doc to be a little more clear.
*Broadcast variables*
Now that tasks data
:10 AM, Matei Zaharia wrote:
Hey Sandy,
On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:
Hey All,
A couple questions came up about shared variables recently, and I wanted to
confirm my understanding and update the doc to be a little more clear.
*Broadcast
Hi Tom,
HDFS and Spark don't actually have a minimum block size -- so in that first
dataset, the files won't each be costing you 64 MB. However, the main reason
for difference in performance here is probably the number of RDD partitions. In
the first case, Spark will create an RDD with 1
PySpark doesn't attempt to support Jython at present. IMO while it might be a
bit faster, it would lose a lot of the benefits of Python, which are the very
strong data processing libraries (NumPy, SciPy, Pandas, etc). So I'm not sure
it's worth supporting unless someone demonstrates a really
Maybe there is a firewall issue that makes it slow for your nodes to connect
through the IP addresses they're configured with. I see there's this 10-second
pause between "Updated info of block broadcast_84_piece1" and
"ensureFreeSpace(4194304) called" (where it actually receives the block). HTTP
I'm pretty sure inner joins on Spark SQL already build only one of the sides.
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer
joins do both, and it seems like we could optimize it for those that are not
full.
Matei
On Oct 7, 2014, at 11:04 PM, Haopu Wang
Thanks for the feedback. For 1, there is an open patch:
https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use
MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but
they're faster to access otherwise.
Matei
On Oct 9, 2014, at 12:11 PM,
Oops I forgot to add, for 2, maybe we can add a flag to use DISK_ONLY for
TorrentBroadcast, or if the broadcasts are bigger than some size.
Matei
On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Thanks for the feedback. For 1, there is an open patch:
https
Hi Michael,
I've been working on this in my repo:
https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with
these features soon, but meanwhile you can try this branch. See
https://github.com/mateiz/spark/compare/decimal for the individual commits that
went into it. It
the values as a parquet binary type. Why not write them using the int64
parquet type instead?
Cheers,
Michael
On Oct 12, 2014, at 3:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi Michael,
I've been working on this in my repo:
https://github.com/mateiz/spark/tree/decimal. I'll make
of issues. Thanks in advance!
On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which is that we've been able to use Spark to
break MapReduce's 100 TB
I'd also wait a bit until these are gone. Jetty is unfortunately a much hairier
topic by the way, because the Hadoop libraries also depend on Jetty. I think it
will be hard to update. However, a patch that shades Jetty might be nice to
have, if that doesn't require shading a lot of other stuff.
After successful events in the past two years, the Spark Summit conference has
expanded for 2015, offering both an event in New York on March 18-19 and one in
San Francisco on June 15-17. The conference is a great chance to meet people
from throughout the Spark community and see the latest
BTW several people asked about registration and student passes. Registration
will open in a few weeks, and like in previous Spark Summits, I expect there to
be a special pass for students.
Matei
On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
After successful
Hi Stephen,
How did you generate your Maven workspace? You need to make sure the Hive
profile is enabled for it. For example sbt/sbt -Phive gen-idea.
Matei
On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com wrote:
I have run on the command line via maven and it is fine:
mvn
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
better performance while creating fewer files. So I'd suggest trying that too.
Matei
On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:
Hey Matt,
There's some prior work that compares
(BTW this had a bug with negative hash codes in 1.1.0 so you should try
branch-1.1 for it).
Matei
On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
better performance while creating fewer
this happen.
Updated blog post:
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you
Hi all,
I wanted to share a discussion we've been having on the PMC list, as well as
call for an official vote on it on a public list. Basically, as the Spark
project scales up, we need to define a model to make sure there is still great
oversight of key components (in particular internal
need a maintainer for Mesos, and I wonder if there
is someone that can be added to that?
Tim
On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi all,
I wanted to share a discussion we've been having on the PMC list, as well as
call for an official vote
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps
Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in
memory on this cluster, which can make shuffle much faster than with
intermediate data on SSDs. You can find the specs in
, 2014 at 1:31 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi all,
I wanted to share a discussion we've been having on the PMC list, as well as
call for an official vote on it on a public list. Basically, as the Spark
project scales up, we need to define a model to make sure there is still
traffic, and be very active in design API discussions.
That leads to better consistency and long-term design choices.
Cheers,
bc
On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi all,
I wanted to share a discussion we've
On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
Hi all,
I wanted to share a discussion we've been having on the PMC list, as well
as call for an official vote on it on a public list. Basically, as the
Spark project scales up, we need to define a model to make sure
Alright, Greg, I think I understand how Subversion's model is different, which
is that the PMC members are all full committers. However, I still think that
the model proposed here is purely organizational (how the PMC and committers
organize themselves), and in no way changes people's ownership
is just to have a better
structure for reviewing and minimize the chance of errors.
Here is a tally of the votes:
Binding votes (from PMC): 17 +1, no 0 or -1
Matei Zaharia
Michael Armbrust
Reynold Xin
Patrick Wendell
Andrew Or
Prashant Sharma
Mark Hamstra
Xiangrui Meng
Ankur Dave
Imran Rashid
Jason