Re: Joining the spark dev community

2014-10-19 Thread Henry Saputra
Hi Saurabh,

A good way to start is to use Spark with your applications, file
issues you find, and maybe provide patches for those or for
existing ones.

Please take a look at Spark's how to contribute page [1] to help you
get started.

Hope this helps.

- Henry


[1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Sat, Oct 18, 2014 at 1:46 PM, Saurabh Wadhawan
saurabh.wadha...@guavus.com wrote:
 How can I become a Spark contributor?
 What's a good path that I can follow to go from newbie to active code
 submitter for Spark?

 Regards
 - Saurabh





Re: Breaking the previous large-scale sort record with Spark

2014-10-11 Thread Henry Saputra
Congrats to Reynold et al. for leading this effort!

- Henry

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some pretty 
 cool news for the project, which is that we've been able to use Spark to 
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x 
 fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few weeks, 
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
 addition, we'd really like to thank Amazon's EC2 team for providing the 
 machines to make this possible. Finally, this result would of course not be 
 possible without the many many other contributions, testing and feature 
 requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and it's 
 thanks to all of you folks that we are able to make this happen.

 Matei
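(For readers curious what a sort job looks like at toy scale, a minimal sketch that can be pasted into spark-shell, assuming the usual `sc`; the record runs of course used purpose-built data generators, storage, and tuning, none of which is shown here.)

    import org.apache.spark.SparkContext._

    // Toy sort-by-key: random Long keys, then a full distributed sort.
    val records = sc.parallelize(1 to 1000000, 100)
      .map(i => (new scala.util.Random(i).nextLong(), i))
    val sorted = records.sortByKey()
    println(sorted.take(5).mkString("\n"))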



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-04 Thread Henry Saputra
LICENSE and NOTICE files are good
Hash files are good
Signature files are good
No third-party executables
Source compiled
Run local and standalone tests
Test persist off heap with Tachyon looks good
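(For context, the off-heap check above boils down to something like the following spark-shell sketch; it assumes a reachable Tachyon instance, e.g. via spark.tachyonStore.url, which is how OFF_HEAP storage was backed in this release line.)

    import org.apache.spark.storage.StorageLevel

    // Persist an RDD off-heap (Tachyon-backed in Spark 1.x) and force materialization.
    val data = sc.parallelize(1 to 100000).map(i => (i, i.toString))
    data.persist(StorageLevel.OFF_HEAP)
    println(data.count())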

+1

- Henry

On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.1.0!

 The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc4/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1031/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/

 Please vote on releasing this package as Apache Spark 1.1.0!

 The vote is open until Saturday, September 06, at 08:30 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.1.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == Regressions fixed since RC3 ==
 SPARK-3332 - Issue with tagging in EC2 scripts
 SPARK-3358 - Issue with regression for m3.XX instances

 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.0.X will not block
 this release.

 == What default changes should I be aware of? ==
 1. The default value of spark.io.compression.codec is now snappy
 -- Old behavior can be restored by switching to lzf

 2. PySpark now performs external spilling during aggregations.
 -- Old behavior can be restored by setting spark.shuffle.spill to false.

 3. PySpark uses a new heuristic for determining the parallelism of
 shuffle operations.
 -- Old behavior can be restored by setting
 spark.default.parallelism to the number of cores in the cluster.
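
 (For anyone who wants to restore the 1.0.x behavior described above while testing the RC, a rough sketch of the equivalent SparkConf settings; property names are taken from the notes above, the parallelism value is illustrative, and for PySpark the same keys go into its SparkConf.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Restore the pre-1.1.0 defaults listed above (values are illustrative).
    val conf = new SparkConf()
      .setAppName("rc4-legacy-defaults")
      .set("spark.io.compression.codec", "lzf")   // 1.1.0 now defaults to snappy
      .set("spark.shuffle.spill", "false")        // turn off the new external spilling
      .set("spark.default.parallelism", "16")     // e.g. the number of cores in the cluster
    val sc = new SparkContext(conf)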




Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Henry Saputra
Welcome Shane =)


- Henry

On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote:
 so, i had a meeting w/the databricks guys on friday and they recommended i
 send an email out to the list to say 'hi' and give you guys a quick intro.
  :)

 hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
 time getting the jenkins build infrastructure up to production quality.
  much of this will be 'under the covers' work, like better system level
 auth, backups, etc, but some will definitely be user facing:  timely
 jenkins updates, debugging broken build infrastructure and some plugin
 support.

 i've been working in the bay area now since 1997 at many different
 companies, and my last 10 years has been split between google and palantir.
  i'm a huge proponent of OSS, and am really happy to be able to help with
 the work you guys are doing!

 if anyone has any requests/questions/comments, feel free to drop me a line!

 shane




Re: [Spark SQL] off-heap columnar store

2014-08-25 Thread Henry Saputra
Hi Michael,

This is great news.
Is there any initial proposal or design for the Tachyon caching that you
can share so far?

I don't think there is a JIRA ticket open to track this feature yet.

- Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
mich...@databricks.com wrote:

 What is the plan for getting Tachyon/off-heap support for the columnar
 compressed store?  It's not in 1.1 is it?


 It is not in 1.1, and there are no concrete plans for adding it at this
 point.  Currently, there is more engineering investment going into caching
 parquet data in Tachyon instead.  This approach is going to have much
 better support for nested data, leverages other work being done on parquet,
 and alleviates your concerns about wire format compatibility.

 That said, if someone really wants to try and implement it, I don't think
 it would be very hard.  The primary issue is going to be designing a clean
 interface that is not too tied to this one implementation.


 Also, how likely is the wire format for the columnar compressed data
 to change?  That would be a problem for write-through or persistence.


 We aren't making any guarantees at the moment that it won't change. It's
 currently only intended for temporary caching of data.
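
 (A rough sketch of the Parquet-in-Tachyon direction described above, against the 1.1-era SQL API; the tachyon:// URI, paths, and table name are made up for illustration.)

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

    // Write data out as Parquet onto a Tachyon-backed path (hypothetical URI),
    // then read it back; Tachyon keeps the bytes in off-heap memory.
    val events = sqlContext.jsonFile("hdfs:///data/events.json")
    events.saveAsParquetFile("tachyon://tachyon-master:19998/cache/events.parquet")

    val cached = sqlContext.parquetFile("tachyon://tachyon-master:19998/cache/events.parquet")
    cached.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)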




Re: Spark Contribution

2014-08-21 Thread Henry Saputra
The Apache Spark wiki on how to contribute should be great place to
start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

- Henry

On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
 Hi,

 Can someone help me with some links on how to contribute to Spark?

 Regards
 mns




Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-28 Thread Henry Saputra
NOTICE and LICENSE files look good
Hashes and sigs look good
No executable in the source distribution
Compile source and run standalone

+1

- Henry

On Fri, Jul 25, 2014 at 4:08 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.2.

 This release fixes a number of bugs in Spark 1.0.1.
 Some of the notable ones are
 - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
 SPARK-1199. The fix was reverted for 1.0.2.
 - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
 HDFS CSV file.
 The full list is at http://s.apache.org/9NJ

 The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f

 The release files, including signatures, digests, etc can be found at:
 http://people.apache.org/~tdas/spark-1.0.2-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1024/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.0.2!

 The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 [ ] +1 Release this package as Apache Spark 1.0.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/


Re: -1s on pull requests?

2014-07-21 Thread Henry Saputra
There are ASF guidelines about voting, including code review for
patches: http://www.apache.org/foundation/voting.html

Some ASF projects require three +1 votes (on the issue, e.g., the JIRA
or GitHub PR in this case) for a patch, unless it is tagged for lazy
consensus [1] of, say, 48 hours.
For patches that are not critical, waiting a while to give additional
committers time to review would be the best way to go.

Another thing is that all contributors need to be patient once their
patches have been submitted and are pending review. This is part of
being in an open community.

Hope this helps.


- Henry

[1] http://www.apache.org/foundation/glossary.html#LazyConsensus

On Mon, Jul 21, 2014 at 1:59 PM, Patrick Wendell pwend...@gmail.com wrote:
 I've always operated under the assumption that if a committer makes a
 comment on a PR, and that's not addressed, that should block the PR
 from being merged (even without a specific -1). I don't know of any
 cases where this has intentionally been violated, but I do think this
 happens accidentally sometimes.

 Unfortunately, we are not allowed to use those github hooks because of
 the way the ASF github integration works.

 I've lately been using a custom-made tool to help review pull
 requests. One thing I could do is add a feature here saying which
 committers have said LGTM on a PR (vs the ones that have commented).
 We could also indicate the latest test status as Green/Yellow/Red
 based on the Jenkins comments:

 http://pwendell.github.io/spark-github-shim/

 As a warning to potential users, my tool might crash your browser.

 - Patrick

 On Mon, Jul 21, 2014 at 1:44 PM, Kay Ousterhout k...@eecs.berkeley.edu 
 wrote:
 Hi all,

 As the number of committers / contributors on Spark has increased, there
 are cases where pull requests get merged before all the review comments
 have been addressed. This happens say when one committer points out a
 problem with the pull request, and another committer doesn't see the
 earlier comment and merges the PR before the comment has been addressed.
  This is especially tricky for pull requests with a large number of
 comments, because it can be difficult to notice early comments describing
 blocking issues.

 This also happens when something accidentally gets merged after the tests
 have started but before tests have passed.

 Do folks have ideas on how we can handle this issue? Are there other
 projects that have good ways of handling this? It looks like for Hadoop,
 people can -1 / +1 on the JIRA.

 -Kay


Re: Announcing Spark 1.0.1

2014-07-11 Thread Henry Saputra
Congrats to the Spark community!

On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote:

 I am happy to announce the availability of Spark 1.0.1! This release
 includes contributions from 70 developers. Spark 1.0.1 includes fixes
 across several areas of Spark, including the core API, PySpark, and
 MLlib. It also includes new features in Spark's (alpha) SQL library,
 including support for JSON data and performance and stability fixes.

 Visit the release notes[1] to read about this release or download[2]
 the release today.

 [1] http://spark.apache.org/releases/spark-release-1-0-1.html
 [2] http://spark.apache.org/downloads.html



Re: Run ScalaTest inside Intellij IDEA

2014-06-17 Thread Henry Saputra
I got stuck on this one too after doing a git pull from master.

Have not been able to resolve it yet =(


- Henry

On Wed, Jun 11, 2014 at 6:51 AM, Yijie Shen henry.yijies...@gmail.com wrote:
 Thanks Qiuzhuang, the problems disappeared after I added the assembly jar at the
 head of the dependency list in *.iml, but while running tests in Spark SQL
 (SQLQuerySuite in sql-core), another two errors occur:

 Error 1:
 Error:scalac:
  while compiling: 
 /Users/yijie/code/apache.spark.master/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
 during phase: jvm
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args: -Xmax-classfile-name 120 -deprecation 
 -P:genjavadoc:out=/Users/yijie/code/apache.spark.master/sql/core/target/java 
 -feature -classpath 
 /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Conte…
 …
 ...
 /Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/classes:/Users/yijie/code/apache.spark.master/lib_managed/jars/scala-library-2.10.4.jar
  
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
  
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
   last tree to typer: Literal(Constant(parquet.io.api.Converter))
   symbol: null
symbol definition: null
  tpe: Class(classOf[parquet.io.api.Converter])
symbol owners:
   context owners: object TestSQLContext - package test
 == Enclosing template or block ==
 Template( // val local TestSQLContext: notype in object TestSQLContext, 
 tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
   org.apache.spark.sql.SQLContext // parents
   ValDef(
 private
 _
 tpt
 empty
   )
   // 2 statements
   DefDef( // private def readResolve(): Object in object TestSQLContext
 method private synthetic
 readResolve
 []
 List(Nil)
 tpt // tree.tpe=Object
 test.this.TestSQLContext // object TestSQLContext in package test, 
 tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
   )
   DefDef( // def init(): org.apache.spark.sql.test.TestSQLContext.type in 
 object TestSQLContext
 method
 init
 []
 List(Nil)
 tpt // tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
 Block( // tree.tpe=Unit
   Apply( // def init(sparkContext: org.apache.spark.SparkContext): 
 org.apache.spark.sql.SQLContext in class SQLContext, 
 tree.tpe=org.apache.spark.sql.SQLContext
 TestSQLContext.super.init // def init(sparkContext: 
 org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class 
 SQLContext, tree.tpe=(sparkContext: 
 org.apache.spark.SparkContext)org.apache.spark.sql.SQLContext
 Apply( // def init(master: String,appName: String,conf: 
 org.apache.spark.SparkConf): org.apache.spark.SparkContext in class 
 SparkContext, tree.tpe=org.apache.spark.SparkContext
   new org.apache.spark.SparkContext.init // def init(master: 
 String,appName: String,conf: org.apache.spark.SparkConf): 
 org.apache.spark.SparkContext in class SparkContext, tree.tpe=(master: 
 String, appName: String, conf: 
 org.apache.spark.SparkConf)org.apache.spark.SparkContext
   // 3 arguments
   local
   TestSQLContext
   Apply( // def init(): org.apache.spark.SparkConf in class 
 SparkConf, tree.tpe=org.apache.spark.SparkConf
 new org.apache.spark.SparkConf.init // def init(): 
 org.apache.spark.SparkConf in class SparkConf, 
 tree.tpe=()org.apache.spark.SparkConf
 Nil
   )
 )
   )
   ()
 )
   )
 )
 == Expanded type of tree ==
 ConstantType(value = Constant(parquet.io.api.Converter))
 uncaught exception during compilation: java.lang.AssertionError

 Error 2:

 Error:scalac: Error: assertion failed: List(object package$DebugNode, object 
 package$DebugNode)
 java.lang.AssertionError: assertion failed: List(object package$DebugNode, 
 object package$DebugNode)
 at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
 at 
 scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
 

Re: Emergency maintenance on Jenkins

2014-06-10 Thread Henry Saputra
Thanks for letting us know Patrick.

- Henry

On Monday, June 9, 2014, Patrick Wendell pwend...@gmail.com wrote:

 Just a heads up - due to an outage at UCB we've lost several of the
 Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
 compensate, but this might fail some ongoing builds.

 The good news is if we do get it working with EC2 workers, then we
 will have burst capability in the future - e.g. on release deadlines.
 So it's not all bad!

 - Patrick



Removing spark-debugger.md file from master?

2014-06-03 Thread Henry Saputra
Hi All,

It seems that spark-debugger.md is no longer accurate (see
http://spark.apache.org/docs/latest/spark-debugger.html); Spark has
evolved since it was originally written, which makes the doc obsolete.

There is already work pending for new replay debugging (I could not
find the PR links for it).

With version control we could always reinstate the old doc if needed,
but as of today the doc no longer reflects the current state of
Spark's RDDs.

If there are no objections, I could send a PR to remove the md file in master.

Thoughts?

- Henry


Re: Removing spark-debugger.md file from master?

2014-06-03 Thread Henry Saputra
Cool, thanks Ankur, sounds good. PR is coming.

- Henry

On Tue, Jun 3, 2014 at 11:11 AM, Ankur Dave ankurd...@gmail.com wrote:
 I agree, let's go ahead and remove it.

 Ankur http://www.ankurdave.com/


Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Henry Saputra
Hi,

Could someone with the right karma kindly add my username (hsaputra) to
Spark's contributor list?

I was added before, but somehow I can now no longer assign tickets to
myself or update tickets I am working on.


Thanks,

- Henry


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Henry Saputra
NOTICE and LICENSE files look good
Signatures look good.
Hashes look good
No external executables in the source distributions
Source compiled with sbt
Run local and standalone examples look good.

+1


- Henry

On Mon, May 26, 2014 at 7:38 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior
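
 (The two toSeq notes above are one-line changes at call sites; a small sketch with made-up data, for anyone updating code against this RC.)

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // assumes an existing SparkContext `sc` (e.g. in spark-shell)
    val pairs1 = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val pairs2 = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // cogroup now yields Iterable[T]; call toSeq where downstream code expects Seq[T]
    val grouped = pairs1.cogroup(pairs2)
    val legacy  = grouped.mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }

    // jarOfClass now returns Option[String]; toSeq restores the old Seq[String] shape
    val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq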


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Henry Saputra
Looks like SPARK-1900 is a blocker for YARN, and we might as well add
SPARK-1870 while we're at it.

TD or Patrick, could you kindly send out an email with [CANCEL] prefixed
in the subject for the RC10 vote, to help people follow the active VOTE
threads? The VOTE emails are getting a bit hard to follow.


- Henry


On Thu, May 22, 2014 at 2:05 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Hey all,

 On further testing, I came across a bug that breaks execution of
 pyspark scripts on YARN.
 https://issues.apache.org/jira/browse/SPARK-1900
 This is a blocker and worth cutting a new RC.

 We also found a fix for a known issue that prevents additional jar
 files from being specified through spark-submit on YARN.
 https://issues.apache.org/jira/browse/SPARK-1870
 This has been fixed and will be in the next RC.

 We are canceling this vote for now. We will post RC11 shortly. Thanks
 everyone for testing!

 TD



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-16 Thread Henry Saputra
Hi Sandy,

Just curious if the Vote is for rc5 or rc6? Gmail shows me that you
replied to the rc5 thread.

Thanks,

- Henry

On Wed, May 14, 2014 at 1:28 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 +1 (non-binding)

 * Built the release from source.
 * Compiled Java and Scala apps that interact with HDFS against it.
 * Ran them in local mode.
 * Ran them against a pseudo-distributed YARN cluster in both yarn-client
 mode and yarn-cluster mode.


 On Tue, May 13, 2014 at 9:09 PM, witgo wi...@qq.com wrote:

 You need to set:
 spark.akka.frameSize 5
 spark.default.parallelism 1





 -- Original --
 From:  Madhu;ma...@madhu.com;
 Date:  Wed, May 14, 2014 09:15 AM
 To:  devd...@spark.incubator.apache.org;

 Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



 I just built rc5 on Windows 7 and tried to reproduce the problem described
 in

 https://issues.apache.org/jira/browse/SPARK-1712

 It works on my machine:

 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
 in 4.548 s
 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
 have all completed, from pool
 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17,
 took
 4.814991993 s
 res1: Double = 5.05E11

 I used all defaults, no config files were changed.
 Not sure if that makes a difference...



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 .


Re: minor optimizations to get my feet wet

2014-04-10 Thread Henry Saputra
Hi Ignacio,

Thank you for your contribution.

Just a friendly reminder: in case you have not contributed to Apache
Software Foundation projects before, please submit the ASF ICLA form [1],
or, if you are sponsored by your company, also ask the company to send a
CCLA [2] to clear the intellectual property for your contributions.

You can ignore the preferred Apache id section for now.


Thank you,

Henry Saputra

[1] https://www.apache.org/licenses/icla.txt
[2] http://www.apache.org/licenses/cla-corporate.txt


On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
 Hi, all -

 First off, I want to say that I love spark and am very excited about
 MLBase. I'd love to contribute now that I have some time, but before I do
 that I'd like to familiarize myself with the process.

 In looking for a few projects and settling on one which I'll discuss in
 another thread, I found some very minor optimizations I could contribute,
 again, as part of this first step.

 Before I initiate a PR, I've gone ahead and tested style, ran tests, etc
 per the instructions, but I'd still like to have someone quickly glance
 over it and ensure that these are JIRA worthy.

 Commit:
 https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402

 To summarize:

 * I got rid of some SeqLike.reverse calls when sorting by descending order
 * replaced slice(1, length) calls with the much safer (avoids IOOBEs) and
 more readable .tail calls
 * used a foldLeft to avoid using mutable variables in NaiveBayes code

 This last one is meant to gauge what's valued more: idiomatic
 Scala development or readability. I'm personally a fan of foldLefts where
 applicable, but do think they're a bit less readable.

 Thanks,
 Ignacio
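
(The three kinds of changes above, sketched in plain Scala purely for illustration; this is not the actual patch.)

    // 1. Sort descending without SeqLike.reverse
    val scores = Seq(3.0, 1.0, 2.0)
    val descBefore = scores.sorted.reverse   // sort, then reverse
    val descAfter  = scores.sortBy(-_)       // single sort on the negated key

    // 2. Prefer tail over slice(1, length)
    val xs = List(10, 20, 30)
    val restBefore = xs.slice(1, xs.length)
    val restAfter  = xs.tail

    // 3. foldLeft instead of a mutable accumulator
    var mutableSum = 0
    xs.foreach(mutableSum += _)
    val foldedSum = xs.foldLeft(0)(_ + _)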


Re: minor optimizations to get my feet wet

2014-04-10 Thread Henry Saputra
You are welcome, thanks again for contributing =)

- Henry

On Thu, Apr 10, 2014 at 3:17 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
 I don't think there's a noticeable performance hit from the use of reverse in
 those cases. It was a quick set of changes and it helped me understand what
 you look for. I didn't intend to nitpick, so I'll leave it as is. I could have
 used a scala.Ordering implicitly/explicitly as well, but that seems overkill and
 I don't want to start a discussion about what's best--unless one
 of the admins deems this important.

 I'll only keep the use of take and tail over slice, and switch over to
 math.min where indicated.

 This after I follow Henry's timely advice--thanks, Henry.

 cheers.




 On Thu, Apr 10, 2014 at 2:10 PM, Reynold Xin r...@databricks.com wrote:

 Thanks for contributing!

 I think often unless the feature is gigantic, you can send a pull request
 directly for discussion. One rule of thumb in the Spark code base is that
 we typically prefer readability over conciseness, and thus we tend to avoid
 using too much Scala magic or operator overloading.

 In this specific case, do you know if using - instead of reverse improves
 performance? I personally find it slightly awkward to use underscore right
 after negation ...


 The tail change looks good to me.

 For foldLeft, I agree with you that the old way is more readable (although
 less idiomatic scala).




 On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:

  Hi, all -
 
  First off, I want to say that I love spark and am very excited about
  MLBase. I'd love to contribute now that I have some time, but before I do
  that I'd like to familiarize myself with the process.
 
  In looking for a few projects and settling on one which I'll discuss in
  another thread, I found some very minor optimizations I could contribute,
  again, as part of this first step.
 
  Before I initiate a PR, I've gone ahead and tested style, ran tests, etc
  per the instructions, but I'd still like to have someone quickly glance
  over it and ensure that these are JIRA worthy.
 
  Commit:
 
 
 https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402
 
  To summarize:
 
  * I got rid of some SeqLike.reverse calls when sorting by descending
 order
  * replaced slice(1, length) calls with the much safer (avoids IOOBEs) and
  more readable .tail calls
  * used a foldleft to avoid using mutable variables in NaiveBayes code
 
  This last one is meant to understand what's valued more between idiomatic
  Scala development or readability. I'm personally a fan of foldLefts where
  applicable, but do think they're a bit less readable.
 
  Thanks,
  Ignacio
 



Re: JIRA. github and asf updates

2014-03-29 Thread Henry Saputra
With the speed of comment updates in JIRA by the Spark dev community, +1 for
the issues@ list.

- Henry

On Saturday, March 29, 2014, Patrick Wendell pwend...@gmail.com wrote:

 Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
 desirable. I think we should send them to the issues@ list.


 On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell 
  pwend...@gmail.com
 wrote:

  Mridul,
 
  You can unsubscribe yourself from any of these sources, right?
 
  - Patrick
 
 
  On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan 
   mri...@gmail.com
 wrote:
 
  Hi,
 
So we are now receiving updates from three sources for each change to
  the PR.
  While each of them handles a corner case which the others might miss, it
  would be great if we could minimize the volume of duplicated
  communication.
 
 
  Regards,
  Mridul
 
 
 



Re: Largest input data set observed for Spark.

2014-03-20 Thread Henry Saputra
Reynold, just curious, did you guys run it in AWS?

- Henry

On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin r...@databricks.com wrote:
 Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
 I didn't count the size of the uncompressed data, but I am guessing it is
 somewhere between 200TB and 700TB.



 On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote:

 All,
 What is the largest input data set y'all have come across that has been
 successfully processed in production using Spark? Ballpark?



Re: Announcing the official Spark Job Server repo

2014-03-18 Thread Henry Saputra
W00t!

Thanks for releasing this, Evan.

- Henry

On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:
 Dear Spark developers,

 Ooyala is happy to announce that we have pushed our official, Spark
 0.9.0 / Scala 2.10-compatible, job server as a github repo:

 https://github.com/ooyala/spark-jobserver

 Complete with unit tests, deploy scripts, and examples.

 The original PR (#222) on incubator-spark is now closed.

 Please have a look; pull requests are very welcome.
 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |