Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-04 Thread Kan Zhang
+1

Compiled, ran newly-introduced PySpark Hadoop input/output examples.


On Thu, Sep 4, 2014 at 1:10 PM, Egor Pahomov  wrote:

> +1
>
> Compiled, ran a simple job on yarn-hadoop-2.3.
>
>
> 2014-09-04 22:22 GMT+04:00 Henry Saputra :
>
> > LICENSE and NOTICE files are good
> > Hash files are good
> > Signature files are good
> > No 3rd-party executables
> > Source compiled
> > Ran local and standalone tests
> > Tested persist off-heap with Tachyon; looks good
> >
> > +1
> >
> > - Henry
> >
> > On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell 
> > wrote:
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > 1.1.0!
> > >
> > > The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc4/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1031/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 1.1.0!
> > >
> > > The vote is open until Saturday, September 06, at 08:30 UTC and passes
> if
> > > a majority of at least 3 +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 1.1.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > >
> > > == Regressions fixed since RC3 ==
> > > SPARK-3332 - Issue with tagging in EC2 scripts
> > > SPARK-3358 - Issue with regression for m3.XX instances
> > >
> > > == What justifies a -1 vote for this release? ==
> > > This vote is happening very late into the QA period compared with
> > > previous votes, so -1 votes should only occur for significant
> > > regressions from 1.0.2. Bugs already present in 1.0.X will not block
> > > this release.
> > >
> > > == What default changes should I be aware of? ==
> > > 1. The default value of "spark.io.compression.codec" is now "snappy"
> > > --> Old behavior can be restored by switching to "lzf"
> > >
> > > 2. PySpark now performs external spilling during aggregations.
> > > --> Old behavior can be restored by setting "spark.shuffle.spill" to
> > "false".
> > >
> > > 3. PySpark uses a new heuristic for determining the parallelism of
> > > shuffle operations.
> > > --> Old behavior can be restored by setting
> > > "spark.default.parallelism" to the number of cores in the cluster.
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>
>
> --
>
>
>
> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>
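
For reference, the pre-1.1.0 behaviors listed under the default changes above
can also be set programmatically through SparkConf before the context is
created. A minimal Scala sketch, assuming a standard Spark setup (the master
and parallelism values below are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Restore the pre-1.1.0 defaults named in the release notes above.
val conf = new SparkConf()
  .setMaster("local[*]")                      // placeholder master
  .setAppName("pre-1.1-defaults")
  .set("spark.io.compression.codec", "lzf")   // 1.1.0 changes the default to "snappy"
  .set("spark.shuffle.spill", "false")        // disable PySpark external spilling
  .set("spark.default.parallelism", "8")      // e.g. the number of cores in the cluster
val sc = new SparkContext(conf)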


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Kan Zhang
+1

Verified PySpark InputFormat/OutputFormat examples.


On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin  wrote:

> +1
>
>
> On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian  wrote:
>
> > +1
> >
> >- Tested Thrift server and SQL CLI locally on OSX 10.9.
> >- Checked datanucleus dependencies in distribution tarball built by
> >make-distribution.sh without SPARK_HIVE defined.
> >
> >
> >
> >
> > On Tue, Sep 2, 2014 at 2:30 PM, Will Benton  wrote:
> >
> > > +1
> > >
> > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle
> > JDK
> > > 8).
> > >
> > >
> > > best,
> > > wb
> > >
> > >
> > > - Original Message -
> > > > From: "Patrick Wendell" 
> > > > To: dev@spark.apache.org
> > > > Sent: Saturday, August 30, 2014 5:07:52 PM
> > > > Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
> > > >
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > > 1.1.0!
> > > >
> > > > The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1030/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 1.1.0!
> > > >
> > > > The vote is open until Tuesday, September 02, at 23:07 UTC and passes
> > if
> > > > a majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 1.1.0
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > >
> > > > == Regressions fixed since RC1 ==
> > > > - Build issue for SQL support:
> > > > https://issues.apache.org/jira/browse/SPARK-3234
> > > > - EC2 script version bump to 1.1.0.
> > > >
> > > > == What justifies a -1 vote for this release? ==
> > > > This vote is happening very late into the QA period compared with
> > > > previous votes, so -1 votes should only occur for significant
> > > > regressions from 1.0.2. Bugs already present in 1.0.X will not block
> > > > this release.
> > > >
> > > > == What default changes should I be aware of? ==
> > > > 1. The default value of "spark.io.compression.codec" is now "snappy"
> > > > --> Old behavior can be restored by switching to "lzf"
> > > >
> > > > 2. PySpark now performs external spilling during aggregations.
> > > > --> Old behavior can be restored by setting "spark.shuffle.spill" to
> > > "false".
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> > >
> >
>


Re: Markdown viewer for the docs

2014-08-18 Thread Kan Zhang
If you are willing to compile it, "The markdown code can be compiled to
HTML using the [Jekyll tool](http://jekyllrb.com)." More in docs/README.md.


On Mon, Aug 18, 2014 at 9:00 AM, Stephen Boesch  wrote:

> Which viewer is capable of displaying all of the content in the spark docs
> - including the (apparent) extensions?
>
> An example page:
> https://github.com/apache/spark/blob/master/docs/mllib-linear-methods.md
>
>
> Local MD viewers/editors that I have tried include mdcharm, retext, and
> haroopad: none of these handles the TOC, the math symbols, or proper
> formatting of the Scala code.
>
> Even when directly opening the md file from github.com in the browser, the
> same issues appear: no TOC, math, or proper code formatting. I have tried
> both FF and Chrome (on Ubuntu 12.04).
>
>
> Any tips from the creators/maintainers of these pages? Thanks!
>


Re: Calling Scala/Java methods which operates on RDD

2014-07-11 Thread Kan Zhang
Hi Jai,

Your suspicion is correct. In general, Python RDDs are pickled into byte
arrays and stored in Java land as RDDs of byte arrays. union/zip operates
on byte arrays directly without deserializing. Currently, Python byte
arrays only get unpickled into Java objects in special cases, like SQL
functions or saving to Sequence Files (upcoming).

Hope it helps.

Kan
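
To make the mechanics concrete, here is a rough Scala sketch (a hypothetical
helper, not part of Spark's API). The RDD handed over from PySpark via
data._jrdd.rdd() arrives as an RDD of byte arrays, so a Scala-side function
has to accept it as such; only operations that treat the elements as opaque
blobs work without unpickling:

import org.apache.spark.rdd.RDD

object PySparkBridge {
  // union never inspects the pickled bytes, so it works as-is; anything
  // element-wise (e.g. x + 1 on Ints) would first require unpickling the
  // byte arrays back into JVM objects, which is what the special cases
  // mentioned above (SQL functions, sequence file output) do internally.
  def combine(a: RDD[Array[Byte]], b: RDD[Array[Byte]]): RDD[Array[Byte]] =
    a.union(b)
}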


On Fri, Jul 11, 2014 at 5:04 AM, Jai Kumar Singh 
wrote:

> Hi,
>   I want to write some common utility functions in Scala and call them
> from the Java/Python Spark API (maybe with some wrapper code around the
> Scala calls). Calling Scala functions from Java works fine. I was reading
> the pyspark rdd code and found that pyspark is able to call JavaRDD
> functions like union/zip to implement the same for pyspark RDDs,
> deserializing the output, and everything works fine. But somehow I am not
> able to get a really simple example working. I think I am missing some
> serialization/deserialization step.
>
> Can someone confirm whether it is even possible to do so? Or would it be
> much easier to pass RDD data files around instead of RDDs directly (from
> pyspark to java/scala)?
>
> For example, the code below just adds 1 to each element of an RDD of
> Integers.
>
> package flukebox.test
>
> import org.apache.spark.rdd.RDD  // needed for the RDD type
>
> object TestClass {
>
>   def testFunc(data: RDD[Int]) = {
>
>     data.map(x => x + 1)
>
>   }
>
> }
>
> Calling from python,
>
> from pyspark import RDD
>
> from py4j.java_gateway import java_import
>
> java_import(sc._gateway.jvm, "flukebox.test")
>
>
> data = sc.parallelize([1,2,3,4,5,6,7,8,9])
>
> sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())
>
>
> *This fails because testFunc gets an RDD of byte arrays rather than an RDD
> of Ints.*
>
>
> Any help/pointer would be highly appreciated.
>
>
> Thanks & Regards,
>
> Jai K Singh
>


Re: Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Kan Zhang
Same here please, username (kzhang). Thanks!


On Tue, Jun 3, 2014 at 11:39 AM, Henry Saputra 
wrote:

> Thanks Matei!
>
> - Henry
>
> On Tue, Jun 3, 2014 at 11:36 AM, Matei Zaharia 
> wrote:
> > Done. Looks like this was lost in the JIRA import.
> >
> > Matei
> >
> > On Jun 3, 2014, at 11:33 AM, Henry Saputra 
> wrote:
> >
> >> Hi,
> >>
> >> Could someone with right karma kindly add my username (hsaputra) to
> >> Spark's contributor list?
> >>
> >> I was added before but somehow now I can no longer assign ticket to
> >> myself nor update tickets I am working on.
> >>
> >>
> >> Thanks,
> >>
> >> - Henry
> >
>


Re: Why does spark REPL not embed scala REPL?

2014-05-30 Thread Kan Zhang
One reason is that the standard Scala REPL uses object-based wrappers, whose
static initializers are run on remote worker nodes; this may fail due to
differences between the driver and worker nodes. See the discussion here:
https://groups.google.com/d/msg/scala-internals/h27CFLoJXjE/JoobM6NiUMQJ
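
For illustration only, a small Spark-free Scala sketch of that failure mode
(all names here are made up, not the REPL's actual generated code). The
standard REPL wraps each input line in an object, and a serialized closure
that refers to a value from such a wrapper forces the wrapper's initializer
to run again in whatever JVM executes the closure, which on a worker node may
fail:

import java.io._

// Stand-in for a REPL line wrapped in an object; in the real REPL this
// initializer can reference driver-only state.
object LineWrapper {
  val hostname = java.net.InetAddress.getLocalHost.getHostName
  val x = 10
}

object WrapperDemo {
  def main(args: Array[String]): Unit = {
    // Scala anonymous functions are serializable; if this one were shipped
    // to another JVM and invoked there, touching LineWrapper would re-run
    // its initializer in that JVM.
    val f: Int => Int = i => i + LineWrapper.x

    val out = new ByteArrayOutputStream()
    new ObjectOutputStream(out).writeObject(f)
    val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
    val f2 = in.readObject().asInstanceOf[Int => Int]
    println(f2(1)) // prints 11; here both "driver" and "worker" are this JVM
  }
}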


On Fri, May 30, 2014 at 1:12 AM, Aniket  wrote:

> My apologies in advance if this is a dev mailing list topic. I am working
> on a small project to provide a web interface to the Spark REPL. The
> interface will allow people to use the Spark REPL and perform exploratory
> analysis on the data. I already have a Play application running that
> provides a web interface to the standard Scala REPL, and I am just looking
> to extend it to optionally include support for the Spark REPL. My initial
> idea was to include the Spark dependencies in the project, create a new
> instance of SparkContext, and bind it to a variable (let's say 'sc') using
> imain.bind("sc", sparkContext). While theoretically this may work, I am
> trying to understand why the Spark REPL takes a different path by creating
> its own SparkILoop, SparkIMain, etc. Can anyone help me understand why
> there was a need to provide custom versions of IMain, ILoop, etc. instead
> of embedding the standard Scala REPL and binding a SparkContext instance?
>
> Here is my analysis so far:
> 1. ExecutorClassLoader - I understand this is needed to load classes from
> HDFS. Perhaps this could have been plugged into the standard Scala REPL
> using settings.embeddedDefaults(classLoaderInstance). Also, it's not clear
> what ConstructorCleaner does.
>
> 2. SparkCommandLine & SparkRunnerSettings - Allow for providing an extra -i
> file argument to the REPL. Wouldn't the standard sourcepath have sufficed?
>
> 3. SparkExprTyper - The only difference between standard ExprTyper and
> SparkExprTyper is that repldbg is replaced with logDebug. Not sure if this
> was intentional/needed.
>
> 4. SparkILoop - Has a few deviations from the standard ILoop class, but
> these could have been managed by extending or wrapping the ILoop class or
> using settings. Not sure what triggered the need to copy the source code
> and make edits.
>
> 5. SparkILoopInit - Changes the welcome message and binds the Spark context
> in the interpreter. The welcome message could have been changed by
> extending ILoopInit.
>
> 6. SparkIMain - Contains quite a few changes around class
> loading/logging/etc., but I found it very hard to figure out whether
> extending IMain was an option and what exactly didn't or will not work with
> IMain.
>
> The rest of the classes seem very similar to their standard counterparts.
> I have a feeling the Spark REPL can be refactored to embed the standard
> Scala REPL. I know refactoring would not help the Spark project as such,
> but it would help people embed the Spark REPL in much the same way it's
> done with the standard Scala REPL. Thoughts?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-does-spark-REPL-not-embed-scala-REPL-tp6871.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
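
On the embedding idea at the top of the question above, a minimal sketch of
driving the standard Scala REPL programmatically (assumes a 2.10/2.11-era
scala-compiler on the classpath; the bind call mirrors the one mentioned in
the question and is left commented out because it needs a live SparkContext):

import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.IMain

object EmbeddedReplSketch {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    settings.usejavacp.value = true   // interpret against the JVM classpath
    val imain = new IMain(settings)

    // Hypothetically, a SparkContext could be exposed to interpreted code:
    // imain.bind("sc", "org.apache.spark.SparkContext", sc)

    imain.interpret("val xs = List(1, 2, 3).map(_ + 1)")
    imain.interpret("println(xs)")
  }
}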


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Kan Zhang
+1 on the running commentary here, non-binding of course :-)


On Sat, May 17, 2014 at 8:44 AM, Andrew Ash  wrote:

> +1 on the next release feeling more like a 0.10 than a 1.0
> On May 17, 2014 4:38 AM, "Mridul Muralidharan"  wrote:
>
> > I had echoed similar sentiments a while back when there was a discussion
> > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> > changes, add missing functionality, go through a hardening release before
> > 1.0
> >
> > But the community preferred a 1.0 :-)
> >
> > Regards,
> > Mridul
> >
> > On 17-May-2014 3:19 pm, "Sean Owen"  wrote:
> > >
> > > On this note, non-binding commentary:
> > >
> > > Releases happen in local minima of change, usually created by
> > > internally enforced code freeze. Spark is incredibly busy now due to
> > > external factors -- recently a TLP, recently discovered by a large new
> > > audience, ease of contribution enabled by Github. It's getting like
> > > the first year of mainstream battle-testing in a month. It's been very
> > > hard to freeze anything! I see a number of non-trivial issues being
> > > reported, and I don't think it has been possible to triage all of
> > > them, even.
> > >
> > > Given the high rate of change, my instinct would have been to release
> > > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > > significant issues will slow down.
> > >
> > > Version ain't nothing but a number, but if it has any meaning it's the
> > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > striving to maintain backwards-compatibility. That may end up being
> > > bent to fit in important changes that are going to be required in this
> > > continuing period of change. Hadoop does this all the time
> > > unfortunately and gets away with it, I suppose -- minor version
> > > releases are really major. (On the other extreme, HBase is at 0.98 and
> > > quite production-ready.)
> > >
> > > Just consider this a second vote for focus on fixes and 1.0.x rather
> > > than new features and 1.x. I think there are a few steps that could
> > > streamline triage of this flood of contributions, and make all of this
> > > easier, but that's for another thread.
> > >
> > >
> > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra  >
> > wrote:
> > > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > > identified, and many of them have fixes in progress.  I'd hate to see
> > those
> > > > efforts get lost in a post-1.0.0 flood of new features targeted at
> > 1.1.0 --
> > > > in other words, I'd like to see 1.0.1 retain a high priority relative
> > to
> > > > 1.1.0.
> > > >
> > > > Looking through the unresolved JIRAs, it doesn't look like any of the
> > > > identified bugs are show-stoppers or strictly regressions (although I
> > will
> > > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > > introduced with recent work -- it's not strictly a regression because
> > we
> > > > had equally bad but different behavior when the DAGScheduler
> exceptions
> > > > weren't previously being handled at all vs. being slightly
> mis-handled
> > > > now), so I'm not currently seeing a reason not to release.
> >
>