Re: A proposal for Spark 2.0

2015-11-25 Thread Sandy Ryza
I see.  My concern is / was that cluster operators will be reluctant to
upgrade to 2.0, meaning that developers using those clusters need to stay
on 1.x, and, if they want to move to DataFrames, essentially need to port
their app twice.

I misunderstood and thought part of the proposal was to drop support for
2.10 though.  If your broad point is that there aren't changes in 2.0 that
will make it less palatable to cluster administrators than releases in the
1.x line, then yes, 2.0 as the next release sounds fine to me.

-Sandy


On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> What are the other breaking changes in 2.0 though? Note that we're not
> removing Scala 2.10, we're just making the default build be against Scala
> 2.11 instead of 2.10. There seem to be very few changes that people would
> worry about. If people are going to update their apps, I think it's better
> to make the other small changes in 2.0 at the same time than to update once
> for Dataset and another time for 2.0.
>
> BTW just refer to Reynold's original post for the other proposed API
> changes.
>
> Matei
>
> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> I think that Kostas' logic still holds.  The majority of Spark users, and
> likely an even vaster majority of people running vaster jobs, are still on
> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
> to upgrade to the stable version of the Dataset / DataFrame API so they
> don't need to do so twice.  Requiring that they absorb all the other ways
> that Spark breaks compatibility in the move to 2.0 makes it much more
> difficult for them to make this transition.
>
> Using the same set of APIs also means that it will be easier to backport
> critical fixes to the 1.x line.
>
> It's not clear to me that avoiding breakage of an experimental API in the
> 1.x line outweighs these issues.
>
> -Sandy
>
> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I actually think the next one (after 1.6) should be Spark 2.0. The reason
>> is that I already know we have to break some part of the DataFrame/Dataset
>> API as part of the Dataset design. (e.g. DataFrame.map should return
>> Dataset rather than RDD). In that case, I'd rather break this sooner (in
>> one release) than later (in two releases), so the damage is smaller.
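
A hedged sketch of the signature change Reynold describes, written against the
Spark 1.x API (the table name and lambda below are hypothetical, not from the
proposal itself):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}

    // Spark 1.x: map on a DataFrame leaves the DataFrame/Dataset world entirely.
    def nameLengths(df: DataFrame): RDD[Int] =
      df.map((row: Row) => row.getString(0).length)   // returns RDD[Int] in 1.x

    // The proposed 2.0 behavior is for the same call to return Dataset[Int],
    // keeping the result inside the Dataset/DataFrame API -- a breaking change
    // to the method signature, which is why it is tied to a major release.
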
>>
>> I don't think whether we call Dataset/DataFrame experimental or not
>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>> then mark them as stable in 2.1. Despite being "experimental", there have
>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>
>>
>>
>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>> fixing.  We're on the same page now.
>>>
>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com>
>>> wrote:
>>>
>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>> APIs before we declare it stable. This is why I think it is important to
>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>
>>>> Kostas
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com
>>>> > wrote:
>>>>
>>>>> Why does stabilization of those two features require a 1.7 release
>>>>> instead of 1.6.1?
>>>>>
>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com
>>>>> > wrote:
>>>>>
>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>> can talk about RDD vs. DS/DF more but let's refocus on Spark 2.0. I'd like
>>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow 
>>>>>> us
>>>>>> to stabilize a few of the new features that were added in 1.6:
>>>>>>
>>>>>> 1) the experimental Datasets API
>>>>>> 2) the new unified memory manager.
>>>>>>
>

Re: A proposal for Spark 2.0

2015-11-24 Thread Sandy Ryza
I think that Kostas' logic still holds.  The majority of Spark users, and
likely an even vaster majority of people running vaster jobs, are still on
RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
to upgrade to the stable version of the Dataset / DataFrame API so they
don't need to do so twice.  Requiring that they absorb all the other ways
that Spark breaks compatibility in the move to 2.0 makes it much more
difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport
critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the
1.x line outweighs these issues.

-Sandy

On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin  wrote:

> I actually think the next one (after 1.6) should be Spark 2.0. The reason
> is that I already know we have to break some part of the DataFrame/Dataset
> API as part of the Dataset design. (e.g. DataFrame.map should return
> Dataset rather than RDD). In that case, I'd rather break this sooner (in
> one release) than later (in two releases), so the damage is smaller.
>
> I don't think whether we call Dataset/DataFrame experimental or not
> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
> then mark them as stable in 2.1. Despite being "experimental", there have
> been no breaking changes to DataFrame from 1.3 to 1.6.
>
>
>
> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
> wrote:
>
>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>> fixing.  We're on the same page now.
>>
>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
>> wrote:
>>
>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z
>>> releases. The Dataset API is experimental and so we might be changing the
>>> APIs before we declare it stable. This is why I think it is important to
>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>
>>> Kostas
>>>
>>>
>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra 
>>> wrote:
>>>
 Why does stabilization of those two features require a 1.7 release
 instead of 1.6.1?

 On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis 
 wrote:

> We have veered off the topic of Spark 2.0 a little bit here - yes we
> can talk about RDD vs. DS/DF more but let's refocus on Spark 2.0. I'd like
> to propose we have one more 1.x release after Spark 1.6. This will allow 
> us
> to stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager.
>
> I understand our goal for Spark 2.0 is to offer an easy transition but
> there will be users that won't be able to seamlessly upgrade given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This 
> might
> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis
>
>
>
> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao 
> wrote:
>
>> Agreed, more features/APIs/optimizations need to be added in DF/DS.
>>
>>
>>
>> I mean, we need to think about what kind of RDD APIs we have to
>> provide to developers; maybe the fundamental API is enough, like the
>> ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as
>> we can do the same thing easily with DF/DS, with even better performance.
>>
>>
>>
>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>> *Sent:* Friday, November 13, 2015 11:23 AM
>> *To:* Stephen Boesch
>>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Hmmm... to me, that seems like precisely the kind of thing that
>> argues for retaining the RDD API but not as the first thing presented to
>> new Spark developers: "Here's how to use groupBy with DataFrames... Until
>> the optimizer is more fully developed, that won't always get you the best
>> performance that can be obtained.  In these particular circumstances, ...,
>> you may want to use the low-level RDD API while setting
>> preservesPartitioning to true.  Like this..."
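
A minimal sketch of the low-level pattern Mark is referring to, assuming a
pair RDD that has already been partitioned by key (the data and the
transformation are hypothetical):

    import org.apache.spark.rdd.RDD

    // Transform values per partition while telling Spark the partitioning is
    // unchanged, so a later reduceByKey/join on the same keys can avoid
    // another shuffle.
    def scaleValues(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
      pairs.mapPartitions(
        iter => iter.map { case (k, v) => (k, v * 2.0) },  // keys are untouched
        preservesPartitioning = true                       // keep the existing partitioner
      )
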
>>
>>
>>
>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
>> wrote:
>>
>> My understanding is that the RDDs presently have more support for
>> complete control of partitioning, which is a key

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Sandy Ryza
To answer your fourth question from Cloudera's perspective, we would never
support a customer running Spark 2.0 on a Hadoop version < 2.6.

-Sandy

On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin  wrote:

> OK I'm not exactly asking for a vote here :)
>
> I don't think we should look at it from only maintenance point of view --
> because in that case the answer is clearly supporting as few versions as
> possible (or just rm -rf spark source code and call it a day). It is a
> tradeoff between the number of users impacted and the maintenance burden.
>
> So a few questions for those more familiar with Hadoop:
>
> 1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3?
>
> 2. If the answer to 1 is yes, are there known, major issues with backward
> compatibility?
>
> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
>
> 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below
> stop? To what extent do you care about running Spark on older Hadoop
> clusters.
>
>
>
> On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran 
> wrote:
>
>>
>> On 20 Nov 2015, at 14:28, ches...@alpinenow.com wrote:
>>
>> Assuming we have 1.6 and 1.7 releases, then Spark 2.0 is about 9 months
>> away.
>>
>> Customers will need to upgrade their Hadoop clusters to Apache Hadoop 2.6 or
>> later to leverage the new Spark 2.0 within a year. I think this is possible,
>> as the latest releases of CDH 5.x and HDP 2.x are both on Apache Hadoop 2.6.0
>> already. Companies will have enough time to upgrade their clusters.
>>
>> +1 for me as well
>>
>> Chester
>>
>>
>> Now, if you are looking that far ahead, the other big issue is "when to
>> retire Java 7 support?"
>>
>> That's a tough decision for all projects. Hadoop 3.x will be Java 8 only,
>> but nobody has committed the patch to the trunk codebase to force a Java 8
>> build; plus, most of *today's* Hadoop clusters are Java 7. But as you can't
>> even download a Java 7 JDK for the desktop from Oracle any more today, 2016
>> is a time to look at the language support and decide what the baseline
>> version is.
>>
>> Commentary from Twitter here - as they point out, it's not just the server
>> farm that matters, it's all the apps that talk to it:
>>
>>
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3ccab7mwte+kefcxsr6n46-ztcs19ed7cwc9vobtr1jqewdkye...@mail.gmail.com%3E
>>
>> -Steve
>>
>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Another +1 to Reynold's proposal.

Maybe this is obvious, but I'd like to advocate against a blanket removal
of deprecated / developer APIs.  Many APIs can likely be removed without
material impact (e.g. the SparkContext constructor that takes preferred
node location data), while others likely see heavier usage (e.g. I wouldn't
be surprised if mapPartitionsWithContext was baked into a number of apps)
and merit a little extra consideration.

Maybe also obvious, but I think a migration guide with API equivalents and
the like would be incredibly useful in easing the transition.
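
As one hedged example of the kind of entry such a guide could contain, the
deprecated mapPartitionsWithContext mentioned above can typically be replaced
with non-deprecated APIs (sketch only; the RDD contents are hypothetical):

    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD

    // Old (deprecated): rdd.mapPartitionsWithContext((ctx, iter) => ...)
    // Equivalent using mapPartitions plus TaskContext.get(), available since 1.2:
    def tagWithPartitionId(rdd: RDD[Int]): RDD[(Int, Int)] =
      rdd.mapPartitions { iter =>
        val pid = TaskContext.get().partitionId()  // same info the old context argument carried
        iter.map(x => (pid, x))
      }
    // mapPartitionsWithIndex((idx, iter) => ...) also works when only the
    // partition index is needed.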

-Sandy

On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:

> Echoing Shivaram here. I don't think it makes a lot of sense to add more
> features to the 1.x line. We should still do critical bug fixes though.
>
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> On a related note I think making it lightweight will ensure that we
>> stay on the current release schedule and don't unnecessarily delay 2.0
>> to wait for new features / big architectural changes.
>>
>> In terms of fixes to 1.x, I think our current policy of back-porting
>> fixes to older releases would still apply. I don't think developing
>> new features on both 1.x and 2.x makes a lot of sense as we would like
>> users to switch to 2.x.
>>
>> Shivaram
>>
>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>> wrote:
>> > +1 on a lightweight 2.0
>> >
>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>> If not
>> > terminated, how will we determine what goes into each major version
>> line?
>> > Will 1.x only be for stability fixes?
>> >
>> > Thanks,
>> > Kostas
>> >
>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>> wrote:
>> >>
>> >> I also feel the same as Reynold. I agree we should minimize API breaks
>> and
>> >> focus on fixing things around the edge that were mistakes (e.g.
>> exposing
>> >> Guava and Akka) rather than any overhaul that could fragment the
>> community.
>> >> Ideally a major release is a lightweight process we can do every
>> couple of
>> >> years, with minimal impact for users.
>> >>
>> >> - Patrick
>> >>
>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> >>  wrote:
>> >>>
>> >>> > For this reason, I would *not* propose doing major releases to break
>> >>> > substantial API's or perform large re-architecting that prevent
>> users from
>> >>> > upgrading. Spark has always had a culture of evolving architecture
>> >>> > incrementally and making changes - and I don't think we want to
>> change this
>> >>> > model.
>> >>>
>> >>> +1 for this. The Python community went through a lot of turmoil over
>> the
>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>> painful
>> >>> for too long. The Spark community will benefit greatly from our
>> explicitly
>> >>> looking to avoid a similar situation.
>> >>>
>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>> >>> > enormous assembly jar in order to run Spark.
>> >>>
>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> >>> distribution means.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>> wrote:
>> 
>>  I’m starting a new thread since the other one got intermixed with
>>  feature requests. Please refrain from making feature request in this
>> thread.
>>  Not that we shouldn’t be adding features, but we can always add
>> features in
>>  1.7, 2.1, 2.2, ...
>> 
>>  First - I want to propose a premise for how to think about Spark 2.0
>> and
>>  major releases in Spark, based on discussion with several members of
>> the
>>  community: a major release should be low overhead and minimally
>> disruptive
>>  to the Spark community. A major release should not be very different
>> from a
>>  minor release and should not be gated based on new features. The main
>>  purpose of a major release is an opportunity to fix things that are
>> broken
>>  in the current API and remove certain deprecated APIs (examples
>> follow).
>> 
>>  For this reason, I would *not* propose doing major releases to break
>>  substantial API's or perform large re-architecting that prevent
>> users from
>>  upgrading. Spark has always had a culture of evolving architecture
>>  incrementally and making changes - and I don't think we want to
>> change this
>>  model. In fact, we’ve released many architectural changes on the 1.X
>> line.
>> 
>>  If the community likes the above model, then to me it seems
>> reasonable
>>  to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>> immediately
>>  after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>> cadence of
>>  major releases every 2 years seems doable within the above 

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com>
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com>
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com>
>>> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with
>>> >>>> feature requests. Please refrain from making feature request in
>>> this thread.
>>> >>>> Not that we shouldn’t be add

Re: Info about Dataset

2015-11-03 Thread Sandy Ryza
Hi Justin,

The Dataset API proposal is available here:
https://issues.apache.org/jira/browse/SPARK-.

-Sandy
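
A minimal sketch of the kind of typed API the proposal describes, assuming the
Spark 1.6 preview of Dataset (the case class, data, and file path below are
hypothetical):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    def example(sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._
      val ds = Seq(Person("Ann", 34), Person("Bob", 17)).toDS()    // Dataset[Person]
      val adults = ds.filter(_.age >= 18)                          // typed lambda, no Row casting
      val fromDf = sqlContext.read.json("people.json").as[Person]  // DataFrame -> Dataset[Person]
    }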

On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang  wrote:

> Hi,
>
> I was looking through some of the PRs slated for 1.6.0 and I noted
> something called a Dataset, which looks like a new concept based off of the
> scaladoc for the class. Can anyone point me to some references/design_docs
> regarding the choice to introduce the new concept? I presume it is probably
> something to do with performance optimizations?
>
> Thanks!
>
> Justin
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding)
built from source and ran some jobs against YARN

-Sandy

On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan vaquar.k...@gmail.com wrote:


 +1 (1.5.0 RC2) Compiled on Windows with YARN.

 Regards,
 Vaquar khan
 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. joins,sql,set operations,udf OK

 Cheers
 k/

 On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc2:

 https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release (published as 1.5.0-rc2) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1141/

 The staging repository for this release (published as 1.5.0) can be found
 at:
 https://repository.apache.org/content/repositories/orgapachespark-1140/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


 ===
 How can I help test this release?
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.


 
 What justifies a -1 vote for this release?
 
 This vote is happening towards the end of the 1.5 QA period, so -1 votes
 should only occur for significant regressions from 1.4. Bugs already
 present in 1.4, minor regressions, or bugs related to new features will not
 block this release.


 ===
 What should happen to JIRA tickets still targeting 1.5.0?
 ===
 1. It is OK for documentation patches to target 1.5.0 and still go into
 branch-1.5, since documentations will be packaged separately from the
 release.
 2. New features for non-alpha-modules should target 1.6+.
 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
 version.


 ==
 Major changes to help you focus your testing
 ==

 As of today, Spark 1.5 contains more than 1000 commits from 220+
 contributors. I've curated a list of important changes for 1.5. For the
 complete list, please refer to Apache JIRA changelog.

 RDD/DataFrame/SQL APIs

 - New UDAF interface
 - DataFrame hints for broadcast join
 - expr function for turning a SQL expression into DataFrame column
 - Improved support for NaN values
 - StructType now supports ordering
 - TimestampType precision is reduced to 1us
 - 100 new built-in expressions, including date/time, string, math
 - memory and local disk only checkpointing

 DataFrame/SQL Backend Execution

 - Code generation on by default
 - Improved join, 

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
I see that there's an 1.5.0-rc2 tag in github now.  Is that the official
RC2 tag to start trying out?

-Sandy

On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:

 PS Shixiong Zhu is correct that this one has to be fixed:
 https://issues.apache.org/jira/browse/SPARK-10168

 For example you can see assemblies like this are nearly empty:

 https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/

 Just a publishing glitch but worth a few more eyes on.

 On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
  Signatures, license, etc. look good. I'm getting some fairly
  consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
  -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
  are likely just test problems, but worth asking. Stack traces are at
  the end.
 
  There are currently 79 issues targeted for 1.5.0, of which 19 are
  bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
  That's significantly better than at the last release. I presume a lot
  of what's still targeted is not critical and can now be
  untargeted/retargeted.
 
  It occurs to me that the flurry of planning that took place at the
  start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
  the kind of thing that would be even more useful at the start of a
  release cycle. So would be great to do this for 1.6 in a few weeks.
  Indeed there are already 267 issues targeted for 1.6.0 -- a decent
  roadmap already.
 
 
  Test failures:
 
  Core
 
  - Unpersisting TorrentBroadcast on executors and driver in distributed
  mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 2 executors before
  1 milliseconds elapsed
at
 org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
at
 org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
at org.apache.spark.broadcast.BroadcastSuite.org
 $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
...
 
  Streaming
 
  - stop slow receiver gracefully *** FAILED ***
0 was not greater than 0 (StreamingContextSuite.scala:324)
 
  Kafka
 
  - offset recovery *** FAILED ***
The code passed to eventually never returned normally. Attempted 191
  times over 10.043196973 seconds. Last failure message:
  strings.forall({
  ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
}) was false. (DirectKafkaStreamSuite.scala:249)
 
  On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
  1.5.0!
 
  The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.5.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see http://spark.apache.org/
 
 
  The tag to be voted on is v1.5.0-rc1:
 
 https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1137/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/
 
 
  ===
  == How can I help test this release? ==
  ===
  If you are a Spark user, you can help us test this release by taking an
  existing Spark workload and running on this release candidate, then
  reporting any regressions.
 
 
  
  == What justifies a -1 vote for this release? ==
  
  This vote is happening towards the end of the 1.5 QA period, so -1 votes
  should only occur for significant regressions from 1.4. Bugs already
 present
  in 1.4, minor regressions, or bugs related to new features will not
 block
  this 

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
Cool, thanks!

On Mon, Aug 24, 2015 at 2:07 PM, Reynold Xin r...@databricks.com wrote:

 Nope --- I cut that last Friday but had an error. I will remove it and cut
 a new one.


 On Mon, Aug 24, 2015 at 2:06 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 I see that there's an 1.5.0-rc2 tag in github now.  Is that the official
 RC2 tag to start trying out?

 -Sandy

 On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:

 PS Shixiong Zhu is correct that this one has to be fixed:
 https://issues.apache.org/jira/browse/SPARK-10168

 For example you can see assemblies like this are nearly empty:

 https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/

 Just a publishing glitch but worth a few more eyes on.

 On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
  Signatures, license, etc. look good. I'm getting some fairly
  consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
  -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
  are likely just test problems, but worth asking. Stack traces are at
  the end.
 
  There are currently 79 issues targeted for 1.5.0, of which 19 are
  bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
  That's significantly better than at the last release. I presume a lot
  of what's still targeted is not critical and can now be
  untargeted/retargeted.
 
  It occurs to me that the flurry of planning that took place at the
  start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
  the kind of thing that would be even more useful at the start of a
  release cycle. So would be great to do this for 1.6 in a few weeks.
  Indeed there are already 267 issues targeted for 1.6.0 -- a decent
  roadmap already.
 
 
  Test failures:
 
  Core
 
  - Unpersisting TorrentBroadcast on executors and driver in distributed
  mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 2 executors before
  1 milliseconds elapsed
at
 org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
at
 org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
at org.apache.spark.broadcast.BroadcastSuite.org
 $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
...
 
  Streaming
 
  - stop slow receiver gracefully *** FAILED ***
0 was not greater than 0 (StreamingContextSuite.scala:324)
 
  Kafka
 
  - offset recovery *** FAILED ***
The code passed to eventually never returned normally. Attempted 191
  times over 10.043196973 seconds. Last failure message:
  strings.forall({
   ((elem: Any) =>
  DirectKafkaStreamSuite.collectedData.contains(elem))
}) was false. (DirectKafkaStreamSuite.scala:249)
 
  On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version
  1.5.0!
 
  The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes
 if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.5.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see http://spark.apache.org/
 
 
  The tag to be voted on is v1.5.0-rc1:
 
 https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552
 
  The release files, including signatures, digests, etc. can be found
 at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1137/
 
  The documentation corresponding to this release can be found at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/
 
 
  ===
  == How can I help test this release? ==
  ===
  If you are a Spark user, you can help us test this release by taking
 an
  existing Spark workload and running on this release candidate, then
  reporting any regressions.
 
 
  
  == What justifies a -1 vote for this release

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read:

  val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
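
Putting the corrected first line together with the expansion step quoted below,
a self-contained sketch of the round trip, with the pair order made explicit
(the element type is hypothetical):

    import org.apache.spark.rdd.RDD

    // ("Spark", "Spark", "Spark") becomes ("Spark", 3)
    def compact(rdd: RDD[String]): RDD[(String, Int)] =
      rdd.map((_, 1)).reduceByKey(_ + _)

    // ("Spark", 3) becomes "Spark", "Spark", "Spark" again
    def expand(grouped: RDD[(String, Int)]): RDD[String] =
      grouped.flatMap { case (elem, count) => List.fill(count)(elem) }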

On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 This functionality already basically exists in Spark.  To create the
 grouped RDD, one can run:

   val groupedRdd = rdd.reduceByKey(_ + _)

 To get it back into the original form:

   groupedRdd.flatMap(x => List.fill(x._2)(x._1))

 -Sandy

 -Sandy

 On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com
 wrote:

 Hi,

 I am looking for a suitable issue for a Master's Degree project (it sounds like
 scalability problems and improvements for Spark Streaming), and it seems like
 the introduction of a grouped RDD (for example: don't store
 "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:

 1. Reduce the memory needed for the RDD (roughly, used memory will be: % of
 unique messages)
 2. Improve performance (no need to apply a function several times to the
 same message).

 Can I create a ticket and introduce an API for grouped RDDs? Does it make sense?
 I would also very much appreciate criticism and ideas





Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
The user gets to choose what they want to reside in memory.  If they call
rdd.cache() on the original RDD, it will be in memory.  If they call
rdd.cache() on the compact RDD, it will be in memory.  If cache() is called
on both, they'll both be in memory.

-Sandy

On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман sergliho...@gmail.com
wrote:

  Thanks for the answer! Could you please answer one more question? Will we
  have both the original RDD and the grouped RDD in memory at the same time?

 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com:

 Edit: the first line should read:

   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

 On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 This functionality already basically exists in Spark.  To create the
 grouped RDD, one can run:

   val groupedRdd = rdd.reduceByKey(_ + _)

 To get it back into the original form:

    groupedRdd.flatMap(x => List.fill(x._2)(x._1))

 -Sandy

 -Sandy

 On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com
 wrote:

 Hi,

  I am looking for a suitable issue for a Master's Degree project (it sounds
  like scalability problems and improvements for Spark Streaming), and it seems
  like the introduction of a grouped RDD (for example: don't store
  "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:

  1. Reduce the memory needed for the RDD (roughly, used memory will be: % of
  unique messages)
  2. Improve performance (no need to apply a function several times to the
  same message).

  Can I create a ticket and introduce an API for grouped RDDs? Does it make
  sense? I would also very much appreciate criticism and ideas







Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark.  To create the
grouped RDD, one can run:

  val groupedRdd = rdd.reduceByKey(_ + _)

To get it back into the original form:

  groupedRdd.flatMap(x => List.fill(x._2)(x._1))

-Sandy

-Sandy

On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com
wrote:

 Hi,

  I am looking for a suitable issue for a Master's Degree project (it sounds like
  scalability problems and improvements for Spark Streaming), and it seems like
  the introduction of a grouped RDD (for example: don't store
  "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:

  1. Reduce the memory needed for the RDD (roughly, used memory will be: % of
  unique messages)
  2. Improve performance (no need to apply a function several times to the
  same message).

  Can I create a ticket and introduce an API for grouped RDDs? Does it make sense?
  I would also very much appreciate criticism and ideas



Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
In the Spark model, constructing an RDD does not mean storing all its
contents in memory.  Rather, an RDD is a description of a dataset that
enables iterating over its contents, record by record (in parallel).  The
only time the full contents of an RDD are stored in memory is when a user
explicitly calls cache or persist on it.

-Sandy
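
A small sketch of that point (paths and filters are hypothetical): the
transformations below only describe the dataset, and nothing is kept in memory
unless cache()/persist() is called before an action runs.

    import org.apache.spark.SparkContext

    def example(sc: SparkContext): Unit = {
      val lines  = sc.textFile("hdfs:///logs/input")   // a description, nothing read yet
      val errors = lines.filter(_.contains("ERROR"))   // still just a description
      errors.cache()                                   // ask Spark to keep it after first use
      val total    = errors.count()                    // action: data flows, partitions cached
      val timeouts = errors.filter(_.contains("timeout")).count()  // reuses the cached partitions
    }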

On Sun, Jul 19, 2015 at 11:41 AM, Сергей Лихоман sergliho...@gmail.com
wrote:

  Sorry, maybe I am saying something completely wrong... We have a stream, and
  we digitize it to create an RDD. The RDD in this case will be just an array of
  Any. Then we apply a transformation to create a new grouped RDD, and GC should
  remove the original RDD from memory (if we don't persist it). Will we have a
  GC step in  val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) ? My
  suggestion is to remove the creation and reclaiming of the unneeded RDD and
  create the already-grouped one directly

 2015-07-19 21:26 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com:

 The user gets to choose what they want to reside in memory.  If they call
 rdd.cache() on the original RDD, it will be in memory.  If they call
 rdd.cache() on the compact RDD, it will be in memory.  If cache() is called
 on both, they'll both be in memory.

 -Sandy

 On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман sergliho...@gmail.com
 wrote:

  Thanks for the answer! Could you please answer one more question? Will
  we have both the original RDD and the grouped RDD in memory at the same time?

 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com:

 Edit: the first line should read:

   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

 On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 This functionality already basically exists in Spark.  To create the
 grouped RDD, one can run:

   val groupedRdd = rdd.reduceByKey(_ + _)

 To get it back into the original form:

    groupedRdd.flatMap(x => List.fill(x._2)(x._1))

 -Sandy

 -Sandy

 On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман 
 sergliho...@gmail.com wrote:

 Hi,

  I am looking for a suitable issue for a Master's Degree project (it sounds
  like scalability problems and improvements for Spark Streaming), and it seems
  like the introduction of a grouped RDD (for example: don't store
  "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:

  1. Reduce the memory needed for the RDD (roughly, used memory will be: % of
  unique messages)
  2. Improve performance (no need to apply a function several times to
  the same message).

  Can I create a ticket and introduce an API for grouped RDDs? Does it make
  sense? I would also very much appreciate criticism and ideas









Re: How to Read Excel file in Spark 1.4

2015-07-13 Thread Sandy Ryza
Hi Su,

Spark can't read Excel files directly.  Your best bet is probably to
export the contents as a CSV and use the csvFile API.

-Sandy
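
A hedged sketch of that route on Spark 1.4, assuming the workbook has been
exported to CSV and the external spark-csv package (com.databricks:spark-csv)
is on the classpath; the path and options below are hypothetical:

    import org.apache.spark.sql.SQLContext

    def loadExportedCsv(sqlContext: SQLContext) = {
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")        // first row holds column names
        .option("inferSchema", "true")   // attempt to detect column types
        .load("/data/exported_from_excel.csv")
      df.registerTempTable("excel_data") // e.g. to INSERT into a Hive table via SQL
      df
    }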

On Mon, Jul 13, 2015 at 9:22 AM, spark user spark_u...@yahoo.com.invalid
wrote:

 Hi

  I need your help to save Excel data in Hive.


    1. How to read an Excel file in Spark using Spark 1.4
    2. How to save it using a DataFrame

  If you have some sample code, please send it

 Thanks

 su



Re: External Shuffle service over yarn

2015-06-26 Thread Sandy Ryza
Hi Yash,

One of the main advantages is that, if you turn dynamic allocation on, and
executors are discarded, your application is still able to get at the
shuffle data that they wrote out.

-Sandy
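
A minimal sketch of the configuration this enables, assuming the shuffle
service auxiliary service is already installed on the YARN NodeManagers (see
SPARK-3797); the executor bounds below are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      .set("spark.shuffle.service.enabled", "true")    // shuffle files served by the NodeManager service
      .set("spark.dynamicAllocation.enabled", "true")  // executors can be released when idle
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")
    val sc = new SparkContext(conf)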

On Thu, Jun 25, 2015 at 11:08 PM, yash datta sau...@gmail.com wrote:

 Hi devs,

 Can someone point out if there are any distinct advantages of using
 external shuffle service over yarn (runs on node manager  as an auxiliary
 service

 https://issues.apache.org/jira/browse/SPARK-3797)  instead of the default
 execution in the executor containers ?

 Please also mention if you have seen any differences having used both ways
 ?

 Thanks and Best Regards
 Yash

 --
 When events unfold with calm and ease
 When the winds that blow are merely breeze
 Learn from nature, from birds and bees
 Live your life in love, and let joy not cease.



Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Sandy Ryza
Hi Alexander,

There is currently no way to create an RDD with more partitions than its
parent RDD without causing a shuffle.

However, if the files are splittable, you can set the Hadoop configurations
that control split size to something smaller so that the HadoopRDD ends up
with more partitions.

-Sandy
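
A sketch of both knobs, assuming splittable input files (the paths and sizes
below are hypothetical):

    import org.apache.spark.SparkContext

    def readWithMorePartitions(sc: SparkContext) = {
      // 1) Ask for a minimum number of input partitions directly:
      val rdd1 = sc.textFile("hdfs:///data/big", minPartitions = 2000)

      // 2) Or lower the maximum split size (Hadoop 2.x property name) so the
      //    HadoopRDD produces more, smaller splits -- 64 MB here:
      sc.hadoopConfiguration.setLong(
        "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
      val rdd2 = sc.textFile("hdfs:///data/big")
      (rdd1, rdd2)
    }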

On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Hi,



  Is there a way to increase the number of partitions of an RDD without causing
  a shuffle? I've found JIRA issue
  https://issues.apache.org/jira/browse/SPARK-5997, however there is no
  implementation yet.



  Just in case: I am reading data from ~300 big binary files, which results
  in 300 partitions, and then I need to sort my RDD, but it crashes with an
  OutOfMemory exception. If I change the number of partitions to 2000, the sort
  works OK, but the repartition itself takes a lot of time due to the shuffle.



 Best regards, Alexander



Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome.

On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie jie.hu...@intel.com wrote:

  Hi All

 We are happy to announce Performance portal for Apache Spark
 http://01org.github.io/sparkscore/ !

  The Performance Portal for Apache Spark provides performance data on the
  Spark upstream to the community to help identify issues, better understand
  performance differentials between versions, and help Spark customers get
  across the finish line faster. The Performance Portal generates two
  reports: a regular (weekly) report and a release-based regression test report.
  We are currently using two benchmark suites, HiBench (
  http://github.com/intel-bigdata/HiBench) and Spark-perf (
  https://github.com/databricks/spark-perf ). We welcome and look forward
  to your suggestions and feedback. More information and details are provided
  below.
  About the Performance Portal for Apache Spark

 Our goal is to work with the Apache Spark community to further enhance the
 scalability and reliability of the Apache Spark. The data available on this
 site allows community members and potential Spark customers to closely
  track the performance trend of Apache Spark. Ultimately, we hope that this
  project will help the community fix performance issues quickly, thus
  providing better Apache Spark code to end customers. The current workloads
 used in the benchmarking include HiBench (a benchmark suite to evaluate big
 data framework like Hadoop MR, Spark from Intel) and Spark-perf (a
 performance testing framework for Apache Spark from Databricks). Additional
 benchmarks will be added as they become available
 Description
 --

  Each data point represents a workload's runtime as a percentage compared with
  the previous week. Different lines represent different workloads running in
  Spark yarn-client mode.
 Hardware
 --

 CPU type: Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz
 Memory: 128GB
 NIC: 10GbE
 Disk(s): 8 x 1TB SATA HDD
 Software
 --

  Java version: 1.8.0_25
 Hadoop version: 2.5.0-CDH5.3.2
 HiBench version: 4.0
 Spark on yarn-client mode
 Cluster
 --

 1 node for Master
 10 nodes for Slave
 Summary

 The lower percent the better performance.
  --

  Group        ww19     ww20     ww22     ww23     ww24     ww25
  HiBench      9.1%     6.6%     6.0%     7.9%    -6.5%    -3.1%
  spark-perf   4.1%     4.4%    -1.8%     4.1%    -4.7%    -4.6%


  Y-Axis: normalized completion time; X-Axis: work week.

  The commit number can be found in the result table. The performance
  score for each workload is normalized based on the elapsed time for the 1.2
  release. The lower, the better.
 HiBench
 --

  JOB           ww19       ww20       ww22       ww23       ww24       ww25
  commit        489700c8   8e3822a0   530efe3e   90c60692   db81b9d8   4eb48ed1
  sleep         null       null       -2.1%      -2.9%      -4.1%      12.8%
  wordcount     17.6%      11.4%      8.0%       8.3%       -18.6%     -10.9%
  kmeans        92.1%      61.5%      72.1%      92.9%      86.9%      95.8%
  scan          -4.9%      -7.2%      null       -1.1%      -25.5%     -21.0%
  bayes         -24.3%     -20.1%     -18.3%     -11.1%     -29.7%     -31.3%
  aggregation   5.6%       10.5%      null       9.2%       -15.3%     -15.0%
  join          4.5%       1.2%       null       1.0%       -12.7%     -13.9%
  sort          -3.3%      -0.5%      -11.9%     -12.5%     -17.5%     -17.3%
  pagerank      2.2%       3.2%       4.0%       2.9%       -11.4%     -13.0%
  terasort      -7.1%      -0.2%      -9.5%      -7.3%      -16.7%     -17.0%

 Comments: null means no such workload running or workload failed in this
 time.


  Y-Axis: normalized completion time; X-Axis: work week.

  The commit number can be found in the result table. The performance
  score for each workload is normalized based on the elapsed time for the 1.2
  release. The lower, the better.
 spark-perf
 --

  JOB            ww19       ww20       ww22       ww23       ww24       ww25
  commit         489700c8   8e3822a0   530efe3e   90c60692   db81b9d8   4eb48ed1
  agg            13.2%      7.0%       null       18.3%      5.2%       2.5%
  agg-int        16.4%      21.2%      null       9.6%       4.0%       8.2%
  agg-naive      4.3%       -2.4%      null       -0.8%      -6.7%      -6.8%
  scheduling     -6.1%      -8.9%      -14.5%     -2.1%      -6.4%      -6.5%
  count-filter   4.1%       1.0%       6.6%       6.8%       -10.2%     -10.4%
  count          4.8%       4.6%       6.7%       8.0%       -7.3%      -7.0%
  sort           -8.1%      -2.5%      -6.2%      -7.0%      -14.6%     -14.4%
  sort-int       4.5%       15.3%      -1.6%      -0.1%      -1.5%      -2.2%

 Comments: null means no such workload running or workload failed in this
 time.


  Y-Axis: normalized completion time; X-Axis: work week.

  The commit number can be found in the result table. The performance
  score for each workload is normalized based on the elapsed time for the 1.2
  release. The lower, the better.
  Release
 Summary

 The lower percent the better performance.
  --

  Group        1.2.1    1.3.0    1.3.1    1.4.0
  HiBench      -1.0%    10.5%    8.4%     8.6%
  spark-perf   3.2%

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Sandy Ryza
+1 (non-binding)

Built from source and ran some jobs against a pseudo-distributed YARN
cluster.

-Sandy

On Fri, Jun 5, 2015 at 11:05 AM, Ram Sriharsha sriharsha@gmail.com
wrote:

 +1 , tested  with hadoop 2.6/ yarn on centos 6.5 after building  w/ -Pyarn
 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver and ran a
 few SQL tests and the ML examples

 On Fri, Jun 5, 2015 at 10:55 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 +1. Build looks good, ran a couple apps on YARN


 Thanks,
 Hari

 On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai yh...@databricks.com wrote:

 Sean,

 Can you add -Phive -Phive-thriftserver and try those Hive tests?

 Thanks,

 Yin

 On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote:

 Everything checks out again, and the tests pass for me on Ubuntu +
 Java 7 with '-Pyarn -Phadoop-2.6', except that I always get
 SparkSubmitSuite errors like ...

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   ...

 I also can't get hive tests to pass. Is anyone else seeing anything
 like this? if not I'll assume this is something specific to the env --
 or that I don't have the build invocation just right. It's puzzling
 since it's so consistent, but I presume others' tests pass and Jenkins
 does.


 On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  22596c534a38cfdda91aef18aa9037ab101e4251
 
  The release files, including signatures, digests, etc. can be found
 at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
 
 https://repository.apache.org/content/repositories/orgapachespark-/
  [published as version: 1.4.0-rc4]
 
 https://repository.apache.org/content/repositories/orgapachespark-1112/
 
  The documentation corresponding to this release can be found at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Saturday, June 06, at 05:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC3 ==
  In addition to many smaller fixes, three blocker issues were fixed:
  4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
  metadataHive get constructed too early
  6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
  78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be
 singleton
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-31 Thread Sandy Ryza
+1 (non-binding)

Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and
ran some jobs.

-Sandy

On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote:

 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -DskipTests
 2. Tested pyspark, mllib - running them as well as comparing results with 1.3.1
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK

 Cheers
 k/

 On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.0!

 The tag to be voted on is v1.4.0-rc3 (commit dd109a8):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-1109/
 [published as version: 1.4.0-rc3]
 https://repository.apache.org/content/repositories/orgapachespark-1110/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Tuesday, June 02, at 00:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What has changed since RC1 ==
 Below is a list of bug fixes that went into this RC:
 http://s.apache.org/vN

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long.  It's true that
it's configurable, but I think we need to provide a decent out-of-the-box
experience.  For comparison, the MapReduce equivalent is 1 second.

I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.

-Sandy

On Mon, May 11, 2015 at 9:03 AM, Mridul Muralidharan mri...@gmail.com
wrote:

 For tiny/small clusters (particularly single-tenant), you can set it to a
 lower value.
 But for anything reasonably large or multi-tenant, the request storm
 can be bad if a large enough number of applications starts aggressively
 polling the RM.
 That is why the interval is configurable.

 - Mridul


 On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:
  Isn't this issue something that should be improved? Based on the
 following
  discussion, there are two places where YARN's heartbeat interval is
  respected on job start-up, but do we really need to respect it on
 start-up?
 
  On Fri, May 8, 2015 at 12:14 PM Taeyun Kim taeyun@innowireless.com
  wrote:
 
  I think so.
 
  In fact, the flow is: allocator.allocateResources() -> sleep ->
  allocator.allocateResources() -> sleep …
 
  But I guess that on the first allocateResources() the allocation is not
  fulfilled. So sleep occurs.
 
 
 
  *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
  *Sent:* Friday, May 08, 2015 4:25 PM
 
 
  *To:* Taeyun Kim; u...@spark.apache.org
  *Subject:* Re: YARN mode startup takes too long (10+ secs)
 
 
 
  So does this sleep occur before allocating resources for the first few
  executors to start the job?
 
 
 
  On Fri, May 8, 2015 at 6:23 AM Taeyun Kim taeyun@innowireless.com
  wrote:
 
  I think I’ve found the (maybe partial, but major) reason.
 
 
 
  It’s between the following lines (it’s newly captured, but essentially
  the same place that Zoltán Zvara picked):
 
 
 
  15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
 
  15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor:
  Actor[akka.tcp://sparkExecutor@cluster04
 :55237/user/Executor#-149550753]
  with ID 1
 
 
 
  When I read the logs on cluster side, the following lines were found:
 (the
  exact time is different from the above line, but it’s the difference between
  machines)
 
 
 
  15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter
  thread - sleep time : 5000
 
  15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for :
  cluster04:45454
 
 
 
  It seemed that Spark deliberately sleeps 5 secs.
 
  I’ve read the Spark source code, and in
  org.apache.spark.deploy.yarn.ApplicationMaster.scala,
 launchReporterThread()
  had the code for that.
 
  It loops calling allocator.allocateResources() and Thread.sleep().
 
  For sleep, it reads the configuration variable
  spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000,
  which is 5 secs).
 
  According to the comment, “we want to be reasonably responsive without
  causing too many requests to RM”.
 
  So, unless YARN immediately fulfills the allocation request, it seems
 that
  5 secs will be wasted.
 
 
 
  When I modified the configuration variable to 1000, it only waited for 1
  sec.
 
 
 
  Here is the log lines after the change:
 
 
 
  15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter
  thread - sleep time : 1000
 
  15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for :
  cluster04:45454
 
 
 
  4 secs saved.
 
  So, when one does not want to wait 5 secs, one can change the
  spark.yarn.scheduler.heartbeat.interval-ms.
 
  I hope that the additional overhead it incurs would be negligible.
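 
  A minimal sketch (not from the original mail) of setting the interval
  discussed above when building the SparkConf:
 
    import org.apache.spark.{SparkConf, SparkContext}
 
    val conf = new SparkConf()
      .setAppName("fast-startup-test")                            // hypothetical app name
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")  // default is 5000
    val sc = new SparkContext(conf)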
 
 
 
 
 
  *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
  *Sent:* Thursday, May 07, 2015 10:05 PM
  *To:* Taeyun Kim; u...@spark.apache.org
  *Subject:* Re: YARN mode startup takes too long (10+ secs)
 
 
 
  Without considering everything, just a few hints:
 
  You are running on YARN. From 09:18:34 to 09:18:37 your application is
 in
  state ACCEPTED. There is a noticeable overhead introduced due to
  communicating with YARN's ResourceManager, NodeManager and given that
 the
  YARN scheduler needs time to make a decision. I guess somewhere
  from 09:18:38 to 09:18:43 your application JAR gets copied to another
  container requested by the Spark ApplicationMaster deployed on YARN's
  container 0. Deploying an executor needs further resource negotiations
 with
  the ResourceManager usually. Also, as I said, your JAR and Executor's
 code
  requires copying to the container's local directory - execution blocked
  until that is complete.
 
 
 
  On Thu, May 7, 2015 at 3:09 AM Taeyun Kim taeyun@innowireless.com
  wrote:
 
  Hi,
 
 
 
  I’m running a spark application with YARN-client or YARN-cluster mode.
 
  But it seems to take too long to start up.
 
  It takes 10+ seconds to initialize the spark context.
 
  Is this normal? Or can it be optimized?
 
 
 
  The environment is as follows:
 
  - Hadoop: 

Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle,

Registering the class makes it so that writeClass only writes out a couple
bytes, instead of a full String of the class name.
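
A minimal sketch of that registration (MyRecord is a hypothetical user class,
not something from this thread):

  import org.apache.spark.SparkConf

  case class MyRecord(id: Long, name: String)

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyRecord]))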

-Sandy

On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva 
twinkle.sachd...@gmail.com wrote:

 Hi,

 As per the code, KryoSerialization uses the writeClassAndObject method, which
 internally calls the writeClass method, which writes the class of the
 object during serialization.

 As per the documentation on the tuning page of Spark, registering
 the class will avoid that.

 Am I missing something, or is there some issue with the documentation?

 Thanks,
 Twinkle



Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
Spark currently doesn't allocate any memory off of the heap for shuffle
objects.  When the in-memory data gets too large, it will write it out to a
file, and then merge the spilled files later.

What exactly do you mean by store shuffle data in HDFS?

-Sandy

On Tue, Apr 14, 2015 at 10:15 AM, Kannan Rajah kra...@maprtech.com wrote:

 Sandy,
 Can you clarify how it won't cause OOM? Is it in any way related to memory
 being allocated outside the heap - native space? The reason I ask is that I
 have a use case to store shuffle data in HDFS. Since there is no notion of
 memory mapped files, I need to store it as a byte buffer. I want to make
 sure this will not cause OOM when the file size is large.


 --
 Kannan

 On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Kannan,

 Both in MapReduce and Spark, the amount of shuffle data a task produces
 can exceed the task's memory without risk of OOM.

 -Sandy

 On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid iras...@cloudera.com
 wrote:

 That limit doesn't have anything to do with the amount of available
 memory.  It's just a tuning parameter, as one version is more efficient
 for
 smaller files, the other is better for bigger files.  I suppose the
 comment
 is a little better in FileSegmentManagedBuffer:


 https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62

 On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah kra...@maprtech.com
 wrote:

  DiskStore.getBytes uses memory mapped files if the length is more than
 a
  configured limit. This code path is used during map side shuffle in
  ExternalSorter. I want to know if it's possible for the length to
 exceed the
  limit in the case of shuffle. The reason I ask is in the case of
 Hadoop,
  each map task is supposed to produce only data that can fit within the
  task's configured max memory. Otherwise it will result in OOM. Is the
  behavior the same in Spark, or can the size of data generated by a map task
  exceed what can fit in memory?
 
    if (length < minMemoryMapBytes) {
      val buf = ByteBuffer.allocate(length.toInt)
      ...
    } else {
      Some(channel.map(MapMode.READ_ONLY, offset, length))
    }
 
  --
  Kannan
 






Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
:
 
  Only catch there is it requires commit access to the repo. We need a
  way for people who aren't committers to write and collaborate (for
  point #1)
 
  On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:
  Sandy, doesn't keeping (in-progress) design docs in Git satisfy the
  history
  requirement? Referring back to my Gradle example, it seems that
 
 
 
 https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md
  is a really good way to see why the design doc evolved the way it
  did.
  When
  keeping the doc in Jira (presumably as an attachment) it's not easy
  to
  see
  what changed between successive versions of the doc.
 
  Punya
 
  On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza 
 sandy.r...@cloudera.com
  wrote:
 
  I think there are maybe two separate things we're talking about?
 
  1. Design discussions and in-progress design docs.
 
  My two cents are that JIRA is the best place for this.  It allows
  tracking
  the progression of a design across multiple PRs and
 contributors.  A
  piece
  of useful feedback that I've gotten in the past is to make design
  docs
  immutable.  When updating them in response to feedback, post a new
  version
  rather than editing the existing one.  This enables tracking the
  history of
  a design and makes it possible to read comments about previous
  designs
  in
  context.  Otherwise it's really difficult to understand why
  particular
  approaches were chosen or abandoned.
 
  2. Completed design docs for features that we've implemented.
 
  Perhaps less essential to project progress, but it would be really
  lovely
  to have a central repository to all the projects design doc.  If
  anyone
  wants to step up to maintain it, it would be cool to have a wiki
  page
  with
  links to all the final design docs posted on JIRA.
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Design docs: consolidation and discoverability

2015-04-24 Thread Sandy Ryza
I think there are maybe two separate things we're talking about?

1. Design discussions and in-progress design docs.

My two cents are that JIRA is the best place for this.  It allows tracking
the progression of a design across multiple PRs and contributors.  A piece
of useful feedback that I've gotten in the past is to make design docs
immutable.  When updating them in response to feedback, post a new version
rather than editing the existing one.  This enables tracking the history of
a design and makes it possible to read comments about previous designs in
context.  Otherwise it's really difficult to understand why particular
approaches were chosen or abandoned.

2. Completed design docs for features that we've implemented.

Perhaps less essential to project progress, but it would be really lovely
to have a central repository to all the projects design doc.  If anyone
wants to step up to maintain it, it would be cool to have a wiki page with
links to all the final design docs posted on JIRA.

-Sandy

On Fri, Apr 24, 2015 at 12:01 PM, Punyashloka Biswal punya.bis...@gmail.com
 wrote:

 The Gradle dev team keep their design documents  *checked into* their Git
 repository -- see

 https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md
 for example. The advantages I see to their approach are:

- design docs stay on ASF property (since Github is synced to the
Apache-run Git repository)
- design docs have a lifetime across PRs, but can still be modified and
commented on through the mechanism of PRs
- keeping a central location helps people to find good role models and
converge on conventions

 Sean, I find it hard to use the central Jira as a jumping-off point for
 understanding ongoing design work because a tiny fraction of the tickets
 actually relate to design docs, and it's not easy from the outside to
 figure out which ones are relevant.

 Punya

 On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote:

  I think it's OK to have design discussions on github, as emails go to
  ASF lists. After all, loads of PR discussions happen there. It's easy
  for anyone to follow.
 
  I also would rather just discuss on Github, except for all that noise.
 
  It's not great to put discussions in something like Google Docs
  actually; the resulting doc needs to be pasted back to JIRA promptly
  if so. I suppose it's still better than a private conversation or not
  talking at all, but the principle is that one should be able to access
  any substantive decision or conversation by being tuned in to only the
  project systems of record -- mailing list, JIRA.
 
 
 
  On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com
 wrote:
   I'd love to see more design discussions consolidated in a single place
 as
   well. That said, there are many practical challenges to overcome. Some
 of
   them are out of our control:
  
   1. For large features, it is fairly common to open a PR for discussion,
   close the PR taking some feedback into account, and reopen another one.
  You
   sort of lose the discussions that way.
  
   2. With the way Jenkins is setup currently, Jenkins testing introduces
 a
  lot
   of noise to GitHub pull requests, making it hard to differentiate
  legitimate
   comments from noise. This is unfortunately due to the fact that ASF
 won't
   allow our Jenkins bot to have API privilege to post messages.
  
   3. The Apache Way is that all development discussions need to happen on
  ASF
   property, i.e. dev lists and JIRA. As a result, technically we are not
   allowed to have development discussions on GitHub.
  
  
   On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org
  wrote:
  
   My 2 cents - I'd rather see design docs in github pull requests (using
   plain text / markdown).  That doesn't require changing access or
 adding
   people, and github PRs already allow for conversation / email
   notifications.
  
   Conversation is already split between jira and github PRs.  Having a
  third
   stream of conversation in Google Docs just leads to things being
  ignored.
  
   On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com
 wrote:
  
That would require giving wiki access to everyone or manually adding
people
any time they make a doc.
   
I don't see how this helps though. They're still docs on the
 internet
and
they're still linked from the central project JIRA, which is what
 you
should follow.
 On Apr 24, 2015 8:14 AM, Punyashloka Biswal 
  punya.bis...@gmail.com
wrote:
   
 Dear Spark devs,

 Right now, design docs are stored on Google docs and linked from
 tickets.
 For someone new to the project, it's hard to figure out what
  subjects
 are
 being discussed, what organization to follow for new feature
 proposals,
 etc.

 Would it make sense to consolidate future design docs in either a
 designated area on the Apache Confluence 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other
projects is their potential to coordinate and prevent duplicate work.  It's
really frustrating to put a lot of work into a patch and then find out that
someone has been doing the same.  It's helpful for the project etiquette to
include a way to signal to others that you are working or intend to work on
a patch.  Obviously there are limits to how long someone should be able to
hold on to a JIRA without making progress on it, but a signal is still
useful.  Historically, in other projects, the assignee field serves as this
signal.  If we don't want to use the assignee field for this, I think it's
important to have some alternative, even if it's just encouraging
contributors to comment I'm planning to work on this on JIRA.

-Sandy



On Wed, Apr 22, 2015 at 1:30 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 As a contributor, I've never felt shut out from the Spark community, nor
 have I seen any examples of territorial behavior. A few times I've
 expressed interest in more challenging work and the response I received
 was generally "go ahead and give it a shot, just understand that this is
 sensitive code so we may end up modifying the PR substantially." Honestly,
 that seems fine, and in general, I think it's completely fair to go with
 the PR model - e.g. If a JIRA has an open PR then it's an active effort,
 otherwise it's fair game unless otherwise stated. At the end of the day,
 it's about moving the project forward and the only way to do that is to
 have actual code in the pipes - speculation and intent don't really help,
 and there's nothing preventing an interested party from submitting a PR
 against an issue.

 Thank you,
 Ilya Ganelin






 On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

 Agreed.  The Spark project and community that Vinod describes do not
 resemble the ones with which I am familiar.
 
 On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hi Vinod,
 
  Thanks for you thoughts - However, I do not agree with your sentiment
  and implications. Spark is broadly quite an inclusive project and we
  spend a lot of effort culturally to help make newcomers feel welcome.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
  vino...@hortonworks.com wrote:
   Actually what this community got away with is pretty much an
  anti-pattern compared to every other Apache project I have seen. And
 may I
  say in a not so Apache way.
  
   Waiting for a committer to assign a patch to someone leaves it as a
  privilege to a committer. Not alluding to anything fishy in practice,
 but
  this also leaves a lot of open ground for self-interest. Committers
  defining notions of good fit / level of experience do not work; they are highly
  subjective and lead to group control.
  
   In terms of semantics, here is what most other projects (dare I say
  every Apache project?) that I have seen do
- A new contributor comes in who is not yet added to the JIRA
 project.
  He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
   -- Assigning a ticket to oneself is a signal to the rest of the
  community that he/she is actively working on the said patch.
   -- If multiple contributors want to work on the same patch, it
 needs
  to resolved amicably through open communication. On JIRA, or on mailing
  lists. Not by the whim of a committer.
- Common issues
   -- Land grabbing: Other contributors can nudge him/her in case of
  inactivity and take them over. Again, amicably instead of a committer
  making subjective decisions.
   -- Progress stalling: One contributor assigns the ticket to
  himself/herself and is actively debating, but with no real code/docs
  contribution or any real intention of making progress. Here
 workable,
  reviewable code for review usually wins.
  
   Assigning patches is not a privilege. Contributors at Apache are a
 bunch
  of volunteers, the PMC should let volunteers contribute as they see
 fit. We
  do not assign work at Apache.
  
   +Vinod
  
   On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
   One overarching issue is that it's pretty unclear what "Assigned to
   X" in JIRA means from a process perspective. Personally I actually
   feel it's better for this to be more historical - i.e. who ended up
   submitting a patch for this feature that was merged - rather than
   creating an exclusive reservation for a particular user to work on
   something.
  
   If an issue is assigned to person X, but some other person Y
 submits
   a great patch for it, I think we have some obligation to Spark users
   and to the community to merge the better patch. So the idea of
   reserving the right to add a feature, it just seems overall off to
 me.
   IMO, its fine if multiple people want to submit competing patches 

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Sandy Ryza
+1

Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed
YARN cluster.

-Sandy

On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote:

 Oh I see - ah okay I'm guessing it was a transient build error and
 I'll get it posted ASAP.

 On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote:
  Oh, it appears the 2.4 bits without hive are there but not the 2.4 bits
 with
  hive. Cool stuff on the 2.6.
  On Wed, Apr 8, 2015 at 12:30 Patrick Wendell pwend...@gmail.com wrote:
 
  Hey Denny,
 
  I believe the 2.4 bits are there. The 2.6 bits I had done specially
  (we haven't merge that into our upstream build script). I'll do it
  again now for RC2.
 
  - Patrick
 
  On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote:
   +1 Tested on a 4-node Mesos cluster with fine-grained and coarse-grained
   mode.
  
   Tim
  
   On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com
 wrote:
   The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that
 intended
   (they were included in RC1)?
  
  
   On Wed, Apr 8, 2015 at 9:01 AM Tom Graves
   tgraves...@yahoo.com.invalid
   wrote:
  
   +1. Tested spark on yarn against hadoop 2.6.
   Tom
  
  
On Wednesday, April 8, 2015 6:15 AM, Sean Owen
   so...@cloudera.com
   wrote:
  
  
Still a +1 from me; same result (except that now of course the
   UISeleniumSuite test does not fail)
  
   On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com
 
   wrote:
Please vote on releasing the following candidate as Apache Spark
version
   1.3.1!
   
The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
   7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
   
The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY
   
The release files, including signatures, digests, etc. can be
 found
at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/
   
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
   
The staging repository for this release can be found at:
   
   
 https://repository.apache.org/content/repositories/orgapachespark-1083/
   
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
   
The patches on top of RC1 are:
   
[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397
   
[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302
   
[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
NoClassDefFoundError
https://github.com/apache/spark/pull/4933
   
Please vote on releasing this package as Apache Spark 1.3.1!
   
The vote is open until Saturday, April 11, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
   
[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...
   
To learn more about Apache Spark, please see
http://spark.apache.org/
   
   
   
 -
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
  
  
 -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
  
  

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this.  However, I think at this point it
would be an incompatible behavioral change.  People often use count in
Spark to exercise their DAG.  Omitting processing steps that were
previously included would likely mislead many users into thinking their
pipeline was running faster.
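
A sketch of the pattern referred to above, where count() is used purely to
force the upstream map() to run (saveRecord is a hypothetical side-effecting
helper, not a Spark API):

  val processed = rdd.map { record =>
    saveRecord(record)   // the side effect is the real goal
    record
  }
  processed.count()      // forces every element through the map()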

It's possible there might be room for something like a new smartCount API
or a new argument to count that allows it to avoid unnecessary
transformations.

-Sandy

On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen so...@cloudera.com wrote:

 No, I'm not saying side effects change the count. But not executing
 the map() function at all certainly has an effect on the side effects
 of that function: the side effects which should take place never do. I
 am not sure that is something to be 'fixed'; it's a legitimate
 question.

 You can persist an RDD if you do not want to compute it twice.

 On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll jimfcarr...@gmail.com
 wrote:
  Hi Sean,
 
  Thanks for the response.
 
  I can't imagine a case (though my imagination may be somewhat limited)
 where
  even map side effects could change the number of elements in the
 resulting
  map.
 
  I guess count wouldn't officially be an 'action' if it were implemented
  this way. At least it wouldn't ALWAYS be one.
 
  My example was contrived. We're passing RDDs to functions. If that RDD
 is an
  instance of my class, then its count() may take a shortcut. If I
  map/zip/zipWithIndex/mapPartition/etc. first, then I'm stuck with a call
 that
  literally takes 100s to 1000s of times longer (seconds vs. hours on some
 of
  our datasets). And since my custom RDDs are immutable, they cache the count
  call, so a second invocation costs only a method call's overhead.
 
  I could fix this in Spark if there's any interest in that change.
 Otherwise
  I'll need to overload more RDD methods for my own purposes (like all of
 the
  transformations). Of course, that will be more difficult because those
  intermediate classes (like MappedRDD) are private, so I can't extend
 them.
 
  Jim
 
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.
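
A minimal sketch of that approach, assuming an existing SparkContext sc (the
path and split size are just placeholders):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val jobConf = new Configuration(sc.hadoopConfiguration)  // a copy, not a shared reference
  jobConf.set("mapred.min.split.size", "1073741824")       // applies only to RDDs built from jobConf
  val rdd = sc.newAPIHadoopFile("/some/path",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], jobConf)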

-Sandy

On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:

 Hi Nick,

 I don't remember the exact details of these scenarios, but I think the user
 wanted a lot more control over how the files got grouped into partitions,
 to group the files together by some arbitrary function.  I didn't think
 that was possible w/ CombineFileInputFormat, but maybe there is a way?

 thanks

 On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Imran, on your point about reading multiple files together in a partition, is
 it
  not simpler to use the approach of copying the Hadoop conf and setting per-RDD
  settings for min split to control the input size per partition, together
  with something like CombineFileInputFormat?
 
  On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
  wrote:
 
   I think this would be a great addition, I totally agree that you need
 to
  be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
 just
   subclass HadoopRDD yourself, or make a totally new RDD, and then you
  could
   expose whatever you need.  Why is this solution better?  IMO the
 criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think it's still worthwhile.  But
 it's
   also worth asking whether or not it's too difficult for a user to extend
   HadoopRDD right now.  There have been several cases in the past week
  where
   we've suggested that a user should read from hdfs themselves (eg., to
  read
   multiple files together in one partition) -- with*out* reusing the code
  in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need
  some
   refactoring to make that easier to do?  Or do we just need a good
  example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
see email below. reynold suggested i send it to dev instead of user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output formats
  used
in Spark. The conventions seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
 translated
   into
settings on the Hadoop Configuration object.
   
    for example for compression i see codec: Option[Class[_ <:
    CompressionCodec]] = None added to a bunch of methods.
   
how scalable is this solution really?
   
for example i need to read from a hadoop dataset and i dont want the
   input
(part) files to get split up. the way to do this is to set
mapred.min.split.size. now i dont want to set this at the level of
  the
SparkContext (which can be done), since i dont want it to apply to
  input
formats in general. i want it to apply to just this one specific
 input
dataset i need to read. which leaves me with no options currently. i
   could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile,
  SparkContext.objectFile,
etc.). but that seems ineffective.
   
why can we not expose a Map[String, String] or some other generic way
  to
manipulate settings for hadoop input/output formats? it would require
adding one more parameter to all methods to deal with hadoop
  input/output
formats, but after that its done. one parameter to rule them all
   
then i could do:
    val x = sc.textFile("/some/path", formatSettings =
      Map("mapred.min.split.size" -> "12345"))

    or

    rdd.saveAsTextFile("/some/path", formatSettings =
      Map("mapred.output.compress" -> "true",
          "mapred.output.compression.codec" -> "somecodec"))
   
  
 



Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
That's correct.  What's the reason this information is needed?

-Sandy

On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 Thank you for your response!

 I guess the (Spark) AM, who gives the container lease to the NM (along with
 the executor JAR and command to run), must know how much CPU or RAM that
 container is capped and isolated at. There must be a resource vector along with the
 encrypted container lease, if I'm right, that describes this. Or maybe is
 there a way for the ExecutorBackend to fetch this information directly from
 the environment? Then, the ExecutorBackend would be able to hand over this
 information to the actual Executor who creates the TaskRunner.

 Zvara Zoltán



 mail, hangout, skype: zoltan.zv...@gmail.com

 mobile, viber: +36203129543

 bank: 10918001-0021-50480008

 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

 elte: HSKSJZ (ZVZOAAI.ELTE)

 2015-03-24 16:30 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:

 Hi Zoltan,

 If running on YARN, the YARN NodeManager starts executors.  I don't think
 there's a 100% precise way for the Spark executor to know how many
 resources are allotted to it.  It can come close by looking at the Spark
 configuration options used to request it (spark.executor.memory and
 spark.yarn.executor.memoryOverhead), but it can't necessarily account for the
 amount that YARN has rounded up if those configuration properties
 (yarn.scheduler.minimum-allocation-mb and
 yarn.scheduler.increment-allocation-mb) are not present on the node.

 -Sandy

 -Sandy

 On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:

 Let's say I'm an Executor instance in a Spark system. Who started me and
 where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I
 suppose I'm the only Executor on a worker node for a given framework
 scheduler (driver). If I'm an Executor instance, who is the closest
 object
 to me who can tell me how many resources I have on (a) Mesos, (b)
 YARN?

 Thank you for your kind input!

 Zvara Zoltán



 mail, hangout, skype: zoltan.zv...@gmail.com

 mobile, viber: +36203129543

 bank: 10918001-0021-50480008

 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

 elte: HSKSJZ (ZVZOAAI.ELTE)






Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
Hi Zoltan,

If running on YARN, the YARN NodeManager starts executors.  I don't think
there's a 100% precise way for the Spark executor to know how many
resources are allotted to it.  It can come close by looking at the Spark
configuration options used to request it (spark.executor.memory and
spark.yarn.executor.memoryOverhead), but it can't necessarily account for the
amount that YARN has rounded up if those configuration properties
(yarn.scheduler.minimum-allocation-mb and
yarn.scheduler.increment-allocation-mb) are not present on the node.
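
A sketch of that "come close" approach (the fallback values below are
assumptions, not documented defaults):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
  val executorMemory = conf.get("spark.executor.memory", "1g")
  val overheadMb = conf.getInt("spark.yarn.executor.memoryOverhead", 384)
  val cores = conf.getInt("spark.executor.cores", 1)
  // the amount YARN rounds the request up to is not visible from here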

-Sandy

-Sandy

On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 Let's say I'm an Executor instance in a Spark system. Who started me and
 where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I
 suppose I'm the only Executor on a worker node for a given framework
 scheduler (driver). If I'm an Executor instance, who is the closest object
 to me who can tell me how many resources I have on (a) Mesos, (b) YARN?

 Thank you for your kind input!

 Zvara Zoltán



 mail, hangout, skype: zoltan.zv...@gmail.com

 mobile, viber: +36203129543

 bank: 10918001-0021-50480008

 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

 elte: HSKSJZ (ZVZOAAI.ELTE)



Re: Directly broadcasting (sort of) RDDs

2015-03-22 Thread Sandy Ryza
Hi Guillaume,

I've long thought something like this would be useful - i.e. the ability to
broadcast RDDs directly without first pulling data through the driver.  If
I understand correctly, your requirement to block a matrix up and only
fetch the needed parts could be implemented on top of this by splitting an
RDD into a set of smaller RDDs and then broadcasting each one on its own.
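
A rough sketch of what can be approximated today by splitting and broadcasting
each piece separately (a workaround, not the proposed feature; denseBlocks
stands in for the real block RDD type):

  import org.apache.spark.broadcast.Broadcast

  // denseBlocks: RDD[(Int, Array[Double])], one entry per block id
  val pieces: Map[Int, Broadcast[Array[Double]]] =
    denseBlocks.keys.distinct().collect().map { id =>
      id -> sc.broadcast(denseBlocks.lookup(id).head)
    }.toMap
  // a task then reads only pieces(blockId).value for the blocks it needs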

Unfortunately nobody is working on this currently (and I couldn't promise
to have bandwidth to review it at the moment either), but I suspect we'll
eventually need to add something like this for map joins in Hive on Spark
and Spark SQL.

-Sandy



On Sat, Mar 21, 2015 at 3:11 AM, Guillaume Pitel guillaume.pi...@exensa.com
 wrote:

  Hi,

 Thanks for your answer. This is precisely the use case I'm interested in,
 but I know it already, I should have mentionned it. Unfortunately this
 implementation of BlockMatrix has (in my opinion) some disadvantages (the
 fact that it split the matrix by range instead of using a modulo is bad for
 block skewness). Besides, and more importantly, as I was writing, it uses
 the join solution (actually a cogroup :
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala,
 line 361). The duplication of the elements of the dense matrix is thus
 dependent on the block size.

 Actually I'm wondering if what I want to achieve could be done with a
 simple modification to the join, allowing a partition to be weakly cached
 after being retrieved.

 Guillaume


  There is block matrix in Spark 1.3 - 
 http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix





 However I believe it only supports dense matrix blocks.




 Still, might be possible to use it or exetend




 JIRAs:

 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434





 Was based on

 https://github.com/amplab/ml-matrix





 Another lib:

 https://github.com/PasaLab/marlin/blob/master/README.md







 —
 Sent from Mailbox

 On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitelguillaume.pi...@exensa.com 
 guillaume.pi...@exensa.com wrote:


  Hi,
 I have an idea that I would like to discuss with the Spark devs. The
 idea comes from a very real problem that I have struggled with since
 almost a year. My problem is very simple, it's a dense matrix * sparse
 matrix  operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is
 divided in X large blocks (one block per partition), and a sparse matrix
 RDD[((Int,Int),Array[Array[(Int,Float)]]] , divided in X * Y blocks. The
 most efficient way to perform the operation is to collectAsMap() the
 dense matrix and broadcast it, then perform the block-local
 multiplications, and combine the results by column.
 This is quite fine, unless the matrix is too big to fit in memory
 (especially since the multiplication is performed several times
 iteratively, and the broadcasts are not always cleaned from memory as I
 would naively expect).
 When the dense matrix is too big, a second solution is to split the big
 sparse matrix in several RDD, and do several broadcasts. Doing this
 creates quite a big overhead, but it mostly works, even though I often
 face some problems with unaccessible broadcast files, for instance.
 Then there is the terrible but apparently very effective good old join.
 Since X blocks of the sparse matrix use the same block from the dense
 matrix, I suspect that the dense matrix is somehow replicated X times
 (either on disk or in the network), which is the reason why the join
 takes so much time.
 After this bit of a context, here is my idea : would it be possible to
 somehow broadcast (or maybe more accurately, share or serve) a
 persisted RDD which is distributed on all workers, in a way that would,
 a bit like the IndexedRDD, allow a task to access a partition or an
 element of a partition in the closure, with a worker-local memory cache
 . i.e. the information about where each block resides would be
 distributed on the workers, to allow them to access parts of the RDD
 directly. I think that's already a bit how RDD are shuffled ?
 The RDD could stay distributed (no need to collect then broadcast), and
 only necessary transfers would be required.
 Is this a bad idea? Is it already implemented somewhere (I would love it!)?
 Or is it something that could add efficiency not only for my use
 case, but maybe for others? Could someone give me some hints about how I
 could add this possibility to Spark ? I would probably try to extend a
 RDD into a specific SharedIndexedRDD with a special lookup that would be
 allowed from tasks as a special case, and that would try to contact the
 blockManager and reach the corresponding data from the right worker.
 Thanks in advance for your advice
 Guillaume
 --
 eXenSa
   
 *Guillaume PITEL, Président*
 +33(0)626 222 431
 eXenSa S.A.S. http://www.exensa.com/ http://www.exensa.com/
 41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 

Re: multi-line comment style

2015-02-09 Thread Sandy Ryza
+1 to what Andrew said, I think both make sense in different situations and
trusting developer discretion here is reasonable.

On Mon, Feb 9, 2015 at 1:48 PM, Andrew Or and...@databricks.com wrote:

 In my experience I find it much more natural to use // for short multi-line
 comments (2 or 3 lines), and /* */ for long multi-line comments involving
 one or more paragraphs. For short multi-line comments, there is no reason
 not to use // if it just so happens that your first line exceeded 100
 characters and you have to wrap it. For long multi-line comments, however,
 using // all the way looks really awkward especially if you have multiple
 paragraphs.

 Thus, I would actually suggest that we don't try to pick a favorite and
 document that both are acceptable. I don't expect developers to follow my
 exact usage (i.e. with a tipping point of 2-3 lines) so I wouldn't enforce
 anything specific either.

 2015-02-09 13:36 GMT-08:00 Reynold Xin r...@databricks.com:

  Why don't we just pick // as the default (by encouraging it in the style
  guide), since it is mostly used, and then do not disallow /* */? I don't
  think it is that big of a deal to have slightly deviations here since it
 is
  dead simple to understand what's going on.
 
 
  On Mon, Feb 9, 2015 at 1:33 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Clearly there isn't a strictly optimal commenting format (pro's and
   cons for both '//' and '/*'). My thought is for consistency we should
   just chose one and put in the style guide.
  
   On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng men...@gmail.com
 wrote:
Btw, I think allowing `/* ... */` without the leading `*` in lines is
also useful. Check this line:
   
  
 
 https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55
   ,
where we put the R commands that can reproduce the test result. It is
easier if we write in the following style:
   
~~~
/*
 Using the following R code to load the data and train the model
 using
glmnet package.
   
 library(glmnet)
 data <- read.csv(path, header=FALSE, stringsAsFactors=FALSE)
 features <- as.matrix(data.frame(as.numeric(data$V2),
   as.numeric(data$V3)))
 label <- as.numeric(data$V1)
 weights <- coef(glmnet(features, label, family="gaussian", alpha = 0,
   lambda = 0))
 */
~~~
   
So people can copy  paste the R commands directly.
   
Xiangrui
   
On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com
  wrote:
I like the `/* .. */` style more. Because it is easier for IDEs to
recognize it as a block comment. If you press enter in the comment
block with the `//` style, IDEs won't add `//` for you. -Xiangrui
   
On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com
   wrote:
We should update the style doc to reflect what we have in most
 places
(which I think is //).
   
   
   
On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:
   
FWIW I like the multi-line // over /* */ from a purely style
   standpoint.
The Google Java style guide[1] has some comment about code
  formatting
   tools
working better with /* */ but there doesn't seem to be any strong
   arguments
for one over the other I can find
   
Thanks
Shivaram
   
[1]
   
   
  
 
 https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
   
On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
 pwend...@gmail.com
  
wrote:
   
 Personally I have no opinion, but agree it would be nice to
   standardize.

 - Patrick

 On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com
   wrote:
  One thing Marcelo pointed out to me is that the // style does
  not
  interfere with commenting out blocks of code with /* */, which
  is
   a
  small good thing. I am also accustomed to // style for
  multiline,
   and
  reserve /** */ for javadoc / scaladoc. Meaning, seeing the /*
 */
   style
  inline always looks a little funny to me.
 
  On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout 
kayousterh...@gmail.com
 wrote:
  Hi all,
 
  The Spark Style Guide
  

  
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

  says multi-line comments should be formatted as:
 
  /*
   * This is a
   * very
   * long comment.
   */
 
  But in my experience, we almost always use // for
 multi-line
comments:
 
  // This is a
  // very
  // long comment.
 
  Here are some examples:
 
 - Recent commit by Reynold, king of style:
 

   
  
 
 https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
 - RDD.scala:
 

   
  
 
 

Re: Improving metadata in Spark JIRA

2015-02-06 Thread Sandy Ryza
JIRA updates don't go to this list, they go to iss...@spark.apache.org.  I
don't think many are signed up for that list, and those that are probably
have a flood of emails anyway.

So I'd definitely be in favor of any JIRA cleanup that you're up for.

-Sandy

On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote:

 I've wasted no time in wielding the commit bit to complete a number of
 small, uncontroversial changes. I wouldn't commit anything that didn't
 already appear to have review, consensus and little risk, but please
 let me know if anything looked a little too bold, so I can calibrate.


 Anyway, I'd like to continue some small house-cleaning by improving
 the state of JIRA's metadata, in order to let it give us a little
 clearer view on what's happening in the project:

 a. Add Component to every (open) issue that's missing one
 b. Review all Critical / Blocker issues to de-escalate ones that seem
 obviously neither
 c. Correct open issues that list a Fix version that has already been
 released
 d. Close all issues Resolved for a release that has already been released

 The problem with doing so is that it will create a tremendous amount
 of email to the list, like, several hundred. It's possible to make
 bulk changes and suppress e-mail though, which could be done for all
 but b.

 Better to suppress the emails when making such changes? or just not
 bother on some of these?

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu,

Does the issue not show up if you run map(f =>
f(1).asInstanceOf[Int]).sum on the train RDD?  It appears that f(1) is
a String, not an Int.  If you're looking to parse and convert it, toInt
should be used instead of asInstanceOf.
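
One way to write that fix (a sketch, assuming f(1) holds a numeric string):

  cached120.map(f => f(1).toString.toInt).sum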

-Sandy

On Wed, Jan 21, 2015 at 8:43 AM, Dirceu Semighini Filho 
dirceu.semigh...@gmail.com wrote:

 Hi guys, has anyone found something like this?
 I have a training set, and when I repartition it, if I call cache it throws
 a ClassCastException when I try to execute anything that accesses it

 val rep120 = train.repartition(120)
 val cached120 = rep120.cache
 cached120.map(f => f(1).asInstanceOf[Int]).sum

 Cell Toolbar:
In [1]:

 ClusterSettings.executorMemory=Some("28g")

 ClusterSettings.maxResultSize = "20g"

 ClusterSettings.resume=true

 ClusterSettings.coreInstanceType="r3.xlarge"

 ClusterSettings.coreInstanceCount = 30

 ClusterSettings.clusterName="UberdataContextCluster-Dirceu"

 uc.applyDateFormat("YYMMddHH")

 Searching for existing cluster UberdataContextCluster-Dirceu ...
 Spark standalone cluster started at
 http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:8080
 Found 1 master(s), 30 slaves
 Ganglia started at
 http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:5080/ganglia

 In [37]:

 import org.apache.spark.sql.catalyst.types._

 import eleflow.uberdata.util.IntStringImplicitTypeConverter._

 import eleflow.uberdata.enums.SupportedAlgorithm._

 import eleflow.uberdata.data._

 import org.apache.spark.mllib.tree.DecisionTree

 import eleflow.uberdata.enums.DateSplitType._

 import org.apache.spark.mllib.regression.LabeledPoint

 import org.apache.spark.mllib.linalg.Vectors

 import org.apache.spark.mllib.classification._

 import eleflow.uberdata.model._

 import eleflow.uberdata.data.stat.Statistics

 import eleflow.uberdata.enums.ValidationMethod._

 import org.apache.spark.rdd.RDD

 In [5]:

 val train =
 uc.load(uc.toHDFSURI("/tmp/data/input/train_rev4.csv")).applyColumnTypes(Seq(DecimalType(),
 LongType,TimestampType, StringType,


  StringType, StringType, StringType, StringType,


   StringType, StringType, StringType, StringType,


   StringType, StringType, StringType, StringType,


  StringType, StringType, StringType, StringType,


   LongType, LongType,StringType, StringType,StringType,


   StringType,StringType))

 .formatDateValues(2,DayOfAWeek | Period).slice(excludes = Seq(12,13))

 Out[5]:

 [flattened notebook output: first rows of the train DataFrame with columns id, click, hour1, hour2, C1, banner_pos, site_id, site_domain, site_category, app_id, app_domain, app_category, device_model, device_type, device_conn_type, C14-C21]
 In [7]:

 val test =
 uc.load(uc.toHDFSURI("/tmp/data/input/test_rev4.csv")).applyColumnTypes(Seq(DecimalType(),
 TimestampType, StringType,


  StringType, StringType, StringType, StringType,


   StringType, StringType, StringType, StringType,


   StringType, StringType, StringType, StringType,


  StringType, StringType, StringType, StringType,


   LongType, LongType,StringType, StringType,StringType,


   StringType,StringType)).

 formatDateValues(1,DayOfAWeek | Period).slice(excludes =Seq(11,12))

 Out[7]:
 [flattened notebook output: first rows of the test DataFrame with columns id, hour1, hour2, C1, banner_pos, site_id, site_domain, site_category, app_id, app_domain, app_category, device_model, device_type, device_conn_type, C14-C21]
 

Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
I think clarifying these semantics is definitely worthwhile. Maybe this 
complicates the process with additional terminology, but the way I've used 
these has been:

+1 - I think this is safe to merge and, barring objections from others, would 
merge it immediately.

LGTM - I have no concerns about this patch, but I don't necessarily feel 
qualified to make a final call about it.  The TM part acknowledges the judgment 
as a little more subjective.

I think having some concise way to express both of these is useful.

-Sandy

 On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey All,
 
 Just wanted to ping about a minor issue - but one that ends up having
 consequence given Spark's volume of reviews and commits. As much as
 possible, I think that we should try and gear towards Google Style
 LGTM on reviews. What I mean by this is that LGTM has the following
 semantics:
 
 I know this code well, or I've looked at it close enough to feel
 confident it should be merged. If there are issues/bugs with this code
 later on, I feel confident I can help with them.
 
 Here is an alternative semantic:
 
 Based on what I know about this part of the code, I don't see any
 show-stopper problems with this patch.
 
 The issue with the latter is that it ultimately erodes the
 significance of LGTM, since subsequent reviewers need to reason about
 what the person meant by saying LGTM. In contrast, having strong
 semantics around LGTM can help streamline reviews a lot, especially as
 reviewers get more experienced and gain trust from the comittership.
 
 There are several easy ways to give a more limited endorsement of a patch:
 - I'm not familiar with this code, but style, etc look good (general
 endorsement)
 - The build changes in this code LGTM, but I haven't reviewed the
 rest (limited LGTM)
 
 If people are okay with this, I might add a short note on the wiki.
 I'm sending this e-mail first, though, to see whether anyone wants to
 express agreement or disagreement with this approach.
 
 - Patrick
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
Yeah, the ASF +1 has become partly overloaded to mean both I would like to see 
this feature and this patch should be committed, although, at least in 
Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should 
unambiguously mean the latter unless qualified in some other way.

I don't have any opinion on the specific characters, but I agree with Aaron 
that it would be nice to have some sort of abbreviation for both the strong and 
weak forms of approval.

-Sandy

 On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 I think the ASF +1 is *slightly* different than Google's LGTM, because
 it might convey wanting the patch/feature to be merged but not
 necessarily saying you did a thorough review and stand behind it's
 technical contents. For instance, I've seen people pile on +1's to try
 and indicate support for a feature or patch in some projects, even
 though they didn't do a thorough technical review. This +1 is
 definitely a useful mechanism.
 
 There is definitely much overlap though in the meaning, though, and
 it's largely because Spark had it's own culture around reviews before
 it was donated to the ASF, so there is a mix of two styles.
 
 Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
 proposed originally (unlike the one Sandy proposed, e.g.). This is
 what I've seen every project using the LGTM convention do (Google, and
 some open source projects such as Impala) to indicate technical
 sign-off.
 
 - Patrick
 
 On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com wrote:
 I think I've seen something like +2 = strong LGTM and +1 = weak LGTM;
 someone else should review before. It's nice to have a shortcut which isn't
 a sentence when talking about weaker forms of LGTM.
 
 On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
 
 I think clarifying these semantics is definitely worthwhile. Maybe this
 complicates the process with additional terminology, but the way I've used
 these has been:
 
 +1 - I think this is safe to merge and, barring objections from others,
 would merge it immediately.
 
 LGTM - I have no concerns about this patch, but I don't necessarily feel
 qualified to make a final call about it.  The TM part acknowledges the
 judgment as a little more subjective.
 
 I think having some concise way to express both of these is useful.
 
 -Sandy
 
 On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey All,
 
 Just wanted to ping about a minor issue - but one that ends up having
 consequence given Spark's volume of reviews and commits. As much as
 possible, I think that we should try and gear towards Google Style
 LGTM on reviews. What I mean by this is that LGTM has the following
 semantics:
 
 I know this code well, or I've looked at it close enough to feel
 confident it should be merged. If there are issues/bugs with this code
 later on, I feel confident I can help with them.
 
 Here is an alternative semantic:
 
 Based on what I know about this part of the code, I don't see any
 show-stopper problems with this patch.
 
 The issue with the latter is that it ultimately erodes the
 significance of LGTM, since subsequent reviewers need to reason about
 what the person meant by saying LGTM. In contrast, having strong
 semantics around LGTM can help streamline reviews a lot, especially as
 reviewers get more experienced and gain trust from the comittership.
 
 There are several easy ways to give a more limited endorsement of a
 patch:
 - I'm not familiar with this code, but style, etc look good (general
 endorsement)
 - The build changes in this code LGTM, but I haven't reviewed the
 rest (limited LGTM)
 
 If people are okay with this, I might add a short note on the wiki.
 I'm sending this e-mail first, though, to see whether anyone wants to
 express agreement or disagreement with this approach.
 
 - Patrick
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Dev

2014-12-19 Thread Sandy Ryza
Hi Harikrishna,

A good place to start is taking a look at the wiki page on contributing:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy

On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli 
harikrishna.kamepa...@gmail.com wrote:

 I am interested in contributing to Spark.



Re: one hot encoding

2014-12-13 Thread Sandy Ryza
Hi Lochana,

We haven't yet added this in 1.2.
https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical
feature indexing, which one-hot encoding can be built on.
https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of
this prior to the ML pipelines work.

-Sandy

On Fri, Dec 12, 2014 at 6:16 PM, Lochana Menikarachchi locha...@gmail.com
wrote:

 Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't
 available in 1.1.0.
 Thanks.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding).  Tested on Ubuntu against YARN.

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote:

 +1

 Tested on OS X.

 On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.2.0!
 
  The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1055/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.2.0!
 
  The vote is open until Saturday, December 13, at 21:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.2.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What justifies a -1 vote for this release? ==
  This vote is happening relatively late into the QA period, so
  -1 votes should only occur for significant regressions from
  1.0.2. Bugs already present in 1.1.X, minor
  regressions, or bugs related to new features will not block this
  release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.shuffle.blockTransferService has been
  changed to netty
  -- Old behavior can be restored by switching to nio
 
  2. The default value of spark.shuffle.manager has been changed to
 sort.
  -- Old behavior can be restored by setting spark.shuffle.manager to
  hash.
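
  For reference, a minimal sketch of what keeping the old defaults would look
  like in conf/spark-defaults.conf (property names and values taken from the
  notes above; everything else about the file is unchanged):

    spark.shuffle.blockTransferService   nio
    spark.shuffle.manager                hash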
 
  == How does this differ from RC1 ==
  This has fixes for a handful of issues identified - some of the
  notable fixes are:
 
  [Core]
  SPARK-4498: Standalone Master can fail to recognize completed/failed
  applications
 
  [SQL]
  SPARK-4552: Query for empty parquet table in spark sql hive get
  IllegalArgumentException
  SPARK-4753: Parquet2 does not prune based on OR filters on partition
  columns
  SPARK-4761: With JDBC server, set Kryo as default serializer and
  disable reference tracking
  SPARK-4785: When called with arguments referring column fields, PMOD
  throws NPE
 
  - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org javascript:;
  For additional commands, e-mail: dev-h...@spark.apache.org
 javascript:;
 
 



Re: HA support for Spark

2014-12-10 Thread Sandy Ryza
I think that if we were able to maintain the full set of created RDDs as
well as some scheduler and block manager state, it would be enough for most
apps to recover.
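
For context, the RDD checkpointing mentioned below is driver-initiated; a
minimal Scala sketch, assuming an existing SparkContext named sc and with the
HDFS paths as placeholders:

  // persist an RDD's data to reliable storage so a restarted driver could
  // reload it; scheduler and executor state is not covered by this mechanism
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // placeholder path
  val ratings = sc.textFile("hdfs:///data/ratings")      // placeholder input
  ratings.checkpoint()   // marks the RDD; data is written on the next action
  ratings.count()        // triggers the actual write to the checkpoint dir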

On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Well, it should not be mission impossible, considering there are so many HA
 solutions in existence today. I would be interested to know if there are any
 specific difficulties.

 Best Regards


 Jun Feng Liu
 IBM China Systems & Technology Laboratory in Beijing
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com
 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China





  From: Reynold Xin r...@databricks.com
  Date: 2014/12/10 16:30
  To: Jun Feng Liu/China/IBM@IBMCN
  Cc: dev@spark.apache.org
  Subject: Re: HA support for Spark




 This would be plausible for specific purposes such as Spark streaming or
 Spark SQL, but I don't think it is doable for general Spark driver since it
 is just a normal JVM process with arbitrary program state.

 On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

  Do we have any high availability support at the Spark driver level? For
  example, we would like the Spark driver to be able to move to another node
  and continue execution when a failure happens. I can see that RDD
  checkpointing can help serialize the status of an RDD, and I can imagine
  loading the checkpoint from another node when an error happens, but it seems
  we would lose track of all task status and even the executor information
  maintained in the Spark context. I am not sure if there is any existing
  mechanism I can leverage to do that. Thanks for any suggestions.
 
  Best Regards
 
 
  Jun Feng Liu
  IBM China Systems & Technology Laboratory in Beijing
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com
  BLD 28, ZGC Software Park
  No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China
 
 
 
 
 




Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Sandy Ryza
+1 (non-binding)

built from source
fired up a spark-shell against YARN cluster
ran some jobs using parallelize
ran some jobs that read files
clicked around the web UI


On Sun, Nov 30, 2014 at 1:10 AM, GuoQiang Li wi...@qq.com wrote:

 +1 (non-binding‍)




 -- Original --
 From:  Patrick Wendell;pwend...@gmail.com;
 Date:  Sat, Nov 29, 2014 01:16 PM
 To:  dev@spark.apache.orgdev@spark.apache.org;

 Subject:  [VOTE] Release Apache Spark 1.2.0 (RC1)



 Please vote on releasing the following candidate as Apache Spark version
 1.2.0!

 The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1048/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.0!

 The vote is open until Tuesday, December 02, at 05:15 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.

 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to netty
 -- Old behavior can be restored by switching to nio

 2. The default value of spark.shuffle.manager has been changed to sort.
 -- Old behavior can be restored by setting spark.shuffle.manager to
 hash.

 == Other notes ==
 Because this vote is occurring over a weekend, I will likely extend
 the vote if this RC survives until the end of the vote period.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



Re: Too many open files error

2014-11-19 Thread Sandy Ryza
Qiuzhuang,

This is a known issue that ExternalAppendOnlyMap can do tons of tiny spills
in certain situations. SPARK-4452 aims to deal with this issue, but we
haven't finalized a solution yet.

Dinesh's solution should help as a workaround, but you'll likely experience
suboptimal performance when trying to merge tons of small files from disk.

-Sandy

On Wed, Nov 19, 2014 at 10:10 PM, Dinesh J. Weerakkody 
dineshjweerakk...@gmail.com wrote:

 Hi Qiuzhuang,

 This is a linux related issue. Please go through this [1] and change the
 limits. hope this will solve your problem.

 [1] https://rtcamp.com/tutorials/linux/increase-open-files-limit/
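
  A minimal sketch of that workaround on a typical Linux box (the limit value
  and user name here are just examples):

    # check the current per-process open-file limit for the user running Spark
    ulimit -n

    # raise it for the current shell session
    ulimit -n 65536

    # or persistently, via entries in /etc/security/limits.conf:
    #   sparkuser  soft  nofile  65536
    #   sparkuser  hard  nofile  65536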

 On Thu, Nov 20, 2014 at 9:45 AM, Qiuzhuang Lian qiuzhuang.l...@gmail.com
 wrote:

  Hi All,
 
  While doing some ETL, I  run into error of 'Too many open files' as
  following logs,
 
  Thanks,
  Qiuzhuang
 
  14/11/20 20:12:02 INFO collection.ExternalAppendOnlyMap: Thread 63
 spilling
  in-memory map of 100.8 KB to disk (953 times so far)
  14/11/20 20:12:02 ERROR storage.DiskBlockObjectWriter: Uncaught exception
  while reverting partial writes to file
 
 
 /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b
  java.io.FileNotFoundException:
 
 
 /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b
  (Too many open files)
  at java.io.FileOutputStream.open(Native Method)
  at java.io.FileOutputStream.init(FileOutputStream.java:221)
  at
 
 
 org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:178)
  at
 
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203)
  at
 
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:63)
  at
 
 
 org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77)
  at
 
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:63)
  at
 
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:131)
  at
 
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
  at
 
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
  at
 
 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at
 
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at
  scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at
  org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at
  org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at
 
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at
  org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
  14/11/20 20:12:02 ERROR executor.Executor: Exception in task 0.0 in stage
  36.0 (TID 20)
  java.io.FileNotFoundException:
 
 
 

Re: Spark Hadoop 2.5.1

2014-11-14 Thread sandy . ryza
You're the second person to request this today. Planning to include this in my 
PR for Spark-4338.

-Sandy

 On Nov 14, 2014, at 8:48 AM, Corey Nolet cjno...@gmail.com wrote:
 
 In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like
 you've mentioned. What prompted me to write this email was that I did not
 see any documentation that told me Hadoop 2.5.1 was officially supported by
 Spark (i.e. community has been using it, any bugs are being fixed, etc...).
 It builds, tests pass, etc... but there could be other implications that I
 have not run into based on my own use of the framework.
 
 If we are saying that the standard procedure is to build with the
 hadoop-2.4 profile and override the -Dhadoop.version property, should we
 provide that on the build instructions [1] at least?
 
 [1] http://spark.apache.org/docs/latest/building-with-maven.html
 
 On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote:
 
 I don't think it's necessary. You're looking at the hadoop-2.4
 profile, which works with anything >= 2.4. AFAIK there is no further
 specialization needed beyond that. The profile sets hadoop.version to
 2.4.0 by default, but this can be overridden.
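
  Concretely, a sketch of such a build invocation (profile and property names
  as discussed in this thread; -Pyarn and -DskipTests are just to keep the
  example typical and short):

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package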
 
 On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet cjno...@gmail.com wrote:
 I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
 the current stable Hadoop 2.x, would it make sense for us to update the
 poms?
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Sandy Ryza
Currently there are no mandatory profiles required to build Spark.  I.e.
mvn package just works.  It seems sad that we would need to break this.

On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell pwend...@gmail.com
wrote:

 I think printing an error that says -Pscala-2.10 must be enabled is
 probably okay. It's a slight regression but it's super obvious to
 users. That could be a more elegant solution than the somewhat
 complicated monstrosity I proposed on the JIRA.

 On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma scrapco...@gmail.com
 wrote:
  One thing we can do is print a helpful error and break. I don't know
  exactly how this can be done, but now that we can write Groovy inside the
  Maven build, we have more control. (Yay!!)
 
  Prashant Sharma
 
 
 
  On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah Sandy and I were chatting about this today and didn't realize
  -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
  thinking maybe we could try to remove that. Also if someone doesn't
  give -Pscala-2.10 it fails in a way that is initially silent, which is
  bad because most people won't know to do this.
 
  https://issues.apache.org/jira/browse/SPARK-4375
 
  On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma scrapco...@gmail.com
 
  wrote:
   Thanks Patrick, I have one suggestion: we should make passing
   -Pscala-2.10 mandatory for Maven users. I am sorry for not mentioning
   this before. There is no way around passing that option for Maven users
   (only). However, this is unnecessary for sbt users because it is added
   automatically if -Pscala-2.11 is absent.
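
   For example, the two Maven invocations being discussed would look roughly
   like this (a sketch; other profiles and flags omitted):

     # Scala 2.10 build (the -Pscala-2.10 flag under discussion)
     mvn -Pscala-2.10 -DskipTests clean package

     # Scala 2.11 build
     mvn -Pscala-2.11 -DskipTests clean package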
  
  
   Prashant Sharma
  
  
  
   On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen so...@cloudera.com
 wrote:
  
   - Tip: when you rebase, IntelliJ will temporarily think things like
 the
   Kafka module are being removed. Say 'no' when it asks if you want to
   remove
   them.
   - Can we go straight to Scala 2.11.4?
  
   On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell pwend...@gmail.com
 
   wrote:
  
Hey All,
   
I've just merged a patch that adds support for Scala 2.11 which
 will
have some minor implications for the build. These are due to the
complexities of supporting two versions of Scala in a single
 project.
   
1. The JDBC server will now require a special flag to build
-Phive-thriftserver on top of the existing flag -Phive. This is
because some build permutations (only in Scala 2.11) won't support
the
JDBC server yet due to transitive dependency conflicts.
   
2. The build now uses non-standard source layouts in a few
 additional
places (we already did this for the Hive project) - the repl and
 the
examples modules. This is just fine for maven/sbt, but it may
 affect
users who import the build in IDE's that are using these projects
 and
want to build Spark from the IDE. I'm going to update our wiki to
include full instructions for making this work well in IntelliJ.
   
If there are any other build related issues please respond to this
thread and we'll make sure they get sorted out. Thanks to Prashant
Sharma who is the author of this feature!
   
- Patrick
   
   
   
   
  
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-08 Thread Sandy Ryza
Ah awesome.  Passing custom serializers when persisting an RDD is exactly
one of the things I was thinking of.

-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540
 (one of our older JIRAs). I think it would be interesting to explore this
 further. Basically the way to add it into the API would be to add a version
 of persist() that takes another class than StorageLevel, say
 StorageStrategy, which allows specifying a custom serializer or perhaps
 even a transformation to turn each partition into another representation
 before saving it. It would also be interesting if this could work directly
 on an InputStream or ByteBuffer to deal with off-heap data.

 One issue we've found with our current Serializer interface by the way is
 that a lot of type information is lost when you pass data to it, so the
 serializers spend a fair bit of time figuring out what class each object
 written is. With this model, it would be possible for a serializer to know
 that all its data is of one type, which is pretty cool, but we might also
 consider ways of expanding the current Serializer interface to take more
 info.

 Matei

  On Nov 7, 2014, at 1:09 AM, Reynold Xin r...@databricks.com wrote:
 
  Technically you can already do custom serializer for each shuffle
 operation
  (it is part of the ShuffledRDD). I've seen Matei suggesting on jira
 issues
  (or github) in the past a storage policy in which you can specify how
  data should be stored. I think that would be a great API to have in the
  long run. Designing it won't be trivial though.
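
  As a concrete illustration of the per-shuffle knob mentioned above, a rough
  Scala sketch (ShuffledRDD is a developer API, and the data and app name here
  are made up):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.ShuffledRDD
    import org.apache.spark.serializer.KryoSerializer

    val conf = new SparkConf().setAppName("per-shuffle-serializer-sketch")
    val sc = new SparkContext(conf)

    // toy key-value data; only this one shuffle uses Kryo, while the
    // application's default serializer is left untouched
    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
      .setSerializer(new KryoSerializer(conf))
    shuffled.count()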
 
 
  On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  Hey all,
 
  Was messing around with Spark and Google FlatBuffers for fun, and it
 got me
  thinking about Spark and serialization.  I know there's been work / talk
  about in-memory columnar formats in Spark SQL, so maybe there are ways to
  provide this flexibility already that I've missed?  Either way, my
  thoughts:
 
  Java and Kryo serialization are really nice in that they require almost
 no
  extra work on the part of the user.  They can also represent complex
 object
  graphs with cycles etc.
 
  There are situations where other serialization frameworks are more
  efficient:
  * A Hadoop Writable style format that delineates key-value boundaries
 and
  allows for raw comparisons can greatly speed up some shuffle operations
 by
  entirely avoiding deserialization until the object hits user code.
  Writables also probably ser / deser faster than Kryo.
  * No-deserialization formats like FlatBuffers and Cap'n Proto address
 the
  tradeoff between (1) Java objects that offer fast access but take lots
 of
  space and stress GC and (2) Kryo-serialized buffers that are more
 compact
  but take time to deserialize.
 
  The drawbacks of these frameworks are that they require more work from
 the
  user to define types.  And that they're more restrictive in the
 reference
  graphs they can represent.
 
  In large applications, there are probably a few points where a
  specialized serialization format is useful. But requiring Writables
  everywhere because they're needed in a particularly intense shuffle is
  cumbersome.
 
  In light of that, would it make sense to enable varying Serializers
 within
  an app? It could make sense to choose a serialization framework both
 based
  on the objects being serialized and what they're being serialized for
  (caching vs. shuffle).  It might be possible to implement this
 underneath
  the Serializer interface with some sort of multiplexing serializer that
  chooses between subserializers.
 
  Nothing urgent here, but curious to hear others' opinions.
 
  -Sandy
 




proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Sandy Ryza
Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me
thinking about Spark and serialization.  I know there's been work / talk
about in-memory columnar formats in Spark SQL, so maybe there are ways to
provide this flexibility already that I've missed?  Either way, my thoughts:

Java and Kryo serialization are really nice in that they require almost no
extra work on the part of the user.  They can also represent complex object
graphs with cycles etc.

There are situations where other serialization frameworks are more
efficient:
* A Hadoop Writable style format that delineates key-value boundaries and
allows for raw comparisons can greatly speed up some shuffle operations by
entirely avoiding deserialization until the object hits user code.
Writables also probably ser / deser faster than Kryo.
* No-deserialization formats like FlatBuffers and Cap'n Proto address the
tradeoff between (1) Java objects that offer fast access but take lots of
space and stress GC and (2) Kryo-serialized buffers that are more compact
but take time to deserialize.

The drawbacks of these frameworks are that they require more work from the
user to define types.  And that they're more restrictive in the reference
graphs they can represent.

In large applications, there are probably a few points where a
specialized serialization format is useful. But requiring Writables
everywhere because they're needed in a particularly intense shuffle is
cumbersome.

In light of that, would it make sense to enable varying Serializers within
an app? It could make sense to choose a serialization framework both based
on the objects being serialized and what they're being serialized for
(caching vs. shuffle).  It might be possible to implement this underneath
the Serializer interface with some sort of multiplexing serializer that
chooses between subserializers.

Nothing urgent here, but curious to hear others' opinions.

-Sandy


Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sandy Ryza
It looks like the difference between the proposed Spark model and the
CloudStack / SVN model is:
* In the former, maintainers / partial committers are a way of centralizing
oversight over particular components among committers
* In the latter, maintainers / partial committers are a way of giving
non-committers some power to make changes

-Sandy

On Thu, Nov 6, 2014 at 5:17 PM, Corey Nolet cjno...@gmail.com wrote:

 PMC [1] is responsible for oversight and does not designate partial or full
 committers. There are projects where all committers become PMC and others
 where PMC is reserved for committers with the most merit (and willingness
 to take on the responsibility of project oversight, releases, etc...).
 The community maintains the codebase through committers. Committers are there
 to mentor, roll in patches, and spread the project throughout other communities.

 Adding someone's name to a list as a maintainer is not a barrier. With a
 community as large as Spark's, and myself not being a committer on this
 project, I see it as a welcome opportunity to find a mentor in the areas in
 which I'm interested in contributing. We'd expect the list of names to grow
 as more volunteers gain more interest, correct? To me, that seems quite
 contrary to a barrier.

 [1] http://www.apache.org/dev/pmc.html


 On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  So I don't understand, Greg, are the partial committers committers, or
 are
  they not? Spark also has a PMC, but our PMC currently consists of all
  committers (we decided not to have a differentiation when we left the
  incubator). I see the Subversion partial committers listed as
 committers
  on https://people.apache.org/committers-by-project.html#subversion, so I
  assume they are committers. As far as I can see, CloudStack is similar.
 
  Matei
 
   On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote:
  
   Partial committers are people invited to work on a particular area, and
  they do not require sign-off to work on that area. They can get a
 sign-off
  and commit outside that area. That approach doesn't compare to this
  proposal.
  
   Full committers are PMC members. As each PMC member is responsible for
  *every* line of code, then every PMC member should have complete rights
 to
  every line of code. Creating disparity flies in the face of a PMC
 member's
  responsibility. If I am a Spark PMC member, then I have responsibility
 for
  GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And
  interposing a barrier inhibits my responsibility to ensure GraphX is
  designed, maintained, and delivered to the Public.
  
   Cheers,
   -g
  
   (and yes, I'm aware of COMMITTERS; I've been changing that file for the
  past 12 years :-) )
  
   On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
   In fact, if you look at the subversion commiter list, the majority of
   people here have commit access only for particular areas of the
   project:
  
   http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS 
  http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS
  
   On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
Hey Greg,
   
Regarding subversion - I think the reference is to partial vs full
committers here:
https://subversion.apache.org/docs/community-guide/roles.html 
  https://subversion.apache.org/docs/community-guide/roles.html
   
- Patrick
   
On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com
 mailto:
  gst...@gmail.com wrote:
-1 (non-binding)
   
This is an idea that runs COMPLETELY counter to the Apache Way, and
 is
to be severely frowned up. This creates *unequal* ownership of the
codebase.
   
Each Member of the PMC should have *equal* rights to all areas of
 the
codebase until their purview. It should not be subjected to others'
ownership except throught the standard mechanisms of reviews and
if/when absolutely necessary, to vetos.
   
Apache does not want leads, benevolent dictators or assigned
maintainers, no matter how you may dress it up with multiple
maintainers per component. The fact is that this creates an unequal
level of ownership and responsibility. The Board has shut down
projects that attempted or allowed for Leads. Just a few months
 ago,
there was a problem with somebody calling themself a Lead.
   
I don't know why you suggest that Apache Subversion does this. We
absolutely do not. Never have. Never will. The Subversion codebase
 is
owned by all of us, and we all care for every line of it. Some
 people
know more than others, of course. But any one of us, can change any
part, without being subjected to a maintainer. Of course, we ask
people with more knowledge of the component when we feel
uncomfortable, but we also know when it is safe or not to make a
specific 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Sandy Ryza
This seems like a good idea.

An area that wasn't listed, but that I think could strongly benefit from
maintainers, is the build.  Having consistent oversight over Maven, SBT,
and dependencies would allow us to avoid subtle breakages.

Component maintainers have come up several times within the Hadoop project,
and I think one of the main reasons the proposals have been rejected is
that, structurally, its effect is to slow down development.  As you
mention, this is somewhat mitigated if being a maintainer leads committers
to take on more responsibility, but it might be worthwhile to draw up more
specific ideas on how to combat this?  E.g. do obvious changes, doc fixes,
test fixes, etc. always require a maintainer?

-Sandy

On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust mich...@databricks.com
wrote:

 +1 (binding)

 On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  BTW, my own vote is obviously +1 (binding).
 
  Matei
 
   On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
  
   Hi all,
  
   I wanted to share a discussion we've been having on the PMC list, as
  well as call for an official vote on it on a public list. Basically, as
 the
  Spark project scales up, we need to define a model to make sure there is
  still great oversight of key components (in particular internal
  architecture and public APIs), and to this end I've proposed
 implementing a
  maintainer model for some of these components, similar to other large
  projects.
  
   As background on this, Spark has grown a lot since joining Apache.
 We've
  had over 80 contributors/month for the past 3 months, which I believe
 makes
  us the most active project in contributors/month at Apache, as well as
 over
  500 patches/month. The codebase has also grown significantly, with new
  libraries for SQL, ML, graphs and more.
  
   In this kind of large project, one common way to scale development is
 to
  assign maintainers to oversee key components, where each patch to that
  component needs to get sign-off from at least one of its maintainers.
 Most
  existing large projects do this -- at Apache, some large ones with this
  model are CloudStack (the second-most active project overall),
 Subversion,
  and Kafka, and other examples include Linux and Python. This is also
  by-and-large how Spark operates today -- most components have a de-facto
  maintainer.
  
   IMO, adopting this model would have two benefits:
  
   1) Consistent oversight of design for that component, especially
  regarding architecture and API. This process would ensure that the
  component's maintainers see all proposed changes and consider them to fit
  together in a good way.
  
   2) More structure for new contributors and committers -- in particular,
  it would be easy to look up who’s responsible for each module and ask
 them
  for reviews, etc, rather than having patches slip between the cracks.
  
   We'd like to start with in a light-weight manner, where the model only
  applies to certain key components (e.g. scheduler, shuffle) and
 user-facing
  APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
  it if we deem it useful. The specific mechanics would be as follows:
  
   - Some components in Spark will have maintainers assigned to them,
 where
  one of the maintainers needs to sign off on each patch to the component.
   - Each component with maintainers will have at least 2 maintainers.
   - Maintainers will be assigned from the most active and knowledgeable
  committers on that component by the PMC. The PMC can vote to add / remove
  maintainers, and maintained components, through consensus.
   - Maintainers are expected to be active in responding to patches for
  their components, though they do not need to be the main reviewers for
 them
  (e.g. they might just sign off on architecture / API). To prevent
 inactive
  maintainers from blocking the project, if a maintainer isn't responding
 in
  a reasonable time period (say 2 weeks), other committers can merge the
  patch, and the PMC will want to discuss adding another maintainer.
  
   If you'd like to see examples for this model, check out the following
  projects:
   - CloudStack:
 
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
  
 
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
  
   - Subversion:
  https://subversion.apache.org/docs/community-guide/roles.html 
  https://subversion.apache.org/docs/community-guide/roles.html
  
   Finally, I wanted to list our current proposal for initial components
  and maintainers. It would be good to get feedback on other components we
  might add, but please note that personnel discussions (e.g. I don't
 think
  Matei should maintain *that* component) should only happen on the private
  list. The initial components were chosen to include all public APIs and
 the
  main core components, and the maintainers were chosen 

Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
MapReduce counters do not count duplications.  In MapReduce, if a task
needs to be re-run, the value of the counter from the second task
overwrites the value from the first task.

-Sandy

On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

  If you think it is necessary to fix, I would like to resubmit that PR
 (seems to have some conflicts with the current DAGScheduler)

 My suggestion is to make it an option in the accumulator: e.g. some
 algorithms utilizing accumulators for result calculation need a
 deterministic accumulator, while others implementing something like Hadoop
 counters may need the current implementation (count everything that
 happened, including the duplications).

 Your thoughts?

 --
 Nan Zhu

 On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:

 Hmm, good point, this seems to have been broken by refactorings of the
 scheduler, but it worked in the past. Basically the solution is simple --
 in a result stage, we should not apply the update for each task ID more
 than once -- the same way we don't call job.listener.taskSucceeded more
 than once. Your PR also tried to avoid this for resubmitted shuffle stages,
 but I don't think we need to do that necessarily (though we could).

 Matei

 On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com)
 wrote:

 Hi, Matei,

 Can you give some hints on how the current implementation guarantees the
 accumulator is only applied once?

 There is a pending PR trying to achieving this (
 https://github.com/apache/spark/pull/228/files), but from the current
 implementation, I didn’t see this has been done? (maybe I missed something)

 Best,

 --
 Nan Zhu

 On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:

  Hey Sandy,

 On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com)
 wrote:

 Hey All,

 A couple questions came up about shared variables recently, and I wanted
 to
 confirm my understanding and update the doc to be a little more clear.

 *Broadcast variables*
 Now that task data is automatically broadcast, the only occasions where
 it
 makes sense to explicitly broadcast are:
 * You want to use a variable from tasks in multiple stages.
 * You want to have the variable stored on the executors in deserialized
 form.
 * You want tasks to be able to modify the variable and have those
 modifications take effect for other tasks running on the same executor
 (usually a very bad idea).

 Is that right?
 Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also
 matters. (We might later factor tasks in a different way to avoid 2, but
 it's hard due to things like Hadoop JobConf objects in the tasks).


 *Accumulators*
 Values are only counted for successful tasks. Is that right? KMeans seems
 to use it in this way. What happens if a node goes away and successful
 tasks need to be resubmitted? Or the stage runs again because a different
 job needed it.
 Accumulators are guaranteed to give a deterministic result if you only
 increment them in actions. For each result stage, the accumulator's update
 from each task is only applied once, even if that task runs multiple times.
 If you use accumulators in transformations (i.e. in a stage that may be
 part of multiple jobs), then you may see multiple updates, from each run.
 This is kind of confusing but it was useful for people who wanted to use
 these for debugging.

 Matei





 thanks,
 Sandy
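
 To make the accumulator semantics described above concrete, a small Scala
 sketch (assuming an existing SparkContext named sc; paths and names are
 placeholders):

   val goodRecords = sc.accumulator(0)

   // a transformation: updates made here could be applied more than once if
   // the stage is re-run, so we avoid touching the accumulator in the map
   val parsed = sc.textFile("hdfs:///data/input").map(_.split(","))

   // an action: the update from each task is applied exactly once per result
   // stage, so the final value is deterministic
   parsed.foreach(fields => if (fields.length == 3) goodRecords += 1)
   println(goodRecords.value)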






Re: Spark authenticate enablement

2014-09-12 Thread Sandy Ryza
Hi Jun,

I believe that's correct that Spark authentication only works against YARN.

-Sandy
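
For reference, the knobs involved are roughly these, sketched as
conf/spark-defaults.conf entries (on YARN the secret is generated
automatically; elsewhere spark.authenticate.secret has to be set by hand, and
the value below is a placeholder):

  spark.authenticate          true
  spark.authenticate.secret   my-shared-secret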

On Thu, Sep 11, 2014 at 2:14 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Hi, there

 I am trying to enable authentication in Spark in standalone mode.
 It seems like only SparkSubmit loads the properties from spark-defaults.conf;
 org.apache.spark.deploy.master.Master does not really load the default
 settings from spark-defaults.conf.

 Does this mean Spark authentication only works for the YARN mode? Or have I
 missed something with standalone mode?

 Best Regards


 Jun Feng Liu
 IBM China Systems & Technology Laboratory in Beijing
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com
 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China







Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
After the change to broadcast all task data, is there any easy way to
discover the serialized size of the data getting sent down for a task?

thanks,
-Sandy


Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
It used to be available on the UI, no?

On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote:

 I don't think so. We should probably add a line to log it.


 On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 After the change to broadcast all task data, is there any easy way to
 discover the serialized size of the data getting sent down for a task?

 thanks,
 -Sandy




Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
Hmm, well I can't find it now, must have been hallucinating.  Do you know
off the top of your head where I'd be able to find the size to log it?
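
One rough way to eyeball it from the driver is to run a closure through the
closure serializer yourself; a sketch (SparkEnv is a developer API, the
captured map is made up, and the number is only an approximation of what the
scheduler actually broadcasts per task):

  import org.apache.spark.SparkEnv

  // requires an active SparkContext so that SparkEnv.get is populated
  val bigTable = (1 to 100000).map(i => (i, i.toString)).toMap  // made-up state
  val func = (x: Int) => bigTable.getOrElse(x, "")

  val ser = SparkEnv.get.closureSerializer.newInstance()
  val bytes = ser.serialize(func)
  println(s"serialized closure: ${bytes.limit()} bytes")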

On Thu, Sep 11, 2014 at 6:33 PM, Reynold Xin r...@databricks.com wrote:

 I didn't know about that 

 On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 It used to be available on the UI, no?

 On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote:

  I don't think so. We should probably add a line to log it.
 
 
  On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com
  wrote:
 
  After the change to broadcast all task data, is there any easy way to
  discover the serialized size of the data getting sent down for a task?
 
  thanks,
  -Sandy
 
 





Re: Lost executor on YARN ALS iterations

2014-09-10 Thread Sandy Ryza
That's right

On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Last time it did not show up on environment tab but I will give it another
 shot...Expected behavior is that this env variable will show up right ?

 On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 I would expect 2 GB would be enough or more than enough for 16 GB
 executors (unless ALS is using a bunch of off-heap memory?).  You mentioned
 earlier in this thread that the property wasn't showing up in the
 Environment tab.  Are you sure it's making it in?

 -Sandy

 On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hmm...I did try it increase to few gb but did not get a successful run
 yet...

 Any idea if I am using say 40 executors, each running 16GB, what's the
 typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large
 matrices with say few billion ratings...

 On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Deb,

 The current state of the art is to increase
 spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
 plans to try to automatically scale this based on the amount of memory
 requested, but it will still just be a heuristic.

 -Sandy

 On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Sandy,

 Any resolution for YARN failures ? It's a blocker for running spark on
 top of YARN.

 Thanks.
 Deb

 On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com
 wrote:

 Hi Deb,

 I think this may be the same issue as described in
 https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
 container got killed by YARN because it used much more memory that it
 requested. But we haven't figured out the root cause yet.

 +Sandy

 Best,
 Xiangrui

 On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das 
 debasish.da...@gmail.com wrote:
  Hi,
 
  During the 4th ALS iteration, I am noticing that one of the
 executor gets
  disconnected:
 
  14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
  SendingConnectionManagerId not found
 
  14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor
 5
  disconnected, so removing it
 
  14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
 executor 5
  on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client
 disassociated
 
  14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5
 (epoch 12)
  Any idea if this is a bug related to akka on YARN ?
 
  I am using master
 
  Thanks.
  Deb









Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
Hi Deb,

The current state of the art is to increase
spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
plans to try to automatically scale this based on the amount of memory
requested, but it will still just be a heuristic.

-Sandy

On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Sandy,

 Any resolution for YARN failures ? It's a blocker for running spark on top
 of YARN.

 Thanks.
 Deb

 On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 I think this may be the same issue as described in
 https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
 container got killed by YARN because it used much more memory that it
 requested. But we haven't figured out the root cause yet.

 +Sandy

 Best,
 Xiangrui

 On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  During the 4th ALS iteration, I am noticing that one of the executor
 gets
  disconnected:
 
  14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
  SendingConnectionManagerId not found
 
  14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
  disconnected, so removing it
 
  14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
 executor 5
  on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
 
  14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch
 12)
  Any idea if this is a bug related to akka on YARN ?
 
  I am using master
 
  Thanks.
  Deb





Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
I would expect 2 GB would be enough or more than enough for 16 GB executors
(unless ALS is using a bunch of off-heap memory?).  You mentioned earlier
in this thread that the property wasn't showing up in the Environment tab.
 Are you sure it's making it in?

-Sandy

On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hmm...I did try it increase to few gb but did not get a successful run
 yet...

 Any idea if I am using say 40 executors, each running 16GB, what's the
 typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large
 matrices with say few billion ratings...

 On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Deb,

 The current state of the art is to increase
 spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
 plans to try to automatically scale this based on the amount of memory
 requested, but it will still just be a heuristic.

 -Sandy

 On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Sandy,

 Any resolution for YARN failures ? It's a blocker for running spark on
 top of YARN.

 Thanks.
 Deb

 On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com
 wrote:

 Hi Deb,

 I think this may be the same issue as described in
 https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
 container got killed by YARN because it used much more memory that it
 requested. But we haven't figured out the root cause yet.

 +Sandy

 Best,
 Xiangrui

 On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  During the 4th ALS iteration, I am noticing that one of the executor
 gets
  disconnected:
 
  14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
  SendingConnectionManagerId not found
 
  14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
  disconnected, so removing it
 
  14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
 executor 5
  on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client
 disassociated
 
  14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5
 (epoch 12)
  Any idea if this is a bug related to akka on YARN ?
 
  I am using master
 
  Thanks.
  Deb







Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to
build the assembly jar without Hadoop and its dependencies.  We make use of
this in CDH packaging.

-Sandy
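
The option referred to here is a build profile; a sketch of how it is
typically invoked (the profile name below is the one used in the 1.x poms
I've seen, so verify it against your Spark version):

  mvn -Phadoop-provided -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package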


On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote:

 Hi Sean Owen,
 Here are some problems I ran into when using the assembly jar:
 1. I put spark-assembly-*.jar into the lib directory of my application, and it
 throws a compile error:

 Error:scalac: Error: class scala.reflect.BeanInfo not found.
 scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not
 found.

 at scala.tools.nsc.symtab.Definitions$definitions$.
 getModuleOrClass(Definitions.scala:655)

 at scala.tools.nsc.symtab.Definitions$definitions$.
 getClass(Definitions.scala:608)

 at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.
 init(GenJVM.scala:127)

 at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
 scala:85)

 at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)

 at scala.tools.nsc.Global$Run.compile(Global.scala:1041)

 at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)

 at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)

 at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)

 at xsbt.CompilerInterface.run(CompilerInterface.scala:27)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(
 NativeMethodAccessorImpl.java:39)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(
 DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)

 at sbt.compiler.AnalyzingCompiler.call(
 AnalyzingCompiler.scala:102)

 at sbt.compiler.AnalyzingCompiler.compile(
 AnalyzingCompiler.scala:48)

 at sbt.compiler.AnalyzingCompiler.compile(
 AnalyzingCompiler.scala:41)

 at org.jetbrains.jps.incremental.scala.local.
 IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)

 at org.jetbrains.jps.incremental.scala.local.LocalServer.
 compile(LocalServer.scala:25)

 at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
 scala:58)

 at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
 Main.scala:21)

 at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
 Main.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(
 NativeMethodAccessorImpl.java:39)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(
 DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)

 at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
 2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1.
   It runs successfully when using a bag of third-party jars as dependencies, but
 throws an error when using the assembly jar; it seems the assembly jar leads to a conflict:
   ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
 at org.apache.hadoop.hive.ql.io.parquet.serde.
 ArrayWritableObjectInspector.getObjectInspector(
 ArrayWritableObjectInspector.java:66)
 at org.apache.hadoop.hive.ql.io.parquet.serde.
 ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59)
 at org.apache.hadoop.hive.ql.io.parquet.serde.
 ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
 at org.apache.hadoop.hive.metastore.MetaStoreUtils.
 getDeserializer(MetaStoreUtils.java:339)
 at org.apache.hadoop.hive.ql.metadata.Table.
 getDeserializerFromMetaStore(Table.java:283)
 at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
 Table.java:189)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
 Hive.java:597)
 at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
 DDLTask.java:4194)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
 java:281)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
 at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
 TaskRunner.java:85)





 On 2014/9/2 16:45, Sean Owen wrote:

 Hm, are you suggesting that the Spark distribution be a bag of 100
 JARs? It doesn't quite seem reasonable. It does not remove version
 conflicts, just pushes them to run-time, which isn't good. The
 assembly is also necessary because that's where shading happens. In
 development, you want to run against exactly what will be used in a
 real Spark distro.

 On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com wrote:

 Hi all,
    I suggest that Spark not use the assembly jar as the default run-time
 dependency (spark-submit/spark-class depend on the assembly jar); using a
 library of all third-party dependency jars, as Hadoop/Hive/HBase do, seems
 more reasonable.

    1. The assembly jar packages all third-party jars into one big one, so we
 need to rebuild this jar if we want to update the version of some component
 (such as Hadoop).
    2 in 

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Sandy Ryza
Hi Debasish,

The fix is to raise spark.yarn.executor.memoryOverhead until this goes
away.  This controls the buffer between the JVM heap size and the amount of
memory requested from YARN (JVMs can take up memory beyond their heap
size). You should also make sure that, in the YARN NodeManager
configuration, yarn.nodemanager.vmem-check-enabled is set to false.

-Sandy
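
Concretely, a sketch of the two settings described above (the overhead value
is just an example, in megabytes):

  # spark-submit flag (or the equivalent spark-defaults.conf entry)
  --conf spark.yarn.executor.memoryOverhead=2048

  # yarn-site.xml on the NodeManagers
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>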


On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das debasish.da...@gmail.com
wrote:

 I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is
 definitely a YARN related problem...

 At least for me right now only deployment option possible is standalone...



 On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 I think this may be the same issue as described in
 https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
 container got killed by YARN because it used much more memory that it
 requested. But we haven't figured out the root cause yet.

 +Sandy

 Best,
 Xiangrui

 On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  During the 4th ALS iteration, I am noticing that one of the executor
 gets
  disconnected:
 
  14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
  SendingConnectionManagerId not found
 
  14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
  disconnected, so removing it
 
  14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
 executor 5
  on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
 
  14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch
 12)
  Any idea if this is a bug related to akka on YARN ?
 
  I am using master
 
  Thanks.
  Deb





Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
Hi Jun,

Spark currently doesn't have that feature, i.e. it aims for a fixed number
of executors per application regardless of resource usage, but it's
definitely worth considering.  We could start more executors when we have a
large backlog of tasks and shut some down when we're underutilized.

The fine-grained task scheduling is blocked on work from YARN that will
allow changing the CPU allocation of a YARN container dynamically.  The
relevant JIRA for this dependency is YARN-1197, though YARN-1488 might
serve this purpose as well if it comes first.

-Sandy


On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Thanks for echoing on this. Is it possible to adjust resources based on
 container numbers? E.g. allocate more containers when the driver needs more
 resources, and return some resources by deleting containers when parts of the
 containers already have enough cores/memory.

 Best Regards


 Jun Feng Liu
 IBM China Systems & Technology Laboratory in Beijing
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com
 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China





  From: Patrick Wendell pwend...@gmail.com
  Date: 2014/08/08 13:10
  To: Jun Feng Liu/China/IBM@IBMCN
  Cc: dev@spark.apache.org
  Subject: Re: Fine-Grained Scheduler on Yarn




 Hey sorry about that - what I said was the opposite of what is true.

 The current YARN mode is equivalent to coarse grained mesos. There is no
 fine-grained scheduling on YARN at the moment. I'm not sure YARN supports
 scheduling in units other than containers. Fine-grained scheduling requires
 scheduling at the granularity of individual cores.


 On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell *pwend...@gmail.com*
 pwend...@gmail.com wrote:
 The current YARN is equivalent to what is called fine grained mode in
 Mesos. The scheduling of tasks happens totally inside of the Spark driver.


 On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu *liuj...@cn.ibm.com*
 liuj...@cn.ibm.com wrote:
 Any one know the answer?
 Best Regards


 Jun Feng Liu
 IBM China Systems & Technology Laboratory in Beijing
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com
 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China




  From: Jun Feng Liu/China/IBM
  Date: 2014/08/07 15:37
  To: dev@spark.apache.org
  Subject: Fine-Grained Scheduler on Yarn





 Hi, there

 I'm just aware that right now Spark only supports a fine-grained scheduler on
 Mesos with MesosSchedulerBackend. The YARN scheduler sounds like it only works
 in a coarse-grained model. Is there any plan to implement a fine-grained
 scheduler for YARN? Or is there any technical issue blocking us from doing that?

 Best Regards


  Jun Feng Liu
  IBM China Systems & Technology Laboratory in Beijing
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com
  BLD 28, ZGC Software Park
  No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China









Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
I think that would be useful work.  I don't know the minute details of this
code, but in general TaskSchedulerImpl keeps track of pending tasks.  Tasks
are organized into TaskSets, each of which corresponds to a particular
stage.  Each TaskSet has a TaskSetManager, which directly tracks the
pending tasks for that stage.

-Sandy
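
To make the question about observing the backlog concrete, here is a minimal sketch
of a listener that tracks pending tasks from scheduler events; the class and counter
names are illustrative, and it assumes a Spark version whose SparkListener exposes
these callbacks:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

class BacklogListener extends SparkListener {
  val pending = new AtomicLong(0)  // tasks submitted but not yet finished

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    pending.addAndGet(stageSubmitted.stageInfo.numTasks)
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    pending.decrementAndGet()
  }
}

// sc.addSparkListener(new BacklogListener)   // a scaling policy could then poll `pending`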


On Fri, Aug 8, 2014 at 12:37 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Yes, I think we need resource control at both levels (the number of containers and
 dynamically changing container resources), which can make resource utilization much
 more effective, especially when more types of workload share the same infrastructure.

 Is there any way I can observe the task backlog in the scheduler backend? It sounds
 like the scheduler backend is triggered when a new task set is submitted, but I have
 not figured out whether there is a way to check the whole backlog of tasks inside it.
 I am interested in implementing some policy in the scheduler backend and testing to
 see how useful it would be.

 Best Regards


 *Jun Feng Liu*
 IBM China Systems & Technology Laboratory in Beijing

   --
  *Phone: *86-10-82452683

 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com

 BLD 28,ZGC Software Park
 No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
 China





  *Sandy Ryza sandy.r...@cloudera.com sandy.r...@cloudera.com*

 2014/08/08 15:14
   To
 Jun Feng Liu/China/IBM@IBMCN,
 cc
 Patrick Wendell pwend...@gmail.com, dev@spark.apache.org 
 dev@spark.apache.org
 Subject
 Re: Fine-Grained Scheduler on Yarn




 Hi Jun,

 Spark currently doesn't have that feature, i.e. it aims for a fixed number
 of executors per application regardless of resource usage, but it's
 definitely worth considering.  We could start more executors when we have a
 large backlog of tasks and shut some down when we're underutilized.

 The fine-grained task scheduling is blocked on work from YARN that will
 allow changing the CPU allocation of a YARN container dynamically.  The
 relevant JIRA for this dependency is YARN-1197, though YARN-1488 might
 serve this purpose as well if it comes first.

 -Sandy


 On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu liuj...@cn.ibm.com wrote:

  Thanks for the reply on this. Would it be possible to adjust resources based on
  container counts? E.g. allocate more containers when the driver needs more
  resources, and return resources by deleting some containers when parts of the
  workload already have enough cores/memory.
 
  Best Regards
 
 
  *Jun Feng Liu*

 
  IBM China Systems & Technology Laboratory in Beijing
 
--

  *Phone: *86-10-82452683
  * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com

 
  BLD 28,ZGC Software Park
  No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
  China
 
 
 
 
 
   *Patrick Wendell pwend...@gmail.com pwend...@gmail.com*

 
  2014/08/08 13:10
To
  Jun Feng Liu/China/IBM@IBMCN,
  cc
  dev@spark.apache.org dev@spark.apache.org
  Subject
  Re: Fine-Grained Scheduler on Yarn
 
 
 
 
  Hey sorry about that - what I said was the opposite of what is true.
 
  The current YARN mode is equivalent to coarse grained mesos. There is
 no
  fine-grained scheduling on YARN at the moment. I'm not sure YARN supports
  scheduling in units other than containers. Fine-grained scheduling
 requires
  scheduling at the granularity of individual cores.
 
 
  On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell *pwend...@gmail.com*

  pwend...@gmail.com wrote:
  The current YARN is equivalent to what is called fine grained mode in
  Mesos. The scheduling of tasks happens totally inside of the Spark
 driver.
 
 
  On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu *liuj...@cn.ibm.com*

  liuj...@cn.ibm.com wrote:
  Anyone know the answer?
  Best Regards
 
 
  * Jun Feng Liu*

 
  IBM China Systems & Technology Laboratory in Beijing
 
--
   *Phone: *86-10-82452683
  * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com

 
 
  BLD 28,ZGC Software Park
  No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
  China
 
 
 
 
*Jun Feng Liu/China/IBM*
 
  2014/08/07 15:37
 
To
  *dev@spark.apache.org* dev@spark.apache.org,

  cc
Subject
  Fine-Grained Scheduler on Yarn
 
 
 
 
 
  Hi, there
 
  Just became aware that right now Spark only supports a fine-grained scheduler on
  Mesos, with MesosSchedulerBackend. The YARN scheduler sounds like it only works in
  a coarse-grained model. Is there any plan to implement a fine-grained scheduler
  for YARN? Or is there a technical issue blocking us from doing that?
 
  Best Regards
 
 
  * Jun Feng Liu*

 
  IBM China Systems & Technology Laboratory in Beijing
 
--
   *Phone: *86-10-82452683
  * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com

 
 
  BLD 28,ZGC Software Park
  No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
  China
 
 
 
 
 
 
 




Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Sandy Ryza
Yeah, the input format doesn't support this behavior.  But it does tell you
the byte position of each record in the file.
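
A sketch of how those byte positions can be used to drop a header from every file,
reading through the Hadoop input format directly (the path is hypothetical and `sc`
is assumed):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// TextInputFormat's key is the byte offset of each record within its file,
// so offset 0 marks the first line (e.g. the CSV header) of every file.
val withOffsets = sc.hadoopFile("/data/*.csv", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

val withoutHeaders = withOffsets
  .filter { case (offset, _) => offset.get() != 0L }   // drop the first line of each file
  .map { case (_, line) => line.toString }             // copy out of Hadoop's reused Text object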


On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin r...@databricks.com wrote:

 Yes, that could work. But it is not as simple as just a binary flag.

 We might want to skip the first row for every file, or the header only for
 the first file. The former is not really supported out of the box by the
 input format I think?


 On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

  It could make sense to add a skipHeader argument to
 SparkContext.textFile?
 
 
  On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com
 wrote:
 
   If the purpose is for dropping csv headers, perhaps we don't really
 need
  a
   common drop and only one that drops the first line in a file? I'd
 really
   try hard to avoid a common drop/dropWhile because they can be expensive
  to
   do.
  
   Note that I think we will be adding this functionality (ignoring
 headers)
   to the CsvRDD functionality in Spark SQL.
https://github.com/apache/spark/pull/1351
  
  
   On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra m...@clearstorydata.com
 
   wrote:
  
You can find some of the prior, related discussion here:
https://issues.apache.org/jira/browse/SPARK-1021
   
   
On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson e...@redhat.com
  wrote:
   


 - Original Message -
  Rather than embrace non-lazy transformations and add more of
 them,
   I'd
  rather we 1) try to fully characterize the needs that are driving
   their
  creation/usage; and 2) design and implement new Spark
 abstractions
   that
  will allow us to meet those needs and eliminate existing non-lazy
  transformation.


 In the case of drop, obtaining the index of the boundary partition
  can
   be
 viewed as the action forcing compute -- one that happens to be
  invoked
 inside of a transform.  The concept of a lazy action, that is
 only
 triggered if the result rdd has compute invoked on it, might be
sufficient
 to restore laziness to the drop transform.   For that matter, I
 might
find
 some way to make use of Scala lazy values directly and achieve the
  same
 goal for drop.



  They really mess up things like creation of asynchronous
  FutureActions, job cancellation and accounting of job resource
  usage,
 etc.,
  so I'd rather we seek a way out of the existing hole rather than
  make
it
  deeper.
 
 
  On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com
 
wrote:
 
  
  
   - Original Message -
Sure, drop() would be useful, but breaking the
 transformations
   are
 lazy;
only actions launch jobs model is abhorrent -- which is not
 to
   say
 that
   we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but rather
  that
 each
   such
exception to the model is a significant source of pain that
 can
   be
 hard
   to
work with or work around.
  
   A thought that comes to my mind here is that there are in fact
already
 two
   categories of transform: ones that are truly lazy, and ones
 that
   are
 not.
A possible option is to embrace that, and commit to
 documenting
   the
 two
   categories as such, with an obvious bias towards favoring lazy
 transforms
   (to paraphrase Churchill, we're down to haggling over the
 price).
  
  
   
I really wouldn't like to see another such model-breaking
 transformation
added to the API.  On the other hand, being able to write
 transformations
with dependencies on these kind of internal jobs is
 sometimes
very
useful, so a significant reworking of Spark's Dependency
 model
   that
 would
allow for lazily running such internal jobs and making the
   results
available to subsequent stages may be something worth
 pursuing.
  
  
   This seems like a very interesting angle.   I don't have much
  feel
for
   what a solution would look like, but it sounds as if it would
   involve
   caching all operations embodied by RDD transform method code
 for
   provisional execution.  I believe that these levels of
 invocation
   are
   currently executed in the master, not executor nodes.
  
  
   
   
On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash 
   and...@andrewash.com
   wrote:
   
 Personally I'd find the method useful -- I've often had a
  .csv
file
   with a
 header row that I want to drop so filter it out, which
  touches
all
 partitions anyway.  I don't have any comments on the
implementation
   quite
 yet though

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile?
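
In the meantime, a common user-side workaround looks like the sketch below; it only
handles a header at the very start of the input (the first line of partition 0), the
path is hypothetical, and it assumes a Spark version with mapPartitionsWithIndex:

val lines = sc.textFile("/data/with_header.csv")
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // drop the header line in the first partition only
}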


On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote:

 If the purpose is for dropping csv headers, perhaps we don't really need a
 common drop and only one that drops the first line in a file? I'd really
 try hard to avoid a common drop/dropWhile because they can be expensive to
 do.

 Note that I think we will be adding this functionality (ignoring headers)
 to the CsvRDD functionality in Spark SQL.
  https://github.com/apache/spark/pull/1351


 On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

  You can find some of the prior, related discussion here:
  https://issues.apache.org/jira/browse/SPARK-1021
 
 
  On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson e...@redhat.com wrote:
 
  
  
   - Original Message -
Rather than embrace non-lazy transformations and add more of them,
 I'd
rather we 1) try to fully characterize the needs that are driving
 their
creation/usage; and 2) design and implement new Spark abstractions
 that
will allow us to meet those needs and eliminate existing non-lazy
transformation.
  
  
   In the case of drop, obtaining the index of the boundary partition can
 be
   viewed as the action forcing compute -- one that happens to be invoked
   inside of a transform.  The concept of a lazy action, that is only
   triggered if the result rdd has compute invoked on it, might be
  sufficient
   to restore laziness to the drop transform.   For that matter, I might
  find
   some way to make use of Scala lazy values directly and achieve the same
   goal for drop.
  
  
  
They really mess up things like creation of asynchronous
FutureActions, job cancellation and accounting of job resource usage,
   etc.,
so I'd rather we seek a way out of the existing hole rather than make
  it
deeper.
   
   
On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com
  wrote:
   


 - Original Message -
  Sure, drop() would be useful, but breaking the transformations
 are
   lazy;
  only actions launch jobs model is abhorrent -- which is not to
 say
   that
 we
  haven't already broken that model for useful operations (cf.
  RangePartitioner, which is used for sorted RDDs), but rather that
   each
 such
  exception to the model is a significant source of pain that can
 be
   hard
 to
  work with or work around.

 A thought that comes to my mind here is that there are in fact
  already
   two
 categories of transform: ones that are truly lazy, and ones that
 are
   not.
  A possible option is to embrace that, and commit to documenting
 the
   two
 categories as such, with an obvious bias towards favoring lazy
   transforms
 (to paraphrase Churchill, we're down to haggling over the price).


 
  I really wouldn't like to see another such model-breaking
   transformation
  added to the API.  On the other hand, being able to write
   transformations
  with dependencies on these kind of internal jobs is sometimes
  very
  useful, so a significant reworking of Spark's Dependency model
 that
   would
  allow for lazily running such internal jobs and making the
 results
  available to subsequent stages may be something worth pursuing.


 This seems like a very interesting angle.   I don't have much feel
  for
 what a solution would look like, but it sounds as if it would
 involve
 caching all operations embodied by RDD transform method code for
 provisional execution.  I believe that these levels of invocation
 are
 currently executed in the master, not executor nodes.


 
 
  On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash 
 and...@andrewash.com
 wrote:
 
   Personally I'd find the method useful -- I've often had a .csv
  file
 with a
   header row that I want to drop so filter it out, which touches
  all
   partitions anyway.  I don't have any comments on the
  implementation
 quite
   yet though.
  
  
   On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson 
 e...@redhat.com
 wrote:
  
A few weeks ago I submitted a PR for supporting rdd.drop(n),
   under
SPARK-2315:
https://issues.apache.org/jira/browse/SPARK-2315
   
Supporting the drop method would make some operations
  convenient,
 however
    it forces computation of >= 1 partition of the parent RDD,
 and
   so it
   would
behave like a partial action that returns an RDD as the
  result.
   
I wrote up a discussion of these trade-offs here:
   
   
  

  
 
 http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
   
  
 

   
  
 



Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
/YarnAllocationHandler.scala:508:
  not found: type ContainerRequest
 
  [error] ): ArrayBuffer[ContainerRequest] = {
 
  [error]^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:446:
  not found: type ContainerRequest
 
  [error] val hostContainerRequests = new
  ArrayBuffer[ContainerRequest](preferredHostToCount.size)
 
  [error] ^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:458:
  not found: type ContainerRequest
 
  [error] val rackContainerRequests: List[ContainerRequest] =
  createRackResourceRequests(
 
  [error] ^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:467:
  not found: type ContainerRequest
 
  [error] val containerRequestBuffer = new
  ArrayBuffer[ContainerRequest](
 
  [error]  ^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:542:
  not found: type ContainerRequest
 
  [error] ): ArrayBuffer[ContainerRequest] = {
 
  [error]^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:545:
  value newInstance is not a member of object
  org.apache.hadoop.yarn.api.records.Resource
 
  [error] val resource = Resource.newInstance(memoryRequest,
  executorCores)
 
  [error] ^
 
  [error]
 
 /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:550:
  not found: type ContainerRequest
 
  [error] val requests = new ArrayBuffer[ContainerRequest]()
 
  [error]^
 
  [error] 40 errors found
 
  [error] (yarn-stable/compile:compile) Compilation failed
 
  [error] Total time: 98 s, completed Jul 16, 2014 5:14:44 PM
 
 
 
 
 
 
 
 
 
 
 
 
 
  On Wed, Jul 16, 2014 at 4:19 PM, Sandy Ryza sandy.r...@cloudera.com
  wrote:
 
  Hi Ron,
 
  I just checked and this bug is fixed in recent releases of Spark.
 
  -Sandy
 
 
  On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com
  wrote:
 
  Ron,
  Which distribution and Version of Hadoop are you using ?
 
   I just looked at CDH5 (  hadoop-mapreduce-client-core-
  2.3.0-cdh5.0.0),
 
  MRJobConfig does have the field :
 
  java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
 
  Chester
 
 
 
  On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez zlgonza...@yahoo.com
  wrote:
 
  Hi,
I was doing programmatic submission of Spark yarn jobs and I saw
  code in ClientBase.getDefaultYarnApplicationClasspath():
 
  val field =
  classOf[MRJobConfig].getField(DEFAULT_YARN_APPLICATION_CLASSPATH)
  MRJobConfig doesn't have this field so the created launch env is
  incomplete. Workaround is to set yarn.application.classpath with
 the value
  from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.
 
  This results in having the spark job hang if the submission config
 is
  different from the default config. For example, if my resource
 manager port
  is 8050 instead of 8030, then the spark app is not able to register
 itself
  and stays in ACCEPTED state.
 
  I can easily fix this by changing this to YarnConfiguration instead
 of
  MRJobConfig but was wondering what the steps are for submitting a
 fix.
 
  Thanks,
  Ron
 
  Sent from my iPhone
 
 
 
 
 
 



Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen,
Often the shuffle is bound by writes to disk, so even if disks have enough
space to store the uncompressed data, the shuffle can complete faster by
writing less data.

Reynold,
This isn't a big help in the short term, but if we switch to a sort-based
shuffle, we'll only need a single LZFOutputStream per map task.
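
For reference, the knobs this thread is discussing look roughly like the following
sketch (defaults and available codecs vary by release; the values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")          // compress map outputs written during the shuffle
  .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles
  .set("spark.io.compression.codec", "lzf")       // codec used for the above; "snappy" is another option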


On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman 
stephen.haber...@gmail.com wrote:


 Just a comment from the peanut gallery, but these buffers are a real
 PITA for us as well. Probably 75% of our non-user-error job failures
 are related to them.

 Just naively, what about not doing compression on the fly? E.g. during
 the shuffle just write straight to disk, uncompressed?

 For us, we always have plenty of disk space, and if you're concerned
 about network transmission, you could add a separate compress step
 after the blocks have been written to disk, but before being sent over
 the wire.

 Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
 see work in this area!

 - Stephen




Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot!


On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com
wrote:

 Just a heads up, we merged Prashant's work on having the sbt build read all
 dependencies from Maven. Please report any issues you find on the dev list
 or on JIRA.

 One note here for developers, going forward the sbt build will use the same
 configuration style as the maven build (-D for options and -P for maven
 profiles). So this will be a change for developers:

 sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

 For now, we'll continue to support the old env-var options with a
 deprecation warning.

 - Patrick



Re: Data Locality In Spark

2014-07-08 Thread Sandy Ryza
Hi Anish,

Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
and racks that the input blocks reside on.

-Sandy
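
A small sketch of how that locality shows up in the API (the HDFS path is
hypothetical; `sc` is assumed):

val rdd = sc.textFile("hdfs:///data/events")

// Preferred hosts per partition, derived from the HDFS block locations; the
// scheduler tries these before falling back to rack-local or arbitrary hosts
// (spark.locality.wait controls how long it waits at each level).
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}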


On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in 
anishs...@yahoo.co.in wrote:

 Hi All

  My apologies for a very basic question: do we have full support for data
  locality in Spark MapReduce?

 Please suggest.

 --
 Anish Sneh
 Experience is the best teacher.
 http://in.linkedin.com/in/anishsneh




Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me.  While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and APIs as
reasonable.
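
To make "common framework" a bit more concrete, one possible shape for shared
clustering abstractions is sketched below; every name here is hypothetical and none
of them exist in MLlib:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

trait ClusteringModel extends Serializable {
  def predict(point: Vector): Int                                      // index of the assigned cluster
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}

trait ClusteringAlgorithm[M <: ClusteringModel] extends Serializable {
  def k: Int                                                           // number of clusters to produce
  def run(data: RDD[Vector]): M
}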


On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks, Hector! Your feedback is useful.

 On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
  javascript:; wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation, KMeans.
   It would benefit from having implementations of other clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch KMeans implementation,
   and I saw an email on this list about interest in implementing Fuzzy
   C-Means.
  
   Based on Sean Owen's review of my MiniBatch KMeans code, it became
   apparent that before I implement more clustering algorithms, it would
   be useful to hammer out a framework to reduce code duplication and
   implement a consistent API.
  
   I'd like to gauge the interest and goals of the MLlib community:
  
   1. Are you interested in having more clustering algorithms available?
  
   2. Is the community interested in specifying a common framework?
  
   Thanks!
   RJ
  
   [1] - https://github.com/apache/spark/pull/1248
  
  
   --
   em rnowl...@gmail.com javascript:;
   c 954.496.2314
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 


 --
 em rnowl...@gmail.com
 c 954.496.2314



Re: Contributing to MLlib on GLM

2014-06-17 Thread Sandy Ryza
Hi Xiaokai,

I think MLLib is definitely interested in supporting additional GLMs.  I'm
not aware of anybody working on this at the moment.

-Sandy
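
For context on what adding such a GLM involves, the core new piece is the loss
gradient. A sketch of the per-example Poisson-regression gradient (log link) in plain
Scala; the function name is hypothetical and this does not depend on MLlib's
optimizer API:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Negative log-likelihood for one example: exp(w.x) - y * (w.x) + const
// Gradient with respect to w: (exp(w.x) - y) * x
def poissonGradient(features: Vector, label: Double, weights: Vector): Vector = {
  val margin = features.toArray.zip(weights.toArray).map { case (x, w) => x * w }.sum
  val err = math.exp(margin) - label
  Vectors.dense(features.toArray.map(_ * err))
}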


On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei x...@palantir.com wrote:

 Hi,

 I am an intern at PalantirTech and we are building some stuff on top of
  MLlib. In particular, GLM is of great interest to us. Though
  GeneralizedLinearModel in MLlib 1.0.0 covers some important GLMs such as
  logistic regression and linear regression, other important GLMs like
  Poisson regression are still missing.

 I am curious that if anyone is already working on other GLMs (e.g.
 Poisson, Gamma). If not, we would like to contribute to MLlib on GLM. Is
 adding more GLMs on the roadmap of MLlib?


 Sincerely,

 Xiaokai



Re: Please change instruction about Launching Applications Inside the Cluster

2014-05-30 Thread Sandy Ryza
They should be - in the sense that the docs now recommend using
spark-submit and thus include entirely different invocations.


On Fri, May 30, 2014 at 12:46 AM, Reynold Xin r...@databricks.com wrote:

 Can you take a look at the latest Spark 1.0 docs and see if they are fixed?

 https://github.com/apache/spark/tree/master/docs

 Thanks.


 On Thu, May 29, 2014 at 5:29 AM, Lizhengbing (bing, BIPA) 
 zhengbing...@huawei.com wrote:

  The instruction address is in
 
 http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster
  or
 
 http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster
 
  Origin instruction is:
  ./bin/spark-class org.apache.spark.deploy.Client launch
 [client-options] \
 cluster-url application-jar-url main-class \
 [application-options] 
 
  If I follow this instruction, my program deployed on a Spark standalone
  cluster will not run properly.
 
  Based on source code, This instruction should be changed to
  ./bin/spark-class org.apache.spark.deploy.Client [client-options]
 launch \
 cluster-url application-jar-url main-class \
 [application-options] 
 
  That is to say: [client-options] must be put ahead of launch
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Sandy Ryza
+1


On Mon, May 26, 2014 at 7:38 AM, Tathagata Das
tathagata.das1...@gmail.comwrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
  SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:

 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:

 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:

 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:

 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior



Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
This will solve the issue for jars added upon application submission, but,
on top of this, we need to make sure that anything dynamically added
through sc.addJar works as well.

To do so, we need to make sure that any jars retrieved via the driver's
HTTP server are loaded by the same classloader that loads the jars given on
app submission.  To achieve this, we need to either use the same
classloader for both system jars and user jars, or make sure that the user
jars given on app submission are under the same classloader used for
dynamically added jars.
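
A minimal standalone sketch of the delegation behaviour at issue (the jar and class
names are hypothetical; primary.jar stands in for the app jar on the system
classpath, secondary.jar for a dynamically added jar):

import java.io.File
import java.net.URLClassLoader

val parent = new URLClassLoader(Array(new File("primary.jar").toURI.toURL), null)
val child  = new URLClassLoader(Array(new File("secondary.jar").toURI.toURL), parent)

// A class found via delegation is owned by the parent loader...
val a = child.loadClass("com.example.A")
// ...so any plain `new B()` inside A is resolved by the parent, which cannot see
// secondary.jar, and fails even though `child` could have loaded B.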

On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com wrote:

 Talked with Sandy and DB offline. I think the best solution is sending
 the secondary jars to the distributed cache of all containers rather
 than just the master, and set the classpath to include spark jar,
 primary app jar, and secondary jars before executor starts. In this
 way, user only needs to specify secondary jars via --jars instead of
 calling sc.addJar inside the code. It also solves the scalability
 problem of serving all the jars via http.

 If this solution sounds good, I can try to make a patch.

 Best,
 Xiangrui

 On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu wrote:
  In 1.0, there is a new option for users to choose which classloader has
  higher priority via spark.files.userClassPathFirst, I decided to submit
 the
  PR for 0.9 first. We use this patch in our lab and we can use those jars
  added by sc.addJar without reflection.
 
  https://github.com/apache/spark/pull/834
 
  Can anyone comment if it's a good approach?
 
  Thanks.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Mon, May 19, 2014 at 7:42 PM, DB Tsai dbt...@stanford.edu wrote:
 
  Good summary! We fixed it in branch 0.9 since our production is still in
  0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
  tonight.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  It just hit me why this problem is showing up on YARN and not on
  standalone.
 
  The relevant difference between YARN and standalone is that, on YARN,
 the
  app jar is loaded by the system classloader instead of Spark's custom
 URL
  classloader.
 
  On YARN, the system classloader knows about [the classes in the spark
  jars,
  the classes in the primary app jar].   The custom classloader knows
 about
  [the classes in secondary app jars] and has the system classloader as
 its
  parent.
 
  A few relevant facts (mostly redundant with what Sean pointed out):
  * Every class has a classloader that loaded it.
  * When an object of class B is instantiated inside of class A, the
  classloader used for loading B is the classloader that was used for
  loading
  A.
  * When a classloader fails to load a class, it lets its parent
 classloader
  try.  If its parent succeeds, its parent becomes the classloader that
  loaded it.
 
  So suppose class B is in a secondary app jar and class A is in the
 primary
  app jar:
  1. The custom classloader will try to load class A.
  2. It will fail, because it only knows about the secondary jars.
  3. It will delegate to its parent, the system classloader.
  4. The system classloader will succeed, because it knows about the
 primary
  app jar.
  5. A's classloader will be the system classloader.
  6. A tries to instantiate an instance of class B.
  7. B will be loaded with A's classloader, which is the system
 classloader.
  8. Loading B will fail, because A's classloader, which is the system
  classloader, doesn't know about the secondary app jars.
 
  In Spark standalone, A and B are both loaded by the custom
 classloader, so
  this issue doesn't come up.
 
  -Sandy
 
  On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Having a user add define a custom class inside of an added jar and
   instantiate it directly inside of an executor is definitely supported
   in Spark and has been for a really long time (several years). This is
   something we do all the time in Spark.
  
   DB - I'd hold off on a re-architecting of this until we identify
   exactly what is causing the bug you are running into.
  
   In a nutshell, when the bytecode new Foo() is run on the executor,
   it will ask the driver for the class over HTTP using a custom
   classloader. Something in that pipeline is breaking here, possibly
   related to the YARN deployment stuff.
  
  
   On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com
 wrote:
     I don't think a custom classloader is necessary.
   
Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
 etc
all run custom user code that creates new user objects without

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
Is that an assumption we can make?  I think we'd run into an issue in this
situation:

*In primary jar:*
def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()

*In app code:*
sc.addJar("dynamicjar.jar")
...
rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))

It might be fair to say that the user should make sure to use the context
classloader when instantiating dynamic classes, but I think it's weird that
this code would work on Spark standalone but not on YARN.

-Sandy
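
Concretely, the context-classloader suggestion would turn the helper above into
something like this sketch (the classes involved are still the hypothetical ones
from the example):

def makeDynamicObject(clazz: String): AnyRef = {
  // On executors, the context classloader is the one that knows about jars added
  // via sc.addJar, which is what made the reflection approach in this thread work.
  val loader = Thread.currentThread.getContextClassLoader
  loader.loadClass(clazz).newInstance().asInstanceOf[AnyRef]
}

// rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))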


On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote:

 I think adding jars dynamically should work as long as the primary jar
 and the secondary jars do not depend on dynamically added jars, which
 should be the correct logic. -Xiangrui

 On Wed, May 21, 2014 at 1:40 PM, DB Tsai dbt...@stanford.edu wrote:
  This will be another separate story.
 
  Since in the YARN deployment, as Sandy said, the app jar will always be in the
  system classloader, any object instantiated from the app jar will be resolved
  against the system classloader rather than the custom one. As a result, the
  custom classloader on YARN will never work without specifically using
  reflection.
 
  The solution would be to not use the system classloader in the classloader
  hierarchy, and instead add all the resources from the system one into the
  custom one. This is the approach Tomcat takes.
 
  Or we can directly overwrite the system classloader by calling the protected
  method `addURL`, which will not work and will throw an exception if the code is
  wrapped in a security manager.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  This will solve the issue for jars added upon application submission,
 but,
  on top of this, we need to make sure that anything dynamically added
  through sc.addJar works as well.
 
  To do so, we need to make sure that any jars retrieved via the driver's
  HTTP server are loaded by the same classloader that loads the jars
 given on
  app submission.  To achieve this, we need to either use the same
  classloader for both system jars and user jars, or make sure that the
 user
  jars given on app submission are under the same classloader used for
  dynamically added jars.
 
  On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com
 wrote:
 
   Talked with Sandy and DB offline. I think the best solution is sending
   the secondary jars to the distributed cache of all containers rather
   than just the master, and set the classpath to include spark jar,
   primary app jar, and secondary jars before executor starts. In this
   way, user only needs to specify secondary jars via --jars instead of
   calling sc.addJar inside the code. It also solves the scalability
   problem of serving all the jars via http.
  
   If this solution sounds good, I can try to make a patch.
  
   Best,
   Xiangrui
  
   On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu
 wrote:
In 1.0, there is a new option for users to choose which classloader
 has
higher priority via spark.files.userClassPathFirst, I decided to
 submit
   the
PR for 0.9 first. We use this patch in our lab and we can use those
  jars
added by sc.addJar without reflection.
   
https://github.com/apache/spark/pull/834
   
Can anyone comment if it's a good approach?
   
Thanks.
   
   
Sincerely,
   
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
   
   
On Mon, May 19, 2014 at 7:42 PM, DB Tsai dbt...@stanford.edu
 wrote:
   
Good summary! We fixed it in branch 0.9 since our production is
 still
  in
0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
 1.0
tonight.
   
   
Sincerely,
   
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
   
   
On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza 
 sandy.r...@cloudera.com
   wrote:
   
It just hit me why this problem is showing up on YARN and not on
standalone.
   
The relevant difference between YARN and standalone is that, on
 YARN,
   the
app jar is loaded by the system classloader instead of Spark's
 custom
   URL
classloader.
   
On YARN, the system classloader knows about [the classes in the
 spark
jars,
the classes in the primary app jar].   The custom classloader
 knows
   about
[the classes in secondary app jars] and has the system
 classloader as
   its
parent.
   
A few relevant facts (mostly redundant with what Sean pointed
 out):
* Every class has a classloader that loaded it.
* When an object of class B is instantiated inside of class A, the
classloader used for loading B

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1


On Tue, May 20, 2014 at 5:26 PM, Andrew Or and...@databricks.com wrote:

 +1


 2014-05-20 13:13 GMT-07:00 Tathagata Das tathagata.das1...@gmail.com:

  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
 
  This has a few bug fixes on top of rc9:
  SPARK-1875: https://github.com/apache/spark/pull/824
  SPARK-1876: https://github.com/apache/spark/pull/819
  SPARK-1878: https://github.com/apache/spark/pull/822
  SPARK-1879: https://github.com/apache/spark/pull/823
 
  The tag to be voted on is v1.0.0-rc10 (commit d8070234):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10/
 
  The release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1018/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 
  The full list of changes in this release can be found at:
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Friday, May 23, at 20:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  Changes to ML vector specification:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10
 
  Changes to the Java API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  Changes to the streaming API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  Changes to the GraphX API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  Other changes:
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior
 



Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sandy Ryza
 or Tomcat where your app is
   loaded in a child ClassLoader, and you reference a class that Hadoop
   or Tomcat also has (like a lib class) you will get the container's
   version!)
  
   When you load an external JAR it has a separate ClassLoader which
 does
   not necessarily bear any relation to the one containing your app
   classes, so yeah it is not generally going to make new Foo work.
  
   Reflection lets you pick the ClassLoader, yes.
  
   I would not call setContextClassLoader.
  
   On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
 sandy.r...@cloudera.com
   wrote:
I spoke with DB offline about this a little while ago and he
 confirmed
   that
he was able to access the jar from the driver.
   
The issue appears to be a general Java issue: you can't directly
instantiate a class from a dynamically loaded jar.
   
I reproduced it locally outside of Spark with:
---
URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
 new
File("myotherjar.jar").toURI().toURL() }, null);
Thread.currentThread().setContextClassLoader(urlClassLoader);
MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
---
   
I was able to load the class with reflection.
  
 



Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
Hey Xiangrui,

If the jars are placed in the distributed cache and loaded statically, as
the primary app jar is in YARN, then it shouldn't be an issue.  Other jars,
however, including additional jars that are sc.addJar'd and jars specified
with the spark-submit --jars argument, are loaded dynamically by executors
with a URLClassLoader.  These jars aren't next to the executors when they
start - the executors fetch them from the driver's HTTP server.


On Sun, May 18, 2014 at 4:05 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Sandy,

 It is hard to imagine that a user needs to create an object in that
 way. Since the jars are already in distributed cache before the
 executor starts, is there any reason we cannot add the locally cached
 jars to classpath directly?

 Best,
 Xiangrui

 On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  I spoke with DB offline about this a little while ago and he confirmed
 that
  he was able to access the jar from the driver.
 
  The issue appears to be a general Java issue: you can't directly
  instantiate a class from a dynamically loaded jar.
 
  I reproduced it locally outside of Spark with:
  ---
  URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
  File("myotherjar.jar").toURI().toURL() }, null);
  Thread.currentThread().setContextClassLoader(urlClassLoader);
  MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
  ---
 
  I was able to load the class with reflection.
 
 
 
  On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  @db - it's possible that you aren't including the jar in the classpath
  of your driver program (I think this is what mridul was suggesting).
  It would be helpful to see the stack trace of the CNFE.
 
  - Patrick
 
  On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   @xiangrui - we don't expect these to be present on the system
   classpath, because they get dynamically added by Spark (e.g. your
   application can call sc.addJar well after the JVM's have started).
  
   @db - I'm pretty surprised to see that behavior. It's definitely not
   intended that users need reflection to instantiate their classes -
   something odd is going on in your case. If you could create an
   isolated example and post it to the JIRA, that would be great.
  
   On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com
 wrote:
   I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
  
   DB, could you add more info to that JIRA? Thanks!
  
   -Xiangrui
  
   On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com
  wrote:
   Btw, I tried
  
    rdd.map { i =>
      System.getProperty("java.class.path")
    }.collect()
  
   but didn't see the jars added via --jars on the executor
 classpath.
  
   -Xiangrui
  
   On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com
  wrote:
   I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
   reflection approach mentioned by DB didn't work either. I checked
 the
   distributed cache on a worker node and found the jar there. It is
 also
   in the Environment tab of the WebUI. The workaround is making an
   assembly jar.
  
   DB, could you create a JIRA and describe what you have found so
 far?
  Thanks!
  
   Best,
   Xiangrui
  
   On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan 
  mri...@gmail.com wrote:
   Can you try moving your mapPartitions to another class/object
 which
  is
   referenced only after sc.addJar ?
  
   I would suspect CNFEx is coming while loading the class containing
   mapPartitions before addJars is executed.
  
   In general though, dynamic loading of classes means you use
  reflection to
   instantiate it since expectation is you don't know which
  implementation
   provides the interface ... If you statically know it apriori, you
  bundle it
   in your classpath.
  
   Regards
   Mridul
   On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote:
  
    Finally found a way out of the ClassLoader maze! It took me some time to
    understand how it works; I think it is worth documenting in a separate
    thread.
  
   We're trying to add external utility.jar which contains
  CSVRecordParser,
   and we added the jar to executors through sc.addJar APIs.
  
   If the instance of CSVRecordParser is created without
 reflection, it
   raises *ClassNotFound
   Exception*.
  
    data.mapPartitions(lines => {
    val csvParser = new CSVRecordParser(delimiter.charAt(0))
    lines.foreach(line => {
 val lineElems = csvParser.parseLine(line)
   })
   ...
   ...
)
  
  
   If the instance of CSVRecordParser is created through reflection,
  it works.
  
    data.mapPartitions(lines => {
    val loader = Thread.currentThread.getContextClassLoader
    val CSVRecordParser =
    loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
  
   val csvParser =
 CSVRecordParser.getConstructor(Character.TYPE

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Sandy Ryza
+1 (non-binding)

* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client
mode and yarn-cluster mode.


On Tue, May 13, 2014 at 9:09 PM, witgo wi...@qq.com wrote:

 You need to set:
 spark.akka.frameSize 5
 spark.default.parallelism1





 -- Original --
 From:  Madhu;ma...@madhu.com;
 Date:  Wed, May 14, 2014 09:15 AM
 To:  devd...@spark.incubator.apache.org;

 Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



 I just built rc5 on Windows 7 and tried to reproduce the problem described
 in

 https://issues.apache.org/jira/browse/SPARK-1712

 It works on my machine:

 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at console:17) finished
 in 4.548 s
 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
 have all completed, from pool
 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at console:17,
 took
 4.814991993 s
 res1: Double = 5.05E11

 I used all defaults, no config files were changed.
 Not sure if that makes a difference...



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 .


Re: Apache Spark running out of the spark shell

2014-05-03 Thread Sandy Ryza
Hi AJ,

You might find this helpful -
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/

-Sandy


On Sat, May 3, 2014 at 8:42 AM, Ajay Nair prodig...@gmail.com wrote:

 Hi,

  I have written code that works just fine in the Spark shell on EC2.
  The ec2 script helped me configure my master and worker nodes. Now I want to
  run the Scala Spark code outside the interactive shell. How do I go about
  doing that?

 I was referring to the instructions mentioned here:
 https://spark.apache.org/docs/0.9.1/quick-start.html

  But this is confusing because it mentions a simple project jar file
  which I am not sure how to generate. I only have the file that runs directly
  in my Spark shell. Any easy instructions to get this quickly running as a
  job?

 Thanks
 AJ



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-running-out-of-the-spark-shell-tp6459.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: Any plans for new clustering algorithms?

2014-04-22 Thread Sandy Ryza
Thanks Matei.  I added a section to the How to contribute page.


On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia matei.zaha...@gmail.comwrote:

 The wiki is actually maintained separately in
 https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We
 restricted editing of the wiki because bots would automatically add stuff.
 I've given you permissions now.

 Matei

 On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  I thought those are files of spark.apache.org?
 
  --
  Nan Zhu
 
 
  On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
 
  The markdown files are under spark/docs. You can submit a PR for
  changes. -Xiangrui
 
  On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza 
  sandy.r...@cloudera.com(mailto:
 sandy.r...@cloudera.com) wrote:
  How do I get permissions to edit the wiki?
 
 
  On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com(mailto:
 men...@gmail.com) wrote:
 
  Cannot agree more with your words. Could you add one section about
  how and what to contribute to MLlib's guide? -Xiangrui
 
  On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
  nick.pentre...@gmail.com (mailto:nick.pentre...@gmail.com) wrote:
  I'd say a section in the how to contribute page would be a good
 place
 
  to put this.
 
  In general I'd say that the criteria for inclusion of an algorithm
 is it
  should be high quality, widely known, used and accepted (citations and
  concrete use cases as examples of this), scalable and parallelizable,
 well
  documented and with reasonable expectation of dev support
 
  Sent from my iPhone
 
  On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com(mailto:
 sandy.r...@cloudera.com) wrote:
 
  If it's not done already, would it make sense to codify this
 philosophy
  somewhere? I imagine this won't be the first time this discussion
 comes
  up, and it would be nice to have a doc to point to. I'd be happy to
 
 
 
 
  take a
  stab at this.
 
 
  On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng 
  men...@gmail.com(mailto:
 men...@gmail.com)
  wrote:
 
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design
 scalable.
  For example, think about current ProblemWithAlgorithm naming
 scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset
 but
  not tested widely. You see the change log from Mahout:
 
  ===
  The following algorithms that were marked deprecated in 0.8 have
 been
  removed in 0.9:
 
  From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and
 lack of
  usage
 
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
 
  Collaborative Filtering
  SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
 
  Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
 
  In MLlib, we should include the algorithms users know how to use
 and
  we can provide support rather than letting algorithms come and go.
 
  My $0.02,
  Xiangrui
 
  On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen 
  so...@cloudera.com(mailto:
 so...@cloudera.com)
  wrote:
  On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown 
  p...@mult.ifario.us(mailto:
 p...@mult.ifario.us)
 
 
 
 
 
  wrote:
  - MLlib as Mahout.next would be a unfortunate. There are some
 gems
 
 
 
 
 
  in
  Mahout, but there are also lots of rocks. Setting a minimal bar
 of
  working, correctly implemented, and documented requires a
 surprising
 
 
 
  amount
  of work.
 
 
  As someone with first-hand knowledge, this is correct. To Sang's
  question, I can't see value in 'porting' Mahout since it is based
 on a
  quite different paradigm. About the only part that translates is
 the
  algorithm concept itself.
 
  This is also the cautionary tale. The contents of the project have
  ended up being a number of drive-by contributions of
 implementations
  that, while individually perhaps brilliant (perhaps), didn't
  necessarily match any other implementation in structure,
 input/output,
  libraries used. The implementations were often a touch academic.
 The
  result was hard to document, maintain, evolve or use.
 
  Far more of the structure of the MLlib implementations are
 consistent
  by virtue of being built around Spark core already. That's great.
 
  One can't wait to completely build the foundation before building
 any
  implementations. To me, the existing implementations are almost
  exactly the basics I

Re: all values for a key must fit in memory

2014-04-21 Thread Sandy Ryza
Thanks Matei and Mridul - was basically wondering whether we would be able
to change the shuffle to accommodate this after 1.0, and from your answers
it sounds like we can.
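
For readers hitting the limitation discussed here, the usual user-side mitigation is
to aggregate rather than group, so no single (key, all-values) pair ever has to be
materialized at once; a small sketch with hypothetical data:

val events = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey would buffer every value for a key together in memory:
// val grouped = events.groupByKey()

// reduceByKey combines values map-side and never holds all of them at once:
val totals = events.reduceByKey(_ + _)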


On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan mri...@gmail.comwrote:

 As Matei mentioned, the values are now an Iterable, which can be disk-backed.
 Does that not address the concern?

 @Patrick - we do have cases where the length of the sequence is large
 and size per value is also non trivial : so we do need this :-)
 Note that join is a trivial example where this is required (in our
 current implementation).

 Regards,
 Mridul

 On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  The issue isn't that the Iterator[P] can't be disk-backed.  It's that,
 with
  a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
  into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
 
  On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  An iterator does not imply data has to be memory resident.
  Think merge sort output as an iterator (disk backed).
 
  Tom is actually planning to work on something similar with me on this
  hopefully this or next month.
 
  Regards,
  Mridul
 
 
  On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza sandy.r...@cloudera.com
  wrote:
   Hey all,
  
   After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
  key
   to not all fit in memory.  The current ShuffleFetcher.fetch API, which
   doesn't distinguish between keys and values, only returning an
  Iterator[P],
   seems incompatible with this.
  
   Any thoughts on how we could achieve parity here?
  
   -Sandy
 



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
How do I get permissions to edit the wiki?


On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:

 Cannot agree more with your words. Could you add one section about
 how and what to contribute to MLlib's guide? -Xiangrui

 On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I'd say a section in the how to contribute page would be a good place
 to put this.
 
  In general I'd say that the criteria for inclusion of an algorithm is it
 should be high quality, widely known, used and accepted (citations and
 concrete use cases as examples of this), scalable and parallelizable, well
 documented and with reasonable expectation of dev support
 
  Sent from my iPhone
 
  On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  If it's not done already, would it make sense to codify this philosophy
  somewhere?  I imagine this won't be the first time this discussion comes
  up, and it would be nice to have a doc to point to.  I'd be happy to
 take a
  stab at this.
 
 
  On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design scalable.
  For example, think about current ProblemWithAlgorithm naming scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset but
  not tested widely. You see the change log from Mahout:
 
  ===
  The following algorithms that were marked deprecated in 0.8 have been
  removed in 0.9:
 
  From Clustering:
   Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and lack of
  usage
 
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
 
  Collaborative Filtering
 SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
 
  Mahout Math
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
 
  In MLlib, we should include the algorithms users know how to use and
  we can provide support rather than letting algorithms come and go.
 
  My $0.02,
  Xiangrui
 
  On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com
 wrote:
  On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
 wrote:
   - MLlib as Mahout.next would be unfortunate.  There are some gems in
   Mahout, but there are also lots of rocks.  Setting a minimal bar of
  working, correctly implemented, and documented requires a surprising
  amount
  of work.
 
  As someone with first-hand knowledge, this is correct. To Sang's
  question, I can't see value in 'porting' Mahout since it is based on a
  quite different paradigm. About the only part that translates is the
  algorithm concept itself.
 
  This is also the cautionary tale. The contents of the project have
  ended up being a number of drive-by contributions of implementations
  that, while individually perhaps brilliant (perhaps), didn't
  necessarily match any other implementation in structure, input/output,
  libraries used. The implementations were often a touch academic. The
  result was hard to document, maintain, evolve or use.
 
   Far more of the structure of the MLlib implementations is consistent
   by virtue of being built around Spark core already. That's great.
 
  One can't wait to completely build the foundation before building any
  implementations. To me, the existing implementations are almost
  exactly the basics I would choose. They cover the bases and will
  exercise the abstractions and structure. So that's also great IMHO.
 



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
I thought this might be a good thing to add to the wiki's How to contribute
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark),
as it's not tied to a release.


On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng men...@gmail.com wrote:

 The markdown files are under spark/docs. You can submit a PR for
 changes. -Xiangrui

 On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  How do I get permissions to edit the wiki?
 
 
  On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:
 
   Couldn't agree more. Could you add a section about
   how and what to contribute to MLlib's guide? -Xiangrui
 
  On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
  nick.pentre...@gmail.com wrote:
   I'd say a section in the how to contribute page would be a good
 place
  to put this.
  
   In general I'd say that the criteria for inclusion of an algorithm is
 it
  should be high quality, widely known, used and accepted (citations and
  concrete use cases as examples of this), scalable and parallelizable,
 well
  documented and with reasonable expectation of dev support
  
   Sent from my iPhone
  
   On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  
   If it's not done already, would it make sense to codify this
 philosophy
   somewhere?  I imagine this won't be the first time this discussion
 comes
   up, and it would be nice to have a doc to point to.  I'd be happy to
  take a
   stab at this.
  
  
   On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com
  wrote:
  
   +1 on Sean's comment. MLlib covers the basic algorithms but we
   definitely need to spend more time on how to make the design
 scalable.
   For example, think about current ProblemWithAlgorithm naming
 scheme.
    That being said, new algorithms are welcome. I hope they are
   well-established and well-understood by users. They shouldn't be
   research algorithms tuned to work well with a particular dataset but
   not tested widely. You see the change log from Mahout:
  
   ===
   The following algorithms that were marked deprecated in 0.8 have
 been
   removed in 0.9:
  
   From Clustering:
Switched LDA implementation from using Gibbs Sampling to Collapsed
   Variational Bayes (CVB)
   Meanshift
   MinHash - removed due to poor performance, lack of support and lack
 of
   usage
  
   From Classification (both are sequential implementations)
   Winnow - lack of actual usage and support
   Perceptron - lack of actual usage and support
  
   Collaborative Filtering
  SlopeOne implementations in
   org.apache.mahout.cf.taste.hadoop.slopeone and
   org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
   org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
   org.apache.mahout.cf.taste.impl.recommender
  
   Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
   ===
  
   In MLlib, we should include the algorithms users know how to use and
   we can provide support rather than letting algorithms come and go.
  
   My $0.02,
   Xiangrui
  
   On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com
  wrote:
   On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
  wrote:
    - MLlib as Mahout.next would be unfortunate.  There are some
 gems
  in
   Mahout, but there are also lots of rocks.  Setting a minimal bar
 of
   working, correctly implemented, and documented requires a
 surprising
   amount
   of work.
  
   As someone with first-hand knowledge, this is correct. To Sang's
   question, I can't see value in 'porting' Mahout since it is based
 on a
   quite different paradigm. About the only part that translates is
 the
   algorithm concept itself.
  
   This is also the cautionary tale. The contents of the project have
   ended up being a number of drive-by contributions of
 implementations
   that, while individually perhaps brilliant (perhaps), didn't
   necessarily match any other implementation in structure,
 input/output,
   libraries used. The implementations were often a touch academic.
 The
   result was hard to document, maintain, evolve or use.
  
    Far more of the structure of the MLlib implementations is
 consistent
   by virtue of being built around Spark core already. That's great.
  
   One can't wait to completely build the foundation before building
 any
   implementations. To me, the existing implementations are almost
   exactly the basics I would choose. They cover the bases and will
   exercise the abstractions and structure. So that's also great IMHO.
  
 



Re: all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
The issue isn't that the Iterator[P] can't be disk-backed.  It's that, with
a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
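As a purely illustrative sketch of the shape of the problem (written in local mode against the public RDD API, not the ShuffleFetcher internals): the post-shuffle iterator can be consumed lazily, but with groupByKey each element is one (key, values) record, so a skewed key drags all of its values into memory together, whereas a map-side combine such as reduceByKey keeps each record small.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits for older Spark versions

    object GroupBySkewSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("groupBy-skew-sketch").setMaster("local[2]"))

        // A skewed dataset: one "hot" key owns almost every value.
        val skewed = sc.parallelize(1 to 1000000)
          .map(i => (if (i % 100 == 0) s"key-$i" else "hot-key", i))

        // Each post-shuffle record is (key, all values for that key), so the
        // hot key's values are materialized together inside a single record.
        skewed.groupByKey().foreach { case (k, vs) => println(s"$k -> ${vs.size} values") }

        // With a map-side combine, the per-key record is just a single number.
        skewed.reduceByKey(_ + _).foreach(println)

        sc.stop()
      }
    }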

On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan mri...@gmail.com wrote:

 An iterator does not imply data has to be memory resident.
 Think merge sort output as an iterator (disk backed).

 Tom is actually planning to work on something similar with me on this
 hopefully this or next month.

 Regards,
 Mridul


 On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  Hey all,
 
  After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
 key
  to not all fit in memory.  The current ShuffleFetcher.fetch API, which
  doesn't distinguish between keys and values, only returning an
 Iterator[P],
  seems incompatible with this.
 
  Any thoughts on how we could achieve parity here?
 
  -Sandy



Re: Suggestion

2014-04-11 Thread Sandy Ryza
Hi Priya,

Here's a good place to start:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy


On Fri, Apr 11, 2014 at 12:05 PM, priya arora arora.priya4...@gmail.com wrote:

 Hi,

 May I know how one can contribute to this project
 (http://spark.apache.org/mllib/) or to any other project? I am very eager to
 contribute. Do let me know.

 Thanks & Regards,
 Priya Arora



Re: cloudera repo down again - mqtt

2014-03-14 Thread sandy . ryza
Our guys are looking into it. I'll post when things are back up.

-Sandy

 On Mar 14, 2014, at 7:37 AM, Tom Graves tgraves...@yahoo.com wrote:
 
 It appears the cloudera repo for the mqtt stuff is down again. 
 
 Did someone  ping them the last time?  
 
 Can we pick this up from some other repo?
 
 [ERROR] Failed to execute goal 
 org.apache.maven.plugins:maven-remote-resources-plugin:1.4:process (default) 
 on project spark-examples_2.10: Error resolving project artifact: Could not 
 transfer artifact org.eclipse.paho:mqtt-client:pom:0.4.0 from/to 
 cloudera-repo (https://repository.cloudera.com/artifactory/cloudera-repos): 
 peer not authenticated for project org.eclipse.paho:mqtt-client:jar:0.4.0 
 
 Tom


Re: Assigning JIRA's to self

2014-03-12 Thread Sandy Ryza
In the meantime, you don't need to wait for the task to be assigned to you
to start work.  If you're worried about someone else picking it up, you can
drop a short comment on the JIRA saying that you're working on it.


On Wed, Mar 12, 2014 at 3:25 PM, Konstantin Boudnik c...@apache.org wrote:

 Someone with proper karma needs to add you to the contributor list in the
 JIRA.

 Cos

 On Wed, Mar 12, 2014 at 02:19PM, Sujeet Varakhedi wrote:
  Hi,
  I am new to Spark and would like to contribute. I wanted to assign a task
  to myself but looks like I do not have permission. What is the process
 if I
  want to work on a JIRA?
 
  Thanks
  Sujeet



Re: YARN Maven build questions

2014-03-04 Thread Sandy Ryza
Hi Lars,

Unfortunately, due to some incompatible changes we pulled in to be closer
to YARN trunk, Spark-on-YARN does not work against CDH 4.4+ (but does work
against CDH5)

-Sandy


On Tue, Mar 4, 2014 at 6:33 AM, Tom Graves tgraves...@yahoo.com wrote:

 What is your question about ("Any hints")?
 The Maven build worked fine for me again yesterday.

 You should create a JIRA for any pull request, as the documentation
 states.  The JIRA requirement is new, so I think people are still getting
 used to it.

 Tom



 On Tuesday, March 4, 2014 2:51 AM, Lars Francke lars.fran...@gmail.com
 wrote:

 Hi,

 sorry to bother again.

 As a newbie to the project, it's hard to judge whether I'm doing
 anything wrong, the documentation is outdated, the Maven/SBT files
 have diverged from the actual code by defining older or now-incompatible
 versions, or something else is going wrong.

 Any hints?

 Also, an unrelated note/question: I see tons of pull requests being
 accepted without a JIRA, but the documentation says to create a JIRA
 issue first [1]. So I assume it's okay to just send pull requests?

 Thanks for your help.

 Cheers,
 Lars

 [1] 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark


 On Fri, Feb 28, 2014 at 6:41 PM, Lars Francke lars.fran...@gmail.com
 wrote:
  Hey,
 
  so currently it doesn't work because of
  https://github.com/apache/spark/pull/6#issuecomment-36343187
 
  IntelliJ reports a lot of warnings with default settings and I haven't
  found a way to tell IntelliJ to use different Hadoop versions yet.
  mvn clean compile -Pyarn fails as well (compilation error).
 
  Your command does indeed work. The default YARN version is 0.23.7, which
  doesn't seem to work with the default Hadoop 2.2.0 version (anymore?).
 
  I was basically trying to follow the documentation:
  http://spark.incubator.apache.org/docs/latest/building-with-maven.html
 
  mvn clean compile -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.5.0
  -Dyarn.version=2.0.0-cdh4.5.0 fails as well as does mvn clean compile
  -Pyarn-alpha
 
  Thanks for showing me a configuration that works. Unfortunately the
  default ones and at least one of the documented ones fail.
 
  Cheers,
  Lars
 
 
  On Fri, Feb 28, 2014 at 3:05 PM, Tom Graves tgraves...@yahoo.com
 wrote:
  What build command are you using? What do you mean when you say YARN
 branch?
 
  The yarn builds have been working fine for me with maven.   Build
 command I use against hadoop 2.2 or higher: mvn -Dyarn.version=2.2.0
 -Dhadoop.version=2.2.0 -Pyarn clean package -DskipTests
 
  Tom
 
 
 
  On Friday, February 28, 2014 6:14 AM, Lars Francke 
 lars.fran...@gmail.com wrote:
 
  Hey,
 
  I'm trying to dig into Spark's code but am running into a couple of
 problems.
 
  1) The yarn-common directory is not included in the Maven build
  causing things to fail because the dependency is missing. If I see the
  history correct it used to be a Maven module but is not anymore.
 
  2) When I try to include the yarn-common directory in the build things
  start going bad. Compilation failures all over the place and I think
  there are some dependency issues in there as well.
 
  This leads me to believe that either the Maven build system isn't
  maintained for YARN or the whole YARN branch isn't. What's the status
  here?
 
  Without YARN things build fine for me using Maven.
 
  Thanks for your help.
 
  Cheers,
  Lars



Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Sandy Ryza
@patrick - It seems like my point about being able to inherit the root pom
was addressed and there's a way to handle this.

The larger point I meant to make is that Maven is by far the most common
build tool in projects that are likely to share contributors with Spark.  I
personally know 10 people I can go to if I run into a Maven problem, and
maybe 1 if I run into an SBT problem.  Google is also much more likely to
be able to answer a question about Maven than SBT.  I don't know of any
particular functionality Maven has that SBT is lacking, but if it was up to
me I'd still lean towards Maven on the grounds that it's way easier to find
familiarity and expertise with.



On Wed, Feb 26, 2014 at 9:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 @mridul - As far as I know both Maven and Sbt use fairly similar
 processes for building the assembly/uber jar. We actually used to
 package spark with sbt and there were no specific issues we
 encountered and AFAIK sbt respects versioning of transitive
 dependencies correctly. Do you have a specific bug listing for sbt
 that indicates something is broken?

 @sandy - It sounds like you are saying that the CDH build would be
 easier with Maven because you can inherit the POM. However, is this
 just a matter of convenience for packagers or would standardizing on
 sbt limit capabilities in some way? I assume that it would just mean a
 bit more manual work for packagers having to figure out how to set the
 hadoop version in SBT and exclude certain dependencies. For instance,
  what does CDH do about other components like Impala that are not based on
  Maven at all?

 On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote:
  I'd like to propose the following way to move forward, based on the
  comments I've seen:
 
  1.  Aggressively clean up the giant dependency graph.   One ticket I
  might work on if I have time is SPARK-681 which might remove the giant
  fastutil dependency (~15MB by itself).
 
  2.  Take an intermediate step by having only ONE source of truth
  w.r.t. dependencies and versions.  This means either:
 a)  Using a maven POM as the spec for dependencies, Hadoop version,
  etc.   Then, use sbt-pom-reader to import it.
 b)  Using the build.scala as the spec, and sbt make-pom to
  generate the pom.xml for the dependencies
 
  The idea is to remove the pain and errors associated with manual
  translation of dependency specs from one system to another, while
  still maintaining the things which are hard to translate (plugins).
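For illustration only, option (a) might look roughly like the sketch below. This is a hedged sketch: it assumes the sbt-pom-reader plugin exposes a PomBuild base type under com.typesafe.sbt.pom, and the plugin version shown is a placeholder rather than a verified release.

    // project/plugins.sbt -- placeholder version; check the plugin's actual releases
    addSbtPlugin("com.typesafe.sbt" % "sbt-pom-reader" % "1.0.0")

    // project/SparkBuild.scala -- the Maven POMs become the single source of truth
    // for modules, dependency coordinates and Hadoop/YARN versions, while any
    // sbt-only settings can still be layered on top of what they declare.
    import com.typesafe.sbt.pom.PomBuild

    object SparkBuild extends PomBuild

Option (b) is the mirror image: keep the sbt build definition authoritative and run sbt make-pom to emit a pom.xml describing the declared dependencies.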
 
 
  On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com
 wrote:
  We maintain an in-house Spark build using sbt. We have no problem using sbt
  assembly. We did add a few exclude statements for transitive
  dependencies.
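For reference, a minimal sketch of what such exclude statements look like in an sbt build definition; the coordinates below are purely illustrative and are not anyone's actual dependency list:

    // build.sbt -- illustrative coordinates only
    libraryDependencies ++= Seq(
      // Drop one unwanted transitive artifact from a single dependency.
      "org.example" % "some-client" % "1.0" exclude("ch.qos.logback", "logback-classic"),
      // Or exclude everything coming from a conflicting organization.
      "org.example" % "other-client" % "2.0" excludeAll ExclusionRule(organization = "javax.servlet")
    )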
 
   The main enemies of assemblies are jars that include stuff they shouldn't
  (kryo comes to mind, I think they include logback?), new versions of
 jars
  that change the provider/artifact without changing the package (asm),
 and
  incompatible new releases (protobuf). These break the transitive
 resolution
  process. I imagine that's true for any build tool.
 
  Besides shading I don't see anything maven can do sbt cannot, and if I
  understand it correctly shading is not done currently using the build
 tool.
 
   Since Spark is primarily Scala/Akka based, the main developer base will be
   familiar with sbt (I think?). Switching build tools is always painful. I
   personally think it is smarter to put this burden on a limited number of
   upstream integrators than on the community. That said, I don't think
   it's a problem for us to maintain an sbt build in-house if Spark switched
   to Maven.
   The problem is that the complete Spark dependency graph is fairly large,
   and there are a lot of conflicting versions in there.
   In particular, when we bump versions of dependencies, managing this
   becomes messy at best.

   Now, I have not looked in detail at how Maven manages this - it might
   just be accidental that we get a decent out-of-the-box assembled
   shaded jar (since we don't do anything great to configure it).
   With the current state of sbt in Spark, it definitely is not a good
   solution: if we can enhance it (or it already is?), while keeping
   the management of the version/dependency graph manageable, I don't have
   any objections to using sbt or Maven!
   Too many excludes, pinned versions, etc. would just make things
   unmanageable in the future.
 
 
  Regards,
  Mridul
 
 
 
 
  On Wed, Feb 26, 2014 at 8:56 AM, Evan chan e...@ooyala.com wrote:
   Actually, you can control exactly how sbt assembly merges or resolves
   conflicts.  I believe the default settings, however, lead to an ordering
   which cannot be controlled.
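For example, with the sbt-assembly plugin of that era (roughly the 0.11.x line) the merge behaviour can be overridden per path; treat the exact keys and imports below as approximate rather than authoritative:

    // build.sbt -- sketch of per-path conflict handling with sbt-assembly
    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop duplicate manifests/signatures
      case "reference.conf"              => MergeStrategy.concat  // concatenate Typesafe config files
      case _                             => MergeStrategy.first   // otherwise keep the first copy found
    }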
 
  I do wish for a smarter fat jar plugin.
 
  -Evan
  To be free is not merely to cast off one's chains, but to live in a way
   that respects & enhances the freedom of others. (#NelsonMandela)
 
  On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com
  wrote: