Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-11 Thread Joseph Bradley
+1  This will be a great long-term investment for Spark.

On Wed, Feb 6, 2019 at 8:44 AM Marco Gaido  wrote:

> +1 from me as well.
>
> On Wed, Feb 6, 2019 at 16:58, Yanbo Liang wrote:
>
>> +1 for the proposal
>>
>>
>>
>> On Thu, Jan 31, 2019 at 12:46 PM Mingjie Tang  wrote:
>>
>>> +1, this is a very very important feature.
>>>
>>> Mingjie
>>>
>>> On Thu, Jan 31, 2019 at 12:42 AM Xiao Li  wrote:
>>>
>>>> Change my vote from +1 to ++1
>>>>
>>>> On Wed, Jan 30, 2019 at 6:20 AM, Xiangrui Meng wrote:
>>>>
>>>>> Correction: +0 vote doesn't mean "Don't really care". Thanks Ryan for
>>>>> the offline reminder! Below is the Apache official interpretation
>>>>> <https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions>
>>>>> of fraction values:
>>>>>
>>>>> The in-between values are indicative of how strongly the voting
>>>>> individual feels. Here are some examples of fractional votes and ways in
>>>>> which they might be intended and interpreted:
>>>>> +0: 'I don't feel strongly about it, but I'm okay with this.'
>>>>> -0: 'I won't get in the way, but I'd rather we didn't do this.'
>>>>> -0.5: 'I don't like this idea, but I can't find any rational
>>>>> justification for my feelings.'
>>>>> ++1: 'Wow! I like this! Let's do it!'
>>>>> -0.9: 'I really don't like this, but I'm not going to stand in the way
>>>>> if everyone else wants to go ahead with it.'
>>>>> +0.9: 'This is a cool idea and i like it, but I don't have time/the
>>>>> skills necessary to help out.'
>>>>>
>>>>>
>>>>> On Wed, Jan 30, 2019 at 12:31 AM Martin Junghanns
>>>>>  wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> Thanks for the hint! I updated the SPIP accordingly.
>>>>>>
>>>>>> I also changed the access permissions for the SPIP and design sketch
>>>>>> docs so that anyone can comment.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Martin
>>>>>> On 29.01.19 18:59, Dongjoon Hyun wrote:
>>>>>>
>>>>>> Hi, Xiangrui Meng.
>>>>>>
>>>>>> +1 for the proposal.
>>>>>>
>>>>>> However, please update the following section for this vote. As we
>>>>>> see, it seems to be inaccurate because today is Jan. 29th. (Almost
>>>>>> February).
>>>>>> (Since I cannot comment on the SPIP, I replied here.)
>>>>>>
>>>>>> Q7. How long will it take?
>>>>>>
>>>>>> - If accepted by the community by the end of December 2018, we
>>>>>> predict to be feature complete by mid-end March, allowing for QA
>>>>>> during April 2019, making the SPIP part of the next major Spark
>>>>>> release (3.0, ETA May 2019).
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Jan 29, 2019 at 8:52 AM Xiao Li  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Tue, Jan 29, 2019 at 8:14 AM, Jules Damji wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>> (Heard their proposed tech-talk at Spark + A.I summit in London.
>>>>>>>> Well attended & well received.)
>>>>>>>>
>>>>>>>> —
>>>>>>>> Sent from my iPhone
>>>>>>>> Pardon the dumb thumb typos :)
>>>>>>>>
>>>>>>>> On Jan 29, 2019, at 7:30 AM, Denny Lee 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> yay - let's do it!
>>>>>>>>
>>>>>>>> On Tue, Jan 29, 2019 at 6:28 AM Xiangrui Meng 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I want to call for a vote of SPARK-25994
>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-25994>. It
>>>>>>>>> introduces a new DataFrame-based component to Spark, which supports
>>>>>>>>> property graph construction, Cypher queries, and graph algorithms. The
>>>>>>>>> proposal
>>>>>>>>> <https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit>
>>>>>>>>> was made available on user@
>>>>>>>>> <https://lists.apache.org/thread.html/269cbffb04a0fbfe2ec298c3e95f01c05b47b5a72838004d27b74169@%3Cuser.spark.apache.org%3E>
>>>>>>>>> and dev@
>>>>>>>>> <https://lists.apache.org/thread.html/c4c9c9d31caa4a9be3dd99444e597b43f7cd2823e456be9f108e8193@%3Cdev.spark.apache.org%3E>
>>>>>>>>>  to
>>>>>>>>> collect input. You can also find a sketch design doc attached to
>>>>>>>>> SPARK-26028 <https://issues.apache.org/jira/browse/SPARK-26028>.
>>>>>>>>>
>>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>>> vote:
>>>>>>>>>
>>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>> +0: Don't really care.
>>>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>>>> technical reasons.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Xiangrui
>>>>>>>>>
>>>>>>>>

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Joseph Bradley
+1

On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
wrote:

> +1
>
> On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.1.
>>
>> Given that I expect at least a few people to be busy with Spark Summit
>> next
>> week, I'm taking the liberty of setting an extended voting period. The
>> vote
>> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>>
>> It passes with a majority of +1 votes, which must include at least 3 +1
>> votes
>> from the PMC.
>>
>> [ ] +1 Release this package as Apache Spark 2.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>> https://github.com/apache/spark/tree/v2.3.1-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1272/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>>
>> The list of bug fixes going into 2.3.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.1?
>> ===
>>
>> The current list of open tickets targeted at 2.3.1 can be found at:
>> https://s.apache.org/Q3Uo
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Joseph Bradley
+1

On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
wrote:

> +1
>
> On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:
>
>> +1
>>
>> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>>
>>> +1
>>>
>>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>>
>>>> Hi all,
>>>>
>>>> I want to call for a vote of SPARK-24374
>>>> <https://issues.apache.org/jira/browse/SPARK-24374>. It introduces a
>>>> new execution mode to Spark, which would help both integration with
>>>> external DL/AI frameworks and MLlib algorithm performance. This is one of
>>>> the follow-ups from a previous discussion on dev@
>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Integrating-ML-DL-frameworks-with-Spark-td23913.html>
>>>> .
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Best,
>>>> Xiangrui
>>>> --
>>>>
>>>> Xiangrui Meng
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Joseph Bradley
Hossein might be slow to respond (OOO), but I just commented on the JIRA.
I'd recommend we follow the same process as the SparkR package.

+1 on this from me (and I'll be happy to help shepherd it, though Felix and
Shivaram are the experts in this area).  CRAN presents challenges, but this
is a good step towards making R a first-class citizen for ML use cases of
Spark.

On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hossein -- Can you clarify what the resolution was on the repository /
> release issue discussed in the SPIP?
>
> Shivaram
>
> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
> wrote:
> > +1
> > With my concerns in the SPIP discussion.
> >
> > 
> > From: Hossein 
> > Sent: Wednesday, May 30, 2018 2:03:03 PM
> > To: dev@spark.apache.org
> > Subject: [VOTE] SPIP ML Pipelines in R
> >
> > Hi,
> >
> > I started a discussion thread for a new R package to expose MLlib
> > pipelines in R.
> >
> > To summarize, we will work on utilities to generate R wrappers for the
> > MLlib pipeline API in a new R package. This will lower the burden of
> > exposing new APIs in the future.
> >
> > Following the SPIP process, I am proposing the SPIP for a vote.
> >
> > +1: Let's go ahead and implement the SPIP.
> > +0: Don't really care.
> > -1: I do not think this is a good idea for the following reasons.
> >
> > Thanks,
> > --Hossein
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Revisiting Online serving of Spark models?

2018-05-21 Thread Joseph Bradley
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of
Parquet.  It's easier to parse JSON without Spark, and using the same
format simplifies architecture.  Plus, some people want to check files into
version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just
like DataFrame reader/writers) to handle JSON (and maybe, eventually,
handle Parquet in the online serving setting).
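
To make the format-parameter idea above concrete, here is a rough Scala sketch (purely illustrative: the .format(...) call is the proposed extension, not an existing MLWriter method, and the paths are made up):

```
import org.apache.spark.ml.PipelineModel

// Load a previously fitted pipeline (path is hypothetical).
val model = PipelineModel.load("/models/my-pipeline")

// Proposed: mirror DataFrameWriter.format(...) so callers can pick JSON for
// easy parsing outside Spark, or Parquet for today's default behavior.
// NOTE: .format(...) is the suggested extension, not a current API.
model.write.format("json").save("/models/my-pipeline-json")
```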

This would be a big project, so proposing a SPIP might be best.  If people
are around at the Spark Summit, that could be a good time to meet up & then
post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Specifically I’d like bring part of the discussion to Model and
> PipelineModel, and various ModelReader and SharedReadWrite implementations
> that rely on SparkContext. This is a big blocker on reusing  trained models
> outside of Spark for online serving.
>
> What’s the next step? Would folks be interested in getting together to
> discuss/get some feedback?
>
>
> _
> From: Felix Cheung <felixcheun...@hotmail.com>
> Sent: Thursday, May 10, 2018 10:10 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <
> jos...@databricks.com>
> Cc: dev <dev@spark.apache.org>
>
>
>
> Huge +1 on this!
>
> --
> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
> Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Thursday, May 10, 2018 9:39:26 AM
> *To:* Joseph Bradley
> *Cc:* dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
>
>
> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>
>> Awesome! I'm glad other folks think something like this belongs in Spark.
>
>> This was one of the original goals for mllib-local: to have local
>> versions of MLlib models which could be deployed without the big Spark JARs
>> and without a SparkContext or SparkSession.  There are related commercial
>> offerings like this : ) but the overhead of maintaining those offerings is
>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>> libraries will be well worth it.
>>
>> We've talked about this need at Databricks and have also been syncing
>> with the creators of MLeap.  It'd be great to get this functionality into
>> Spark itself.  Some thoughts:
>> * It'd be valuable to have this go beyond adding transform() methods
>> taking a Row to the current Models.  Instead, it would be ideal to have
>> local, lightweight versions of models in mllib-local, outside of the main
>> mllib package (for easier deployment with smaller & fewer dependencies).
>> * Supporting Pipelines is important.  For this, it would be ideal to
>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>> moved into a local sql package.
>> * This architecture may require some awkward APIs currently to have model
>> prediction logic in mllib-local, local model classes in mllib-local, and
>> regular (DataFrame-friendly) model classes in mllib.  We might find it
>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>> architecture while making it feasible for 3rd party developers to extend
>> MLlib APIs (especially in Java).
>>
> I agree this could be interesting, and feed into the other discussion
> around when (or if) we should be considering Spark 3.0
> I _think_ we could probably do it with optional traits people could mix in
> to avoid breaking the current APIs but I could be wrong on that point.
>
>> * It could also be worth discussing local DataFrames.  They might not be
>> as important as per-Row transformations, but they would be helpful for
>> batching for higher throughput.
>>
> That could be interesting as well.
>
>>
>> I'll be interested to hear others' thoughts too!
>>
>> Joseph
>>
>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Hi y'all,
>>>
>>> With the renewed interest in ML in Apache Spark now seems like a good a
>>> time as any to revisit the online serving situation in Spark ML. DB &
>>> other's have done some excellent working moving a lot of the necessary
>>> tools into a local linear algebra package that doesn't depend on having a
>>> SparkContext.

Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Joseph Bradley
Thanks for bringing this up Holden!  I'm a strong supporter of this.

This was one of the original goals for mllib-local: to have local versions
of MLlib models which could be deployed without the big Spark JARs and
without a SparkContext or SparkSession.  There are related commercial
offerings like this : ) but the overhead of maintaining those offerings is
pretty high.  Building good APIs within MLlib to avoid copying logic across
libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with
the creators of MLeap.  It'd be great to get this functionality into Spark
itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking
a Row to the current Models.  Instead, it would be ideal to have local,
lightweight versions of models in mllib-local, outside of the main mllib
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to
utilize elements of Spark SQL, particularly Rows and Types, which could be
moved into a local sql package.
* This architecture may require some awkward APIs currently to have model
prediction logic in mllib-local, local model classes in mllib-local, and
regular (DataFrame-friendly) model classes in mllib.  We might find it
helpful to break some DeveloperApis in Spark 3.0 to facilitate this
architecture while making it feasible for 3rd party developers to extend
MLlib APIs (especially in Java).
* It could also be worth discussing local DataFrames.  They might not be as
important as per-Row transformations, but they would be helpful for
batching for higher throughput.
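
To make the per-row serving idea above concrete, here is a rough Scala sketch of the kind of optional trait being discussed (the name and shape are illustrative assumptions, not an existing Spark API):

```
import org.apache.spark.sql.Row

// Illustrative only: one possible shape for local, SparkSession-free prediction.
trait LocalTransformer extends Serializable {
  /** Transform a single input row without a SparkContext or SparkSession. */
  def transformRow(row: Row): Row

  /** Optional batching over a small local collection for higher throughput. */
  def transformLocal(rows: Seq[Row]): Seq[Row] = rows.map(transformRow)
}
```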

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi y'all,
>
> With the renewed interest in ML in Apache Spark now seems like a good a
> time as any to revisit the online serving situation in Spark ML. DB &
> other's have done some excellent working moving a lot of the necessary
> tools into a local linear algebra package that doesn't depend on having a
> SparkContext.
>
> There are a few different commercial and non-commercial solutions round
> this, but currently our individual transform/predict methods are private so
> they either need to copy or re-implement (or put them selves in
> org.apache.spark) to access them. How would folks feel about adding a new
> trait for ML pipeline stages to expose to do transformation of single
> element inputs (or local collections) that could be optionally implemented
> by stages which support this? That way we can have less copy and paste code
> possibly getting out of sync with our model training.
>
> I think continuing to have on-line serving grow in different projects is
> probably the right path, forward (folks have different needs), but I'd love
> to see us make it simpler for other projects to build reliable serving
> tools.
>
> I realize this maybe puts some of the folks in an awkward position with
> their own commercial offerings, but hopefully if we make it easier for
> everyone the commercial vendors can benefit as well.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


SparkR test failures in PR builder

2018-05-02 Thread Joseph Bradley
Hi all,

Does anyone know why the PR builder keeps failing on SparkR's CRAN checks?
I've seen this in a lot of unrelated PRs.  E.g.:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console

Hossein spotted this line:
```
* checking CRAN incoming feasibility ...Error in
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```
and suggested that it could be CRAN flakiness.  I'm not familiar with CRAN,
but do others have thoughts about how to fix this?

Thanks!
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Joseph Bradley
Thank you Shane!!

On Tue, May 1, 2018 at 8:58 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Thank you very much, Shane! Yeah, it works now!
>
> Xiao
>
>
> 2018-05-01 8:40 GMT-07:00 shane knapp <skn...@berkeley.edu>:
>
>> and we're back!  there was apparently a firewall migration yesterday that
>> went sideways.
>>
>> shane
>>
>> On Mon, Apr 30, 2018 at 8:27 PM, shane knapp <skn...@berkeley.edu> wrote:
>>
>>> we just noticed that we're unable to connect to jenkins, and have
>>> reached out to our NOC support staff at our colo.  until we hear back,
>>> there's nothing we can do.
>>>
>>> i'll update the list as soon as i hear something.  sorry for the
>>> inconvenience!
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Joseph Bradley
Thanks for the thoughts!  We've gone back and forth quite a bit about local
linear algebra support in Spark.  For reference, there have been some
discussions here:
https://issues.apache.org/jira/browse/SPARK-6442
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-19653

Overall, I like the idea of improving linear algebra support, especially
given the rise of Python numerical processing & deep learning.  But some
considerations I'd list include:
* There are great linear algebra libraries out there, and it would be ideal
to reuse those as much as possible.
* SQL support for linear algebra can be a separate effort from expanding
linear algebra primitives.
* It would be valuable to discuss external types as UDTs (which can be
hacked with numpy and scipy types now) vs. adding linear algebra types to
native Spark SQL.
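
For reference, today this kind of vector math is typically done through UDFs over the existing ml.linalg Vector UDT plus VectorAssembler; a minimal Scala sketch with made-up data and column names:

```
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("vector-udf-sketch").getOrCreate()
import spark.implicits._

// Made-up data: three scalar features plus a per-row weight vector.
val df = Seq(
  (1.0, 2.0, 3.0, Vectors.dense(0.5, 0.5, 0.5)),
  (4.0, 5.0, 6.0, Vectors.dense(0.1, 0.2, 0.3))
).toDF("x1", "x2", "x3", "weights")

// The construction step mentioned above: assemble scalar columns into a Vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
  .transform(df)

// Today, vector math happens in UDFs rather than as native column operations.
val dot = udf { (a: Vector, b: Vector) =>
  a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
}
assembled.withColumn("dot", dot($"features", $"weights")).show()
```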


On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <leif.wa...@gmail.com> wrote:

> Hi all,
>
> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
> I’ve found myself wanting more.
>
> There is a minor issue in that with the arrow serialization enabled, these
> types don’t serialize properly in python UDF calls or in toPandas. There’s
> a natural representation for them in numpy.ndarray, and I’ve started a
> conversation with the arrow community about supporting tensor-valued
> columns, but that might be a ways out. In the meantime, I think we can fix
> this by using the FixedSizeBinary column type in arrow, together with some
> metadata describing the tensor shape (list of dimension sizes).
>
> The larger issue, for which I intend to submit an SPIP soon, is that these
> types could be better supported at the API layer, regardless of
> serialization. In the limit, we could consider the entire numpy ndarray
> surface area as a target. At the minimum, what I’m thinking is that these
> types should support column operations like matrix multiply, transpose,
> inner and outer product, etc., and maybe have a more ergonomic construction
> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
> VectorAssembler API is kind of clunky.
>
> One possibility here is to restrict the tensor column types such that
> every value must have the same shape, e.g. a 2x2 matrix. This would allow
> for operations to check validity before execution, for example, a matrix
> multiply could check dimension match and fail fast. However, there might be
> use cases for a column to contain variable shape tensors, I’m open to
> discussion here.
>
> What do you all think?
> --
> --
> Cheers,
> Leif
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Joseph Bradley
Welcome!

On Mon, Apr 2, 2018 at 11:00 AM, Takuya UESHIN <ues...@happy-camper.st>
wrote:

> Congratulations!
>
> On Mon, Apr 2, 2018 at 10:34 AM, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Apr 2, 2018 at 07:57 Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> Congrats!
>>>
>>> On Mon, Apr 2, 2018 at 12:28 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>> > Hi all,
>>> >
>>> > The Spark PMC recently added Zhenhua Wang as a committer on the
>>> project.
>>> > Zhenhua is the major contributor of the CBO project, and has been
>>> > contributing across several areas of Spark for a while, focusing
>>> especially
>>> > on analyzer, optimizer in Spark SQL. Please join me in welcoming
>>> Zhenhua!
>>> >
>>> > Wenchen
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Spark.ml roadmap 2.3.0 and beyond

2018-03-20 Thread Joseph Bradley
The promised roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-23758

Note it doesn't have much explicitly listed yet, but committers can add
items as they agree to shepherd them.  (Committers, make sure to check what
you're currently listed as shepherding!)  The links for searching can be
useful too.

On Thu, Dec 7, 2017 at 3:55 PM, Stephen Boesch <java...@gmail.com> wrote:

> Thanks Joseph.  We can wait for post 2.3.0.
>
> 2017-12-07 15:36 GMT-08:00 Joseph Bradley <jos...@databricks.com>:
>
>> Hi Stephen,
>>
>> I used to post those roadmap JIRAs to share instructions for contributing
>> to MLlib and to try to coordinate amongst committers.  My feeling was that
>> the coordination aspect was of mixed success, so I did not post one for
>> 2.3.  I'm glad you pinged about this; if those were useful, then I can plan
>> on posting one for the release after 2.3.  As far as identifying
>> committers' plans, the best option right now is to look for Shepherds in
>> JIRA as well as the few mailing list threads about directions.
>>
>> For myself, I'm mainly focusing on fixing some issues with persistence
>> for custom algorithms in PySpark (done), adding the image schema (done),
>> and using ML Pipelines in Structured Streaming (WIP).
>>
>> Joseph
>>
>> On Wed, Nov 29, 2017 at 6:52 AM, Stephen Boesch <java...@gmail.com>
>> wrote:
>>
>>> There are several JIRAs and/or PRs that contain logic the Data
>>> Science teams that I work with use in their local models. We are trying to
>>> determine if/when these features may gain traction again.  In at least one
>>> case all of the work was done, but the shepherd said that getting it
>>> committed was of lower priority than other tasks - one specifically
>>> mentioned was the mllib/ml parity that has been ongoing for nearly three
>>> years.
>>>
>>> In order to prioritize work that the ML platform would do it would be
>>> helpful to know at least which if any of those tasks were going to be moved
>>> ahead by the community: since we could then focus on other ones instead of
>>> duplicating the effort.
>>>
>>> In addition there are some engineering code jam sessions that happen
>>> periodically: knowing which features are actively on the roadmap would
>>> *certainly* influence our selection of work.  The roadmaps from 2.2.0 and earlier
>>> were a very good starting point to understand not just the specific work in
>>> progress - but also the current mindset/thinking of the committers in terms
>>> of general priorities.
>>>
>>> So if the same format of document is not available - then what content
>>> *is* available that gives a picture of where spark.ml is headed?
>>>
>>> 2017-11-29 6:39 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>>
>>>> Any further information/ thoughts?
>>>>
>>>>
>>>>
>>>> 2017-11-22 15:07 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>>>
>>>>> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:
>>>>>
>>>>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813
>>>>>
>>>>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
>>>>> ..
>>>>>
>>>>> It seems those roadmaps were not available per se' for 2.3.0 and
>>>>> later? Is there a different mechanism for that info?
>>>>>
>>>>> stephenb
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [MLlib] QuantRegForest

2018-03-09 Thread Joseph Bradley
Hi Hadrien,

That does sound useful, but just to warn you, it can take a while to get
new algorithms into MLlib itself.  You can definitely make a case for that
on the Spark JIRA.  In the meantime, I'd recommend submitting it to Spark
Packages https://spark-packages.org/  which won't require waiting.  These
helper tools are useful for that:
https://github.com/databricks/spark-package-cmd-tool
https://github.com/databricks/sbt-spark-package
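
As a rough sketch of how those helper tools are wired up (the plugin coordinates and setting keys below are from memory of the sbt-spark-package README, so treat them as assumptions and double-check against the repos above):

```
// project/plugins.sbt -- resolver, coordinates, and version are illustrative assumptions
resolvers += "Spark Packages repo" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")

// build.sbt -- the spName/sparkVersion/sparkComponents keys come from the plugin
spName := "yourorg/quant-reg-forest"   // hypothetical Spark Package name
sparkVersion := "2.3.0"
sparkComponents ++= Seq("sql", "mllib")
```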

Thanks!
Joseph

On Fri, Mar 9, 2018 at 12:34 AM, Hadrien <chicault.hadr...@gmail.com> wrote:

> Hi,
>
> we implemented a QuantRegForest to be used with Spark. We coded it in
> Scala.
> I don't know if you would be interested, but we offer to share it with you
> (btw the original implementation is in R and called quantregForest:
> https://cran.r-project.org/web/packages/quantregForest/index.html)
>
> Can't wait to hear from you!
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Welcoming some new committers

2018-03-09 Thread Joseph Bradley
Congratulations!

On Mon, Mar 5, 2018 at 2:19 PM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> Thanks all! :D
>
> On Mon, Mar 5, 2018 at 9:01 AM, Bryan Cutler <cutl...@gmail.com> wrote:
>
>> Thanks everyone, this is very exciting!  I'm looking forward to working
>> with you all and helping out more in the future.  Also, congrats to the
>> other committers as well!!
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Spark.ml roadmap 2.3.0 and beyond

2017-12-07 Thread Joseph Bradley
Hi Stephen,

I used to post those roadmap JIRAs to share instructions for contributing
to MLlib and to try to coordinate amongst committers.  My feeling was that
the coordination aspect was of mixed success, so I did not post one for
2.3.  I'm glad you pinged about this; if those were useful, then I can plan
on posting one for the release after 2.3.  As far as identifying
committers' plans, the best option right now is to look for Shepherds in
JIRA as well as the few mailing list threads about directions.

For myself, I'm mainly focusing on fixing some issues with persistence for
custom algorithms in PySpark (done), adding the image schema (done), and
using ML Pipelines in Structured Streaming (WIP).

Joseph

On Wed, Nov 29, 2017 at 6:52 AM, Stephen Boesch <java...@gmail.com> wrote:

> There are several JIRAs and/or PRs that contain logic the Data Science
> teams that I work with use in their local models. We are trying to
> determine if/when these features may gain traction again.  In at least one
> case all of the work was done, but the shepherd said that getting it
> committed was of lower priority than other tasks - one specifically
> mentioned was the mllib/ml parity that has been ongoing for nearly three
> years.
>
> In order to prioritize work that the ML platform would do it would be
> helpful to know at least which if any of those tasks were going to be moved
> ahead by the community: since we could then focus on other ones instead of
> duplicating the effort.
>
> In addition there are some engineering code jam sessions that happen
> periodically: knowing which features are actively on the roadmap would
> *certainly* influence our selection of work.  The roadmaps from 2.2.0 and earlier
> were a very good starting point to understand not just the specific work in
> progress - but also the current mindset/thinking of the committers in terms
> of general priorities.
>
> So if the same format of document is not available - then what content
> *is* available that gives a picture of where spark.ml is headed?
>
> 2017-11-29 6:39 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>
>> Any further information/ thoughts?
>>
>>
>>
>> 2017-11-22 15:07 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>
>>> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:
>>>
>>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813
>>>
>>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
>>> ..
>>>
>>> It seems those roadmaps were not available per se' for 2.3.0 and later?
>>> Is there a different mechanism for that info?
>>>
>>> stephenb
>>>
>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [ML] Migrating transformers from mllib to ml

2017-11-07 Thread Joseph Bradley
Hi, we do still want to do this migration; it's just been a bit stalled due
to low bandwidth.  There are still a few feature parity items which need to
be completed, so the deprecation will likely not happen until after 2.3.
Joseph

On Tue, Nov 7, 2017 at 12:38 AM, 颜发才(Yan Facai) <facai@gmail.com> wrote:

> Hi, I have migrated HashingTF from mllib to ml, and wait for review.
>
> see:
> [SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML
> #18998
> https://github.com/apache/spark/pull/18998
>
>
>
> On Mon, Nov 6, 2017 at 10:58 PM, Marco Gaido <marcogaid...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I saw that there are several TODOs to migrate some transformers (like
>> HashingTF and IDF) to use only ml.Vector in order to avoid the overhead of
>> converting them to the mllib ones and back.
>>
>> Is there any reason why this has not been done so far? Is it to avoid
>> code duplication? If so, is it still an issue since we are going to
>> deprecate mllib from 2.3 (at least this is what I read on Spark docs)? If
>> no, I can work on this.
>>
>> Thanks,
>> Marco
>>
>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Joseph Bradley
+1

On Mon, Nov 6, 2017 at 5:11 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> +1
>
> On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> +1
>>
>> 2017-11-04 11:00 GMT-07:00 Burak Yavuz <brk...@gmail.com>:
>>
>>> +1
>>>
>>> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu <weichen...@databricks.com>
>>>> wrote:
>>>>
>>>>> +1.
>>>>>
>>>>> On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia <matei.zaha...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> +1 from me too.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > +1.
>>>>>> >
>>>>>> > I think this architecture makes a lot of sense to let executors
>>>>>> talk to source/sink directly, and bring very low latency.
>>>>>> >
>>>>>> > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen <so...@cloudera.com>
>>>>>> wrote:
>>>>>> > +0 simply because I don't feel I know enough to have an opinion. I
>>>>>> have no reason to doubt the change though, from a skim through the doc.
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>> > Earlier I sent out a discussion thread for CP in Structured
>>>>>> Streaming:
>>>>>> >
>>>>>> > https://issues.apache.org/jira/browse/SPARK-20928
>>>>>> >
>>>>>> > It is meant to be a very small, surgical change to Structured
>>>>>> Streaming to enable ultra-low latency. This is great timing because we 
>>>>>> are
>>>>>> also designing and implementing data source API v2. If designed properly,
>>>>>> we can have the same data source API working for both streaming and 
>>>>>> batch.
>>>>>> >
>>>>>> >
>>>>>> > Following the SPIP process, I'm putting this SPIP up for a vote.
>>>>>> >
>>>>>> > +1: Let's go ahead and design / implement the SPIP.
>>>>>> > +0: Don't really care.
>>>>>> > -1: I do not think this is a good idea for the following reasons.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 224-436-0783
>>>> Greater Chicago
>>>>
>>>
>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: HashingTFModel/IDFModel in Structured Streaming

2017-10-20 Thread Joseph Bradley
Hi Davis,
We've started tracking these issues under this umbrella:
https://issues.apache.org/jira/browse/SPARK-21926
I'm hoping we can fix some of these for 2.3.
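
For anyone following along, the pattern Davis describes is roughly the following Scala sketch (the model path, input column, and socket source are placeholders; with HashingTF/IDF stages in the pipeline this is where the error quoted below shows up):

```
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-on-stream-sketch").getOrCreate()

// A pipeline fitted and saved earlier on static data (path is a placeholder).
val model = PipelineModel.load("/models/sentiment-pipeline")

// A streaming source; socket is just a stand-in for the real tweet stream.
val tweets = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .toDF("tweet")   // rename "value" to whatever column the pipeline expects

// transform() on a streaming DataFrame; with HashingTF/IDF stages this is
// where the "writeStream.start()" error appears today (tracked in SPARK-21926).
val scored = model.transform(tweets)

scored.writeStream
  .format("console")
  .start()
  .awaitTermination()
```
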
Thanks,
Joseph

On Mon, Oct 16, 2017 at 9:23 PM, Davis Varghese <vergh...@gmail.com> wrote:

>  I have built an ML pipeline model on static Twitter data for sentiment
> analysis. When I use the model on a structured stream, it always throws
> "Queries with streaming sources must be executed with writeStream.start()".
> This particular model doesn't contain any documented "unsupported"
> operations. It only calls the transform() method of the stages. Has anyone
> encountered the issue? If the model doesn't contain HashingTFModel/IDFModel,
> it works fine, but then I cannot create feature vectors from the tweet.
>
> Thanks,
> Davis
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: SparkR is now available on CRAN

2017-10-20 Thread Joseph Bradley
Awesome, this is a big step for Spark!

On Thu, Oct 12, 2017 at 12:06 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> That's wonderful news! :) Now we have Spark on CRAN, PyPI, and Maven, so
> the on-ramp should be easy for everyone. Excited to see more SparkR users
> joining us :)
>
> On Thu, Oct 12, 2017 at 11:25 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> This is huge!
>>
>>
>> On Thu, Oct 12, 2017 at 11:21 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Hi all
>>>
>>> I'm happy to announce that the most recent release of Spark, 2.1.2 is
>>> now available for download as an R package from CRAN at
>>> https://cran.r-project.org/web/packages/SparkR/ . This makes it easy to
>>> get started with SparkR for new R users and the package includes code to
>>> download the corresponding Spark binaries.
>>> https://issues.apache.org/jira/browse/SPARK-15799 has more details on this.
>>>
>>> Many thanks to everyone who helped put this together -- especially Felix
>>> Cheung for making a number of fixes to meet the CRAN requirements and
>>> Holden Karau for the 2.1.2 release.
>>>
>>> Thanks
>>> Shivaram
>>>
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-27 Thread Joseph Bradley
This vote passes with 11 +1s (4 binding) and no +0s or -1s.

+1:
Sean Owen (binding)
Holden Karau
Denny Lee
Reynold Xin (binding)
Joseph Bradley (binding)
Noman Khan
Weichen Xu
Yanbo Liang
Dongjoon Hyun
Matei Zaharia (binding)
Vaquar Khan

Thanks everyone!
Joseph

On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan <vaquar.k...@gmail.com> wrote:

> +1 looks good,
>
> Regards,
> Vaquar khan
>
> On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> +1; we should consider something similar for multi-dimensional tensors
>> too.
>>
>> Matei
>>
>> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>> >
>> > +1
>> >
>> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbp...@live.com>
>> wrote:
>> > +1
>> >
>> > Regards
>> > Noman
>> > From: Denny Lee <denny.g@gmail.com>
>> > Sent: Friday, September 22, 2017 2:59:33 AM
>> > To: Apache Spark Dev; Sean Owen; Tim Hunter
>> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
>> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>> >
>> > +1
>> >
>> > On Thu, Sep 21, 2017 at 11:15 Sean Owen <so...@cloudera.com> wrote:
>> > Am I right that this doesn't mean other packages would use this
>> representation, but that they could?
>> >
>> > The representation looked fine to me w.r.t. what DL frameworks need.
>> >
>> > My previous comment was that this is actually quite lightweight. It's
>> kind of like how I/O support is provided for CSV and JSON, so makes enough
>> sense to add to Spark. It doesn't really preclude other solutions.
>> >
>> > For those reasons I think it's fine. +1
>> >
>> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <timhun...@databricks.com>
>> wrote:
>> > Hello community,
>> >
>> > I would like to call for a vote on SPARK-21866. It is a short proposal
>> that has important applications for image processing and deep learning.
>> Joseph Bradley has offered to be the shepherd.
>> >
>> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>> > PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>> >
>> > Background and motivation
>> > As Apache Spark is being used more and more in the industry, some new
>> use cases are emerging for different data formats beyond the traditional
>> SQL types or the numerical types (vectors and matrices). Deep Learning
>> applications commonly deal with image processing. A number of projects add
>> some Deep Learning capabilities to Spark (see list below), but they
>> struggle to communicate with each other or with MLlib pipelines because
>> there is no standard way to represent an image in Spark DataFrames. We
>> propose to federate efforts for representing images in Spark by defining a
>> representation that caters to the most common needs of users and library
>> developers.
>> > This SPIP proposes a specification to represent images in Spark
>> DataFrames and Datasets (based on existing industrial standards), and an
>> interface for loading sources of images. It is not meant to be a
>> full-fledged image processing library, but rather the core description that
>> other libraries and users can rely on. Several packages already offer
>> various processing facilities for transforming images or doing more complex
>> operations, and each has various design tradeoffs that make them better as
>> standalone solutions.
>> > This project is a joint collaboration between Microsoft and Databricks,
>> which have been testing this design in two open source packages: MMLSpark
>> and Deep Learning Pipelines.
>> > The proposed image format is an in-memory, decompressed representation
>> that targets low-level applications. It is significantly more liberal in
>> memory usage than compressed image representations such as JPEG, PNG, etc.,
>> but it allows easy communication with popular image processing libraries
>> and has no decoding overhead.
>> > Targets users and personas:
>> > Data scientists, data engineers, library developers.
>> > The following libraries define primitives for loading and representing
>> images, and will gain from a common interchange format (in alphabetical
>> order):
>> >   • BigDL
>> >   • DeepLearning4J
>> >   • Deep Learning

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-30 Thread Joseph Bradley
Congrats!

On Aug 29, 2017 9:55 AM, "Felix Cheung"  wrote:

> Congrats!
>
> --
> *From:* Wenchen Fan 
> *Sent:* Tuesday, August 29, 2017 9:21:38 AM
> *To:* Kevin Yu
> *Cc:* Meisam Fathi; dev
> *Subject:* Re: Welcoming Saisai (Jerry) Shao as a committer
>
> Congratulations, Saisai!
>
> On 29 Aug 2017, at 10:38 PM, Kevin Yu  wrote:
>
> Congratulations, Jerry!
>
> On Tue, Aug 29, 2017 at 6:35 AM, Meisam Fathi 
> wrote:
>
>> Congratulations, Jerry!
>>
>> Thanks,
>> Meisam
>>
>> On Tue, Aug 29, 2017 at 1:13 AM Wang, Carson 
>> wrote:
>>
>>> Congratulations, Saisai!
>>>
>>>
>>> -Original Message-
>>> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
>>> Sent: Tuesday, August 29, 2017 9:29 AM
>>> To: dev 
>>> Cc: Saisai Shao 
>>> Subject: Welcoming Saisai (Jerry) Shao as a committer
>>>
>>> Hi everyone,
>>>
>>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
>>> has been contributing to many areas of the project for a long time, so it’s
>>> great to see him join. Join me in thanking and congratulating him!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-10 Thread Joseph Bradley
Congrats!

On Aug 8, 2017 9:31 PM, "Minho Kim"  wrote:

> Congrats, Hyukjin and Sameer!!
>
> 2017-08-09 9:55 GMT+09:00 Sandeep Joshi :
>
>> Congratulations Hyukjin and Sameer !
>>
>> On 7 Aug 2017 9:23 p.m., "Matei Zaharia"  wrote:
>>
>>> Hi everyone,
>>>
>>> The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
>>> committers. Join me in congratulating both of them and thanking them for
>>> their contributions to the project!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Joseph Bradley
+1

On Sat, Jul 1, 2017 at 7:49 AM, Sean Owen <so...@cloudera.com> wrote:

> +1 binding. Same as last time. All tests pass with -Phive -Phadoop-2.7
> -Pyarn, all sigs and licenses look OK.
>
> We have one issue opened yesterday for 2.2.0:
> https://issues.apache.org/jira/browse/SPARK-21267
>
> I assume this isn't really meant to be in this release, and sounds
> non-essential, so OK.
>
> On Sat, Jul 1, 2017 at 2:45 AM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc6
>> <https://github.com/apache/spark/tree/v2.2.0-rc6> (a2c7b2133cfee7fa9abfaa2bfbfb637155466783)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Joseph Bradley
Re: the QA JIRAs:
Thanks for discussing them.  I still feel they are very helpful; I
particularly notice not having to spend a solid 2-3 weeks of time QAing
(unlike in earlier Spark releases).  One other point not mentioned above: I
think they serve as a very helpful reminder/training for the community for
rigor in development.  Since we instituted QA JIRAs, contributors have been
a lot better about adding in docs early, rather than waiting until the end
of the cycle (though I know this is drawing conclusions from correlations).

I would vote in favor of the RC...but I'll wait to see about the reported
failures.

On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com> wrote:

> Different errors as in https://issues.apache.org/jira/browse/SPARK-20520 but
> that's also reporting R test failures.
>
> I went back and tried to run the R tests and they passed, at least on
> Ubuntu 17 / R 3.3.
>
>
> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> All Scala, Python tests pass. ML QA and doc issues are resolved (as well
>> as R it seems).
>>
>> However, I'm seeing the following test failure on R consistently:
>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>
>>
>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com> wrote:
>>
>>> +1 non-binding
>>>
>>> Tested on macOS Sierra, Ubuntu 16.04
>>> test suite includes various test cases including Spark SQL, ML,
>>> GraphFrames, Structured Streaming
>>>
>>>
>>> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>>
>>>> +1 non-binding
>>>>
>>>> Regards,
>>>> vaquar khan
>>>>
>>>> On Jun 7, 2017 4:32 PM, "Ricardo Almeida" <ricardo.alme...@actnowib.com>
>>>> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn
>>>> -Phive -Phive-thriftserver -Pscala-2.11 on
>>>>
>>>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>>>
>>>>
>>>> On 5 June 2017 at 21:14, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc4
>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc4> (377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1241/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>
>>>>
>>>>
>>>>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
Hi Spark community,

I'd like to announce a new release of GraphFrames, a Spark Package for
DataFrame-based graphs!

*We strongly encourage all users to use this latest release for the bug fix
described below.*

*Critical bug fix*
This release fixes a bug in indexing vertices.  This may have affected your
results if:
* your graph uses non-Integer IDs and
* you use ConnectedComponents and other algorithms which are wrappers
around GraphX.
The bug occurs when the input DataFrame is non-deterministic. E.g., running
an algorithm on a DataFrame just loaded from disk should be fine in
previous releases, but running that algorithm on a DataFrame produced using
shuffling, unions, and other operators can cause incorrect results. This
issue is fixed in this release.

*New features*
* Python API for aggregateMessages for building custom graph algorithms
* Scala API for parallel personalized PageRank, wrapping the GraphX
implementation. This is only available when using GraphFrames with Spark
2.1+.
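
For reference, a minimal Scala sketch of aggregateMessages on a toy graph, following the builder described in the GraphFrames user guide (the new Python API mirrors this builder; the data below is made up):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages

val spark = SparkSession.builder().appName("graphframes-am-sketch").getOrCreate()
import spark.implicits._

// Toy graph: vertices need an "id" column; edges need "src" and "dst".
val vertices = Seq(("a", 34), ("b", 36), ("c", 30)).toDF("id", "age")
val edges = Seq(("a", "b"), ("b", "c"), ("c", "a")).toDF("src", "dst")
val g = GraphFrame(vertices, edges)

// Send each neighbor's age along every edge and sum the incoming messages per vertex.
val AM = AggregateMessages
val summedAges = g.aggregateMessages
  .sendToSrc(AM.dst("age"))   // message delivered to the edge's source vertex
  .sendToDst(AM.src("age"))   // message delivered to the edge's destination vertex
  .agg(sum(AM.msg).as("summedAges"))

summedAges.show()
```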

Support for Spark 1.6, 2.0, and 2.1

*Special thanks to Felix Cheung for his work as a new committer for
GraphFrames!*

*Full release notes*:
https://github.com/graphframes/graphframes/releases/tag/release-0.5.0
*Docs*: http://graphframes.github.io/
*Spark Package*: https://spark-packages.org/package/graphframes/graphframes
*Source*: https://github.com/graphframes/graphframes

Thanks to all contributors and to the community for feedback!
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-16 Thread Joseph Bradley
All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
everyone who helped out on those!

We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
essentially all for documentation.

Joseph

On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
> fixed, since the change that caused it is in branch-2.2. Probably a
> good idea to raise it to blocker at this point.
>
> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
> create
> > another RC once ML has had time to take care of the more critical
> problems.
> > In the meantime please keep testing this release!
> >
> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
> > wrote:
> >>
> >> +1 (non-binding)
> >>
> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
> for
> >> core have passed.
> >>
> >> $ java -version
> >> openjdk version "1.8.0_111"
> >> OpenJDK Runtime Environment (build
> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
> >> package install
> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
> >> ...
> >> Run completed in 15 minutes, 12 seconds.
> >> Total number of tests run: 1940
> >> Suites: completed 206, aborted 0
> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
> >> All tests passed.
> >> [INFO]
> >> 
> 
> >> [INFO] BUILD SUCCESS
> >> [INFO]
> >> 
> 
> >> [INFO] Total time: 16:51 min
> >> [INFO] Finished at: 2017-05-09T17:51:04+09:00
> >> [INFO] Final Memory: 53M/514M
> >> [INFO]
> >> 
> 
> >> [WARNING] The requested profile "hive" could not be activated because it
> >> does not exist.
> >>
> >>
> >> Kazuaki Ishizaki,
> >>
> >>
> >>
> >> From:Michael Armbrust <mich...@databricks.com>
> >> To:"dev@spark.apache.org" <dev@spark.apache.org>
> >> Date:2017/05/05 02:08
> >> Subject:[VOTE] Apache Spark 2.2.0 (RC2)
> >> 
> >>
> >>
> >>
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and
> passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.2.0
> >> [ ] -1 Do not release this package because ...
> >>
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.2.0-rc2
> >> (1d4017b44d5e6ad156abeaae6371747f111dd1f9)
> >>
> >> List of JIRA tickets resolved can be found with this filter.
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1236/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/
> >>
> >>
> >> FAQ
> >>
> >> How can I help test this release?
> >>
> >> If you are a Spark user, you can help us test this release by taking an
> >> existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> What should happen to JIRA tickets still targeting 2.2.0?
> >>
> >> Committers should look at those and triage. Extremely important bug
> fixes,
> >> documentation, and API tweaks that impact compatibility should be
> worked on
> >> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
> >>
> >> But my bug isn't fixed!??!
> >>
> >> In order to make timely releases, we will typically not hold the release
> >> unless the bug in question is a regression from 2.1.1.
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-08 Thread Joseph Bradley
I'll work on resolving some of the ML QA blockers this week, but it'd be
great to get help.  *@committers & contributors who work on ML*, many of
you have helped in the past, so please help take QA tasks wherever
possible.  (Thanks Yanbo & Felix for jumping in already.)  Anyone is
welcome to chip in of course!
Joseph

On Thu, May 4, 2017 at 4:09 PM, Sean Owen <so...@cloudera.com> wrote:

> The tests pass, licenses are OK, sigs, etc. I'd endorse it but we do still
> have blockers, so I assume people mean we need there will be another RC at
> some point.
>
> Blocker
> SPARK-20503 ML 2.2 QA: API: Python API coverage
> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-20509 SparkR 2.2 QA: New R APIs and API docs
> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
>
> Critical
> SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
> SPARK-20520 R streaming tests failed on Windows
> SPARK-18891 Support for specific collection types
> SPARK-20505 ML, Graph 2.2 QA: Update user guide for new features & APIs
> SPARK-20364 Parquet predicate pushdown on columns with dots return empty
> results
> SPARK-20508 Spark R 2.2 QA umbrella
> SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes
> updates
> SPARK-20513 Update SparkR website for 2.2
> SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
> SPARK-20507 Update MLlib, GraphX websites for 2.2
> SPARK-20506 ML, Graph 2.2 QA: Programming guide update and migration guide
> SPARK-19690 Join a streaming DataFrame with a batch DataFrame may not work
> SPARK-7768 Make user-defined type (UDT) API public
> SPARK-4502 Spark SQL reads unneccesary nested fields from Parquet
> SPARK-17626 TPC-DS performance improvements using star-schema heuristics
>
>
> On Thu, May 4, 2017 at 6:07 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc2
>> <https://github.com/apache/spark/tree/v2.2.0-rc2> (1d4017b44d5e6ad
>> 156abeaae6371747f111dd1f9)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1236/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
That's very fair.

For my part, I should have been faster to make these JIRAs and get critical
dev community QA started when the branch was cut last week.

On Thu, Apr 27, 2017 at 2:59 PM, Sean Owen <so...@cloudera.com> wrote:

> That makes sense, but we have an RC, not just a branch. I think we've
> followed the pattern in http://spark.apache.org/versioning-policy.html in
> the past. This generally comes before an RC, right, because until
> everything that Must Happen before a release has happened, someone's saying
> the RC can't possibly pass. I get it, in practice, this is an "RC0" that
> can't pass (unless somehow these issues result in zero changes) and there's
> value in that anyway. Just want to see if we're on the same page about
> process, maybe even just say this is how we manage releases, with "RCs"
> starting before QA ends.
>
> On Thu, Apr 27, 2017 at 10:36 PM Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> This is the same thing as ever for MLlib: Once a branch has been cut, we
>> stop merging features.  Now that features are not being merged, we can
>> begin QA.  I strongly prefer to track QA work in JIRA and to have those
>> items targeted for 2.2.  I also believe that certain QA tasks should be
>> blockers; e.g., if we have not checked for binary or Java compatibility
>> issues in new APIs, then I am not comfortable signing off on a release.  I
>> agree with Michael that these don't block testing on a release; the point
>> of these issues is to do testing.
>>
>> I'll close the roadmap JIRA though.
>>
>> On Thu, Apr 27, 2017 at 1:49 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>
>>> All of those look like QA or documentation, which I don't think needs to
>>> block testing on an RC (and in fact probably needs an RC to test?).
>>> Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
>>> going to pass, but I wanted to get the ball rolling on testing 2.2.
>>>
>>> On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> These are still blockers for 2.2:
>>>>
>>>> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
>>>> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
>>>> SPARK-20503 ML 2.2 QA: API: Python API coverage
>>>> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
>>>> sealed audit
>>>> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
>>>> SPARK-18813 MLlib 2.2 Roadmap
>>>>
>>>> Joseph you opened most of these just now. Is this an "RC0" we know
>>>> won't pass? or, wouldn't we normally cut an RC after those things are 
>>>> ready?
>>>>
>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc1
>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c
>>>>> 1a8f8966c7e64010cf5632cb6)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/
>>>>> orgapachespark-1235/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-
>>>>> 2.2.0-

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
This is the same thing as ever for MLlib: Once a branch has been cut, we
stop merging features.  Now that features are not being merged, we can
begin QA.  I strongly prefer to track QA work in JIRA and to have those
items targeted for 2.2.  I also believe that certain QA tasks should be
blockers; e.g., if we have not checked for binary or Java compatibility
issues in new APIs, then I am not comfortable signing off on a release.  I
agree with Michael that these don't block testing on a release; the point
of these issues is to do testing.

I'll close the roadmap JIRA though.

On Thu, Apr 27, 2017 at 1:49 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> All of those look like QA or documentation, which I don't think needs to
> block testing on an RC (and in fact probably needs an RC to test?).
> Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
> going to pass, but I wanted to get the ball rolling on testing 2.2.
>
> On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> These are still blockers for 2.2:
>>
>> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
>> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
>> SPARK-20503 ML 2.2 QA: API: Python API coverage
>> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
>> sealed audit
>> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
>> SPARK-18813 MLlib 2.2 Roadmap
>>
>> Joseph you opened most of these just now. Is this an "RC0" we know won't
>> pass? or, wouldn't we normally cut an RC after those things are ready?
>>
>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc1
>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c
>>> 1a8f8966c7e64010cf5632cb6)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Pull Request Made, Ignored So Far

2017-03-31 Thread Joseph Bradley
Hi John,

Thanks for pinging about this.  It does look useful, but I'll admit 3 days
isn't a long time since there are so many (hundreds) of open PRs.  I'll see
if I can take a look soon (or others should feel free to as well).  I'd
also recommend checking out the surrounding code and pinging the
contributors or committers who have worked on it to grab their attention &
early feedback.

Thanks!
Joseph

On Fri, Mar 31, 2017 at 7:37 AM, John Compitello <jo...@broadinstitute.org>
wrote:

> Hi all,
>
> I’m a new Spark contributor who put in a pull request a few days ago:
> https://github.com/apache/spark/pull/17459
>
> It’s a relatively small, isolated change that should be pretty simple to
> review. It has been a big help in the main project I’m working on (
> https://github.com/hail-is/hail <https://hail.is/>) so I wanted to
> contribute it back to main Spark. It’s been a few days though, and I
> haven’t had my branch cleared to run tests or any acknowledgement of it
> at all. Is there any process to ask someone to review your PR or get it
> assigned to someone? I’m afraid it’s just going to slowly sink down onto
> later and later pages in the PR list until it’s too deep for anyone to be
> expected to find otherwise.
>
> Best,
>
> John
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: SPIP docs are live

2017-03-16 Thread Joseph Bradley
Awesome!  Thanks for pushing this through, Cody.
Joseph

On Sun, Mar 12, 2017 at 1:18 AM, Sean Owen <so...@cloudera.com> wrote:

> http://spark.apache.org/improvement-proposals.html
>
> (Thanks Cody!)
>
> We should use this process where appropriate now, and we can refine it
> further if needed.
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Question on Spark's graph libraries roadmap

2017-03-15 Thread Joseph Bradley
>, spark users <
>>>> u...@spark.apache.org>
>>>>
>>>> +1
>>>>
>>>> Regards,
>>>> _
>>>> *Md. Rezaul Karim*, BSc, MSc
>>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>>> National University of Ireland, Galway
>>>> IDA Business Park, Dangan, Galway, Ireland
>>>> Web: http://www.reza-analytics.eu/index.html
>>>> <http://139.59.184.114/index.html>
>>>>
>>>> On 10 March 2017 at 12:10, Robin East <robin.e...@xense.co.uk> wrote:
>>>>
>>>> I would love to know the answer to that too.
>>>> 
>>>> ---
>>>> Robin East
>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 9 Mar 2017, at 17:42, enzo <e...@smartinsightsfromdata.com> wrote:
>>>>
>>>> I am a bit confused by the current roadmap for graph and graph
>>>> analytics in Apache Spark.
>>>>
>>>> I understand that we have had for some time two libraries (the
>>>> following is my understanding - please amend as appropriate!):
>>>>
>>>> . GraphX, part of Spark project.  This library is based on RDD and it
>>>> is only accessible via Scala.  It doesn’t look like this library has been
>>>> enhanced recently.
>>>> . GraphFrames, independent (at the moment?) library for Spark.  This
>>>> library is based on Spark DataFrames and accessible by Scala & Python. Last
>>>> commit on GitHub was 2 months ago.
>>>>
>>>> GraphFrames came about with the promise at some point to be integrated
>>>> in Apache Spark.
>>>>
>>>> I can see other projects coming up with interesting libraries and ideas
>>>> (e.g. Graphulo on Accumulo, a new project with the goal of
>>>> implementing the GraphBlas building blocks for graph algorithms on top
>>>> of Accumulo).
>>>>
>>>> Where is Apache Spark going?
>>>>
>>>> Where are graph libraries in the roadmap?
>>>>
>>>>
>>>>
>>>> Thanks for any clarity brought to this matter.
>>>>
>>>> Enzo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Spark Improvement Proposals

2017-02-24 Thread Joseph Bradley
> >>> >> or
> >>> >> different thing, that is supposed to solve some problems that aren't
> >>> >> otherwise solvable. I see mentioned problems like: clear process for
> >>> >> managing work, public communication, more committers, some sort of
> >>> >> binding
> >>> >> outcome and deadline.
> >>> >>
> >>> >> If SPIP is supposed to be a way to make people design in public and
> a
> >>> >> way to
> >>> >> force attention to a particular change, then, this doesn't do that
> by
> >>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
> >>> >> detract
> >>> >> from the discussion about doing what SPIP implies. It's just a
> process
> >>> >> document.
> >>> >>
> >>> >> Still, a fine step IMHO.
> >>> >>
> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> Updated. Any feedback from other community members?
> >>> >>>
> >>> >>>
> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
> c...@koeninger.org>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> Thanks for doing that.
> >>> >>>>
> >>> >>>> Given that there are at least 4 different Apache voting processes,
> >>> >>>> "typical Apache vote process" isn't meaningful to me.
> >>> >>>>
> >>> >>>> I think the intention is that in order to pass, it needs at least
> 3
> >>> >>>> +1
> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
> the
> >>> >>>> document
> >>> >>>> doesn't explicitly say that second part.
> >>> >>>>
> >>> >>>> There's also no mention of the duration a vote should remain open.
> >>> >>>> There's a mention of a month for finding a shepherd, but that's
> >>> >>>> different.
> >>> >>>>
> >>> >>>> Other than that, LGTM.
> >>> >>>>
> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com
> >
> >>> >>>> wrote:
> >>> >>>>>
> >>> >>>>> Here's a new draft that incorporated most of the feedback:
> >>> >>>>>
> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
> >>> >>>>>
> >>> >>>>> I added a specific role for SPIP Author and another one for SPIP
> >>> >>>>> Shepherd.
> >>> >>>>>
> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com>
> >>> >>>>> wrote:
> >>> >>>>>>
> >>> >>>>>> During the summit, I also had a lot of discussions over similar
> >>> >>>>>> topics
> >>> >>>>>> with multiple Committers and active users. I heard many
> fantastic
> >>> >>>>>> ideas. I
> >>> >>>>>> believe Spark improvement proposals are good channels to collect
> >>> >>>>>> the
> >>> >>>>>> requirements/designs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> IMO, we also need to consider the priority when working on these
> >>> >>>>>> items.
> >>> >>>>>> Even if the proposal is accepted, it does not mean it will be
> >>> >>>>>> implemented
> >>> >>>>>> and merged immediately. It is not a FIFO queue.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
> >>> >>>>>> them
> >>> >>>>>> back, if the design and implementation are not reviewed
> carefully.
> >>> >>>>>> We have
> >>> >>>>>> to ensure our quality. Spark is not an application software. It
> is
> >>> >>>>>> an
> >>> >>>>>> infrastructure software that is being used by many many
> companies.
> >>> >>>>>> We have
> >>> >>>>>> to be very careful in the design and implementation, especially
> >>> >>>>>> adding/changing the external APIs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> When I developed the Mainframe infrastructure/middleware
> software
> >>> >>>>>> in
> >>> >>>>>> the past 6 years, I were involved in the discussions with
> >>> >>>>>> external/internal
> >>> >>>>>> customers. The to-do feature list was always above 100.
> Sometimes,
> >>> >>>>>> the
> >>> >>>>>> customers are feeling frustrated when we are unable to deliver
> >>> >>>>>> them on time
> >>> >>>>>> due to the resource limits and others. Even if they paid us
> >>> >>>>>> billions, we
> >>> >>>>>> still need to do it phase by phase or sometimes they have to
> >>> >>>>>> accept the
> >>> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Thanks,
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Xiao Li
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>
> >>> >
> >>> > 
> -
> >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >
> >
> >
> >
> > --
> > Regards,
> > Vaquar Khan
> > +1 -224-436-0783
> >
> > IT Architect / Lead Consultant
> > Greater Chicago
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Joseph Bradley
+1 for Nick's comment about discussing APIs which need to be made public in
https://issues.apache.org/jira/browse/SPARK-19498 !
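
For anyone hitting the same wall, the thin-bridge approach Steve describes
below looks roughly like this. It is only a sketch, using the vendor-named
sub-package he suggests, and it relies on private[spark] members being
visible from any package under org.apache.spark:

// Hypothetical vendor sub-package so stack traces are easy to attribute.
package org.apache.spark.microsoft

import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.sql.types.DataType

// Re-exposes just the private[spark] pieces the external library needs;
// the library itself stays in its own namespace.
object SparkPrivateBridge {
  def vectorUDT: DataType = new VectorUDT()
}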

On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 22 Feb 2017, at 20:51, Shouheng Yi <sho...@microsoft.com.INVALID>
> wrote:
>
> Hi Spark developers,
>
> Currently my team at Microsoft is extending Spark’s machine learning
> functionalities to include new learners and transformers. We would like
> users to use these within spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the apache namespace because it’s confusing and there could be naming
> rights issues.
>
>
> This isn't actually something the ASF has a strong stance against; it's more
> left to projects themselves. After all: the source is licensed by the ASF, and the
> license doesn't say you can't.
>
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> hive team kept stuff package private. Though that's really a sign that
> things could be improved there.
>
> Where it is problematic is that stack traces end up blaming the wrong group;
> nobody likes getting a bug report which doesn't actually exist in your
> codebase., not least because you have to waste time to even work it out.
>
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk
>
> Apache Bahir does put some stuff into org.apache.spark.stream, but they've
> sort of inherited that right when they picked up the code from Spark. New
> stuff is going into org.apache.bahir.
>
>
> We need to extend several classes from Spark which happen to have
> “private[spark].” For example, one of our classes extends VectorUDT[0], which
> is declared as private[spark] class VectorUDT. This
> unfortunately puts us in a strange scenario that forces us to work under the
> namespace org.apache.spark.
>
> To be specific, currently the private classes/traits we need to use to
> create new Spark learners & Transformers are HasInputCol, VectorUDT and
> Logging. We will expand this list as we develop more.
>
>
> I do think it's a shame that Logging went from public to private.
>
> One thing that could be done there is to copy the logging into Bahir,
> under an org.apache.bahir package, for yourself and others to use. That'd
> be beneficial to me too.
>
> For the ML stuff, that might be place to work too, if you are going to
> open source the code.
>
>
>
> Is there a way to avoid this namespace issue? What do other
> people/companies do in this scenario? Thank you for your help!
>
>
> I've hit this problem in the past.  Scala code tends to force your hand
> here precisely because of that (very nice) private feature. While it offers
> the ability of a project to guarantee that implementation details aren't
> picked up where they weren't intended to be, in OSS dev, all that
> implementation is visible and for lower level integration,
>
> What I tend to do is keep my own code in its own package and try to do as
> thin a bridge over to it from the [private] scope. It's also important to
> name things obviously, say, org.apache.spark.microsoft, so stack traces
> in bug reports can be dealt with more easily
>
>
> [0]: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/
> apache/spark/ml/linalg/VectorUDT.scala
>
> Best,
> Shouheng
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-15 Thread Joseph Bradley
Congrats and welcome!

On Mon, Feb 13, 2017 at 6:54 PM, Takuya UESHIN <ues...@happy-camper.st>
wrote:

> Thank you very much everyone!
> I really look forward to working with you!
>
>
> On Tue, Feb 14, 2017 at 9:47 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Congratulations!
>>
>> On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>> wrote:
>>
>>> Congrats!
>>>
>>> Kazuaki Ishizaki
>>>
>>>
>>>
>>> From:Reynold Xin <r...@databricks.com>
>>> To:"dev@spark.apache.org" <dev@spark.apache.org>
>>> Date:2017/02/14 04:18
>>> Subject:welcoming Takuya Ueshin as a new Apache Spark committer
>>> --
>>>
>>>
>>>
>>> Hi all,
>>>
>>> Takuya-san has recently been elected an Apache Spark committer. He's
>>> been active in the SQL area and writes very small, surgical patches that
>>> are high quality. Please join me in congratulating Takuya-san!
>>>
>>>
>>>
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


PSA: Java 8 unidoc build

2017-02-06 Thread Joseph Bradley
Public service announcement: Our doc build has worked with Java 8 for brief
time periods, but new changes keep breaking the Java 8 unidoc build.
Please be aware of this, and try to test doc changes with Java 8!  In
general, it is stricter than Java 7 for docs.

A shout out to @HyukjinKwon and others who have made many fixes for this!
See these sample PRs for some issues causing failures (especially around
links):
https://github.com/apache/spark/pull/16741
https://github.com/apache/spark/pull/16604

Thanks,
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Feedback on MLlib roadmap process proposal

2017-01-26 Thread Joseph Bradley
to
> add. The original PR was abandoned by the author and nobody else submitted
> one -- despite the Votes. I hesitate to signal that no PRs would be
> considered, but, doesn't seem like it's in demand enough for someone to
> work on?
>
>
> I think one of my messages is that, de facto, here, like in many Apache
> projects, committers do not take requests. They pursue the work they
> believe needs doing, and shepherd work initiated by others (a clear bug
> report, a PR) to a resolution. Things get done by doing them, or by
> building influence by doing other things the project needs doing. It isn't
> a mechanical, objective process, and can't be. But it does work in a
> recognizable way.
>
>>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: MLlib mission and goals

2017-01-24 Thread Joseph Bradley
*Re: performance measurement framework*
We (Databricks) used to use spark-perf
<https://github.com/databricks/spark-perf>, but that was mainly for the
RDD-based API.  We've now switched to spark-sql-perf
<https://github.com/databricks/spark-sql-perf>, which does include some ML
benchmarks despite the project name.  I'll see about updating the project
README to document how to run MLlib tests.
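
Also, to make the compute-intensity point in the quoted message below
concrete, here is a rough back-of-the-envelope sketch (assuming square
n-by-n double-precision operands and counting only reads and writes of the
matrices and vectors involved):

// Approximate flops per word of memory traffic for two kernels.
val n = 10000.0
val gemvIntensity = (2.0 * n * n) / (n * n + 2.0 * n)  // BLAS2: ~2 flops/word
val gemmIntensity = (2.0 * n * n * n) / (4.0 * n * n)  // BLAS3: ~n/2 flops/word
println(f"GEMV ~ $gemvIntensity%.1f flops/word, GEMM ~ $gemmIntensity%.1f flops/word")

The BLAS1 routines listed below (AXPY, DOT, SCAL, NRM2) sit at roughly one
flop per word or less, which is why they stay bandwidth-bound on any
processor, while a blocked BLAS3 kernel can approach peak floating-point
throughput.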


On Tue, Jan 24, 2017 at 6:02 PM, bradc <brad.carl...@oracle.com> wrote:

> I believe one of the higher level goals of Spark MLlib should be to
> improve the efficiency of the ML algorithms that already exist. Currently
> there ML has a reasonable coverage of the important core algorithms. The
> work to get to feature parity for DataFrame-based API and model persistence
> are also important.
>
> Apache Spark needs to use higher-level BLAS3 and LAPACK routines, instead
> of BLAS1 & BLAS2. For a long time we've used the concept of compute
> intensity (compute_intensity = FP_operations/Word) to help look at the
> performance of the underling compute kernels (see the papers referenced
> below). It has been proven in many implementations that performance,
> scalability, and huge reduction in memory pressure can be achieved by using
> higher-level BLAS3 or LAPACK routines in both single node as well as
> distributed computations.
>
> I performed a survey of some of Apache Spark's ML algorithms.
> Unfortunately most of the ML algorithms are implemented with BLAS1 or BLAS2
> routines which have very low compute intensity. BLAS2 and BLAS1 routines
> require a lot more memory bandwidth and will not achieve peak performance
> on x86, GPUs, or any other processor.
>
> Apache Spark 2.1.0 ML routines & BLAS Routines
>
> ALS (Alternating Least Squares matrix factorization)
>
>- BLAS2: _SPR, _TPSV
>- BLAS1: _AXPY, _DOT, _SCAL, _NRM2
>
> Logistic regression classification
>
>- BLAS2: _GEMV
>- BLAS1: _DOT, _SCAL
>
> Generalized linear regression
>
>- BLAS1: _DOT
>
> Gradient-boosted tree regression
>
>- BLAS1: _DOT
>
> GraphX SVD++
>
>- BLAS1: _AXPY, _DOT,_SCAL
>
> Neural Net Multi-layer Perceptron
>
>- BLAS3: _GEMM
>- BLAS2: _GEMV
>
> Only the Neural Net Multi-layer Perceptron uses BLAS3 matrix multiply
> (DGEMM). BTW the underscores are replaced by S, D, C, Z for 32-bit real,
> 64-bit real, 32-bit complex, and 64-bit complex operations, respectively.
>
> Refactoring the algorithms to use BLAS3 routines or higher level LAPACK
> routines will require coding changes to use sub-block algorithms but the
> performance benefits can be great.
>
> More at: https://blogs.oracle.com/BestPerf/entry/improving_
> algorithms_in_spark_ml
> Background:
>
> Brad Carlile. Parallelism, compute intensity, and data vectorization.
> SuperComputing'93, November 1993.
> <https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf>
>
> John McCalpin. Memory Bandwidth and Machine Balance in Current High
> Performance Computers. 1995
> <https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers>
>
> --
> View this message in context: Re: MLlib mission and goals
> <http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-mission-and-goals-tp20715p20754.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Joseph Bradley
Congratulations Burak & Holden!

On Tue, Jan 24, 2017 at 10:33 AM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Great! Congratulations, Burak and Holden.
>
> Bests,
> Dongjoon.
>
> On 2017-01-24 10:29 (-0800), Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
> >  
> >
> > Congratulations, Burak and Holden.
> >
> > On Tue, Jan 24, 2017 at 1:27 PM Russell Spitzer <
> russell.spit...@gmail.com>
> > wrote:
> >
> > > Great news! Congratulations!
> > >
> > > On Tue, Jan 24, 2017 at 10:25 AM Dean Wampler <deanwamp...@gmail.com>
> > > wrote:
> > >
> > > Congratulations to both of you!
> > >
> > > dean
> > >
> > > *Dean Wampler, Ph.D.*
> > > Author: Programming Scala, 2nd Edition
> > > <http://shop.oreilly.com/product/0636920033073.do>, Fast Data
> > > Architectures for Streaming Applications
> > > <http://www.oreilly.com/data/free/fast-data-architectures-
> for-streaming-applications.csp>,
> > > Functional Programming for Java Developers
> > > <http://shop.oreilly.com/product/0636920021667.do>, and Programming
> Hive
> > > <http://shop.oreilly.com/product/0636920023555.do> (O'Reilly)
> > > Lightbend <http://lightbend.com>
> > > @deanwampler <http://twitter.com/deanwampler>
> > > http://polyglotprogramming.com
> > > https://github.com/deanwampler
> > >
> > > On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li <gatorsm...@gmail.com> wrote:
> > >
> > > Congratulations! Burak and Holden!
> > >
> > > 2017-01-24 10:13 GMT-08:00 Reynold Xin <r...@databricks.com>:
> > >
> > > Hi all,
> > >
> > > Burak and Holden have recently been elected as Apache Spark committers.
> > >
> > > Burak has been very active in a large number of areas in Spark,
> including
> > > linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> > > DataFrames, dstream, and most recently Structured Streaming.
> > >
> > > Holden has been a long time Spark contributor and evangelist. She has
> > > written a few books on Spark, as well as frequent contributions to the
> > > Python API to improve its usability and performance.
> > >
> > > Please join me in welcoming the two!
> > >
> > >
> > >
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


MLlib mission and goals

2017-01-23 Thread Joseph Bradley
This thread is split off from the "Feedback on MLlib roadmap process
proposal" thread for discussing the high-level mission and goals for
MLlib.  I hope this thread will collect feedback and ideas, not necessarily
lead to huge decisions.

Copying from the previous thread:

*Seth:*
"""
I would love to hear some discussion on the higher level goal of Spark
MLlib (if this derails the original discussion, please let me know and we
can discuss in another thread). The roadmap does contain specific items
that help to convey some of this (ML parity with MLlib, model persistence,
etc...), but I'm interested in what the "mission" of Spark MLlib is. We
often see PRs for brand new algorithms which are sometimes rejected and
sometimes not. Do we aim to keep implementing more and more algorithms? Or
is our focus really, now that we have a reasonable library of algorithms,
to simply make the existing ones faster/better/more robust? Should we aim
to make interfaces that are easily extended for developers to easily
implement their own custom code (e.g. custom optimization libraries), or do
we want to restrict things to out-of-the-box algorithms? Should we focus on
more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.
"""

*Mingjie:*
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past *trajectory*:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and
making the library more robust.

I agree with Seth that a few *immediate goals* are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity

*In the future*, it's harder to say, but if I had to pick my top 2 items,
I'd list:

*(1) Making MLlib more extensible*
It will not be feasible to support a huge number of algorithms, so allowing
users to customize their ML on Spark workflows will be critical.  This is
IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and
we will need to make it easier for users to write their own algorithms and
packages to facilitate this.  Part of this could be allowing users to
customize existing algorithms with custom loss functions, etc.

*(2) Consistent improvements to core algorithms*
A less exciting but still very important item will be constantly improving
the core set of algorithms in MLlib. This could mean speed, scaling,
robustness, and usability for the few algorithms which cover 90% of use
cases.

There are plenty of other possibilities, and it will be great to hear the
community's thoughts!

Thanks,
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
Hi Seth,

The proposal is geared towards exactly the issue you're describing:
providing more visibility into the capacity and intentions of committers.
If there are things you'd add to it or change to improve further, it would
be great to hear ideas!  The past roadmap JIRA has some more background
discussion which is worth looking at too.

Let's break off the MLlib mission discussion into another thread.  I'll
start one now.

Thanks,
Joseph

On Thu, Jan 19, 2017 at 1:51 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Hi Seth
>
> Re: "The most important thing we can do, given that MLlib currently has a
> very limited committer review bandwidth, is to make clear issues that, if
> worked on, will definitely get reviewed. "
>
> We are adopting a Shepherd model, as described in the JIRA Joseph has, in
> which, when assigned, the Shepherd will see it through with the contributor
> to make sure it lands with the target release.
>
> I'm sure Joseph can explain it better than I do ;)
>
>
> _
> From: Mingjie Tang <tangr...@gmail.com>
> Sent: Thursday, January 19, 2017 10:30 AM
> Subject: Re: Feedback on MLlib roadmap process proposal
> To: Seth Hendrickson <seth.hendrickso...@gmail.com>
> Cc: Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>
>
>
>
> +1 general abstractions like distributed linear algebra.
>
> On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <
> seth.hendrickso...@gmail.com> wrote:
>
>> I think the proposal laid out in SPARK-18813 is well done, and I do think
>> it is going to improve the process going forward. I also really like the
>> idea of getting the community to vote on JIRAs to give some of them
>> priority - provided that we listen to those votes, of course. The biggest
>> problem I see is that we do have several active contributors and those who
>> want to help implement these changes, but PRs are reviewed rather
>> sporadically and I imagine it is very difficult for contributors to
>> understand why some get reviewed and some do not. The most important thing
>> we can do, given that MLlib currently has a very limited committer review
>> bandwidth, is to make clear issues that, if worked on, will definitely get
>> reviewed. A hard thing to do in open source, no doubt, but even if we have
>> to limit the scope of such issues to a very small subset, it's a gain for
>> all I think.
>>
>> On a related note, I would love to hear some discussion on the higher
>> level goal of Spark MLlib (if this derails the original discussion, please
>> let me know and we can discuss in another thread). The roadmap does contain
>> specific items that help to convey some of this (ML parity with MLlib,
>> model persistence, etc...), but I'm interested in what the "mission" of
>> Spark MLlib is. We often see PRs for brand new algorithms which are
>> sometimes rejected and sometimes not. Do we aim to keep implementing more
>> and more algorithms? Or is our focus really, now that we have a reasonable
>> library of algorithms, to simply make the existing ones faster/better/more
>> robust? Should we aim to make interfaces that are easily extended for
>> developers to easily implement their own custom code (e.g. custom
>> optimization libraries), or do we want to restrict things to out-of-the-box
>> algorithms? Should we focus on more flexible, general abstractions like
>> distributed linear algebra?
>>
>> I was not involved in the project in the early days of MLlib when this
>> discussion may have happened, but I think it would be useful to either
>> revisit it or restate it here for some of the newer developers.
>>
>> On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is a general call for thoughts about the process for the MLlib
>>> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>>>
>>> Summary:
>>> * This process is about committers indicating intention to shepherd and
>>> review.
>>> * The goal is to improve visibility and communication.
>>> * This is fairly orthogonal to the SIP discussion since this proposal is
>>> more about setting release targets than about proposing future plans.
>>>
>>> Thanks!
>>> Joseph
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Feedback on MLlib roadmap process proposal

2017-01-17 Thread Joseph Bradley
Hi all,

This is a general call for thoughts about the process for the MLlib roadmap
proposed in SPARK-18813.  See the section called "Roadmap process."

Summary:
* This process is about committers indicating intention to shepherd and
review.
* The goal is to improve visibility and communication.
* This is fairly orthogonal to the SIP discussion since this proposal is
more about setting release targets than about proposing future plans.

Thanks!
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: ml word2vec finSynonyms return type

2017-01-05 Thread Joseph Bradley
We returned a DataFrame since it is a nicer API, but I agree forcing RDD
operations is not ideal.  I'd be OK with adding a new method, but I agree
with Felix that we cannot break the API for something like this.
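
For context, with the current signature a caller who just wants the values
back on the driver typically ends up writing something like this (a sketch,
assuming a fitted spark.ml Word2VecModel named model):

// A DataFrame comes back, so driver-side values require a collect.
val synonyms: Array[(String, Double)] =
  model.findSynonyms("spark", 5)
    .collect()
    .map(row => (row.getString(0), row.getDouble(1)))

whereas an Array-returning variant like the one proposed below would hand
those values back directly.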

On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Given how Word2Vec is used the pipeline model in the new ml
> implementation, we might need to keep the current behavior?
>
>
> https://github.com/apache/spark/blob/master/examples/
> src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>
>
> _
> From: Asher Krim <ak...@hubspot.com>
> Sent: Tuesday, January 3, 2017 11:58 PM
> Subject: Re: ml word2vec finSynonyms return type
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
> jos...@databricks.com>, <dev@spark.apache.org>
>
>
>
> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>
> Adding new methods could result in method clutter. Changing behavior of
> non-experimental classes is unfortunate (ml Word2Vec was marked
> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
> would rather change the existing methods to keep the class simpler moving
> forward.
>
>
> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Could you link to the JIRA here?
>>
>> What you suggest makes sense to me. Though we might want to maintain
>> compatibility and add a new method instead of changing the return type of
>> the existing one.
>>
>>
>> _
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Wednesday, December 28, 2016 11:52 AM
>> Subject: ml word2vec finSynonyms return type
>> To: <dev@spark.apache.org>
>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
>> jos...@databricks.com>
>>
>>
>>
>> Hey all,
>>
>> I would like to propose changing the return type of `findSynonyms` in
>> ml's Word2Vec
>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>> :
>>
>> def findSynonyms(word: String, num: Int): DataFrame = {
>>   val spark = SparkSession.builder().getOrCreate()
>>   spark.createDataFrame(wordVectors.findSynonyms(word,
>> num)).toDF("word", "similarity")
>> }
>>
>> I find it very strange that the results are parallelized before being
>> returned to the user. The results are already on the driver to begin with,
>> and I can imagine that for most usecases (and definitely for ours) the
>> synonyms are collected right back to the driver. This incurs both an added
>> cost of shipping data to and from the cluster, as well as a more cumbersome
>> interface than needed.
>>
>> Can we change it to just the following?
>>
>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>   wordVectors.findSynonyms(word, num)
>> }
>>
>> If the user wants the results parallelized, they can still do so on their
>> own.
>>
>> (I had brought this up a while back in Jira. It was suggested that the
>> mailing list would be a better forum to discuss it, so here we are.)
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Spark Improvement Proposals

2017-01-03 Thread Joseph Bradley
> >> >>> >>> >>> think it is time to bring a messaging model in conjunction
> >> >>> >>> >>> with
> >> >>> >>> >>> the
> >> >>> >>> >>> batch/micro-batch API that Spark is good atakka-streams
> >> >>> >>> >>> close
> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
> great
> >> >>> >>> >>> direction to
> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
> >> >>> >>> >>> streaming
> >> >>> >>> >>> with
> >> >>> >>> >>> batch with the assumption is that micro-batching is
> sufficient
> >> >>> >>> >>> to
> >> >>> >>> >>> run
> >> >>> >>> >>> SQL
> >> >>> >>> >>> commands on stream but do we really have time to do SQL
> >> >>> >>> >>> processing at
> >> >>> >>> >>> streaming data within 1-2 seconds ?
> >> >>> >>> >>>
> >> >>> >>> >>> After reading the email chain, I started to look into Flink
> >> >>> >>> >>> documentation
> >> >>> >>> >>> and if you compare it with Spark documentation, I think we
> >> >>> >>> >>> have
> >> >>> >>> >>> major
> >> >>> >>> >>> work
> >> >>> >>> >>> to do detailing out Spark internals so that more people from
> >> >>> >>> >>> community
> >> >>> >>> >>> start
> >> >>> >>> >>> to take active role in improving the issues so that Spark
> >> >>> >>> >>> stays
> >> >>> >>> >>> strong
> >> >>> >>> >>> compared to Flink.
> >> >>> >>> >>>
> >> >>> >>> >>>
> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/
> Spark+Internals
> >> >>> >>> >>>
> >> >>> >>> >>>
> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/
> Flink+Internals
> >> >>> >>> >>>
> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
> >> >>> >>> >>> batch...We
> >> >>> >>> >>> (and
> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for
> >> >>> >>> >>> stream
> >> >>> >>> >>> and
> >> >>> >>> >>> query
> >> >>> >>> >>> processing.we need to make it a state-of-the-art engine
> >> >>> >>> >>> for
> >> >>> >>> >>> high
> >> >>> >>> >>> speed
> >> >>> >>> >>> streaming data and user queries as well !
> >> >>> >>> >>>
> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >> >>> >>> >>> <tomasz.gaw...@outlook.com>
> >> >>> >>> >>> wrote:
> >> >>> >>> >>>>
> >> >>> >>> >>>> Hi everyone,
> >> >>> >>> >>>>
> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions
> may
> >> >>> >>> >>>> help a
> >> >>> >>> >>>> little bit. :) Many technical and organizational topics
> were
> >> >>> >>> >>>> mentioned,
> >> >>> >>> >>>> but I want to focus on these negative posts about Spark and
> >> >>> >>> >>>> about
> >> >>> >>> >>>> "haters"
> >> >>> >>> >>>>
> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
> community
> >> >>> >>> >>>> -
> >> >>> >>> >>>> it's
> >> >>> >>> >>>> everything here. But Every project has to "flight" on
> >> >>> >>> >>>> "framework
> >> >>> >>> >>>> market"
> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
> >> >>> >>> >>>> communities,
> >> >>> >>> >>>> maybe my mail will inspire someone :)
> >> >>> >>> >>>>
> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough
> time
> >> >>> >>> >>>> to
> >> >>> >>> >>>> join
> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why are
> >> >>> >>> >>>> some
> >> >>> >>> >>>> people
> >> >>> >>> >>>> saying that Flink (or other framework) is better, like it
> was
> >> >>> >>> >>>> posted
> >> >>> >>> >>>> in
> >> >>> >>> >>>> this mailing list? No, not because that framework is better
> >> >>> >>> >>>> in
> >> >>> >>> >>>> all
> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
> >> >>> >>> >>>> started
> >> >>> >>> >>>> after
> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
> >> >>> >>> >>>> "Flink
> >> >>> >>> >>>> vs
> >> >>> >>> >>>> "
> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
> >> >>> >>> >>>> sometimes
> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
> >> >>> >>> >>>> PMC's)
> >> >>> >>> >>>> are
> >> >>> >>> >>>> just posting same information about real-time streaming,
> >> >>> >>> >>>> about
> >> >>> >>> >>>> delta
> >> >>> >>> >>>> iterations, etc. It look smart and very often it is marked
> as
> >> >>> >>> >>>> an
> >> >>> >>> >>>> aswer,
> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
> >> >>> >>> >>>> perform
> >> >>> >>> >>>> huge
> >> >>> >>> >>>> performance test. Maybe some company, that supports Spark
> >> >>> >>> >>>> (Databricks,
> >> >>> >>> >>>> Cloudera? - just saying you're most visible in community
> :) )
> >> >>> >>> >>>> could
> >> >>> >>> >>>> perform performance test of:
> >> >>> >>> >>>>
> >> >>> >>> >>>> - streaming engine - probably Spark will loose because of
> >> >>> >>> >>>> mini-batch
> >> >>> >>> >>>> model, however currently the difference should be much
> lower
> >> >>> >>> >>>> that in
> >> >>> >>> >>>> previous versions
> >> >>> >>> >>>>
> >> >>> >>> >>>> - Machine Learning models
> >> >>> >>> >>>>
> >> >>> >>> >>>> - batch jobs
> >> >>> >>> >>>>
> >> >>> >>> >>>> - Graph jobs
> >> >>> >>> >>>>
> >> >>> >>> >>>> - SQL queries
> >> >>> >>> >>>>
> >> >>> >>> >>>> People will see that Spark is envolving and is also a
> modern
> >> >>> >>> >>>> framework,
> >> >>> >>> >>>> because after reading posts mentioned above people may
> think
> >> >>> >>> >>>> "it
> >> >>> >>> >>>> is
> >> >>> >>> >>>> outdated, future is in framework X".
> >> >>> >>> >>>>
> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> >> >>> >>> >>>> Structured
> >> >>> >>> >>>> Streaming beats every other framework in terms of
> easy-of-use
> >> >>> >>> >>>> and
> >> >>> >>> >>>> reliability. Performance tests, done in various
> environments
> >> >>> >>> >>>> (in
> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
> >> >>> >>> >>>> 20-node
> >> >>> >>> >>>> cluster), could be also very good marketing stuff to say
> >> >>> >>> >>>> "hey,
> >> >>> >>> >>>> you're
> >> >>> >>> >>>> telling that you're better, but Spark is still faster and
> is
> >> >>> >>> >>>> still
> >> >>> >>> >>>> getting even more fast!". This would be based on facts
> (just
> >> >>> >>> >>>> numbers),
> >> >>> >>> >>>> not opinions. It would be good for companies, for marketing
> >> >>> >>> >>>> puproses
> >> >>> >>> >>>> and
> >> >>> >>> >>>> for every Spark developer
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago
> about
> >> >>> >>> >>>> real-time
> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
> >> >>> >>> >>>> should be
> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
> possible.
> >> >>> >>> >>>> Maybe
> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
> >> >>> >>> >>>> Akka?
> >> >>> >>> >>>> I
> >> >>> >>> >>>> don't
> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
> >> >>> >>> >>>> Spark
> >> >>> >>> >>>> should
> >> >>> >>> >>>> have real-time streaming support. Currently I see many
> >> >>> >>> >>>> posts/comments
> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
> >> >>> >>> >>>> very
> >> >>> >>> >>>> good
> >> >>> >>> >>>> jobs with micro-batches, however I think it is possible to
> >> >>> >>> >>>> add
> >> >>> >>> >>>> also
> >> >>> >>> >>>> more
> >> >>> >>> >>>> real-time processing.
> >> >>> >>> >>>>
> >> >>> >>> >>>> Other people said much more and I agree with proposal of
> SIP.
> >> >>> >>> >>>> I'm
> >> >>> >>> >>>> also
> >> >>> >>> >>>> happy that PMC's are not saying that they will not listen
> to
> >> >>> >>> >>>> users,
> >> >>> >>> >>>> but
> >> >>> >>> >>>> they really want to make Spark better for every user.
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> What do you think about these two topics? Especially I'm
> >> >>> >>> >>>> looking
> >> >>> >>> >>>> at
> >> >>> >>> >>>> Cody
> >> >>> >>> >>>> (who has started this topic) and PMCs :)
> >> >>> >>> >>>>
> >> >>> >>> >>>> Pozdrawiam / Best regards,
> >> >>> >>> >>>>
> >> >>> >>> >>>> Tomasz
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: mllib metrics vs ml evaluators and how to improve apis for users

2017-01-02 Thread Joseph Bradley
Hi Ilya,

Thanks for your thoughts.  Here's my understanding of where we are headed:
* We will want to move the *Metrics functionality to the spark.ml package,
as part of *Evaluator or related classes such as model/result summaries.
* It has not yet been decided if or when the spark.mllib package will be
removed.  This cannot happen until spark.ml has complete feature parity and
has been separated from spark.mllib internally for a few releases, and it
will require a community vote and significant QA.
* You're correct that Evaluators are meant for model tuning.  IMO, multiple
metrics are more naturally handled by model/result summaries, though I
could see good arguments for separating the metric computation from
models.  This is an issue which has not yet been discussed properly.  There
have also been questions about Evaluators maintaining multiple metrics
along the way during model tuning (SPARK-18704).

I created a JIRA for discussing this further:
https://issues.apache.org/jira/browse/SPARK-19053

Thanks!
Joseph
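
For illustration, the "compute the data once, expose several metrics" idea in
the proposal quoted below can be sketched locally in plain Scala (a minimal
sketch only; the RegressionEvaluation case class and evaluateMetrics helper
are hypothetical, not an existing Spark API):

case class RegressionEvaluation(mse: Double, rmse: Double, mae: Double)

// One pass over (prediction, label) pairs yields several metrics at once,
// instead of re-scanning the data separately for each metric.
def evaluateMetrics(predictionsAndLabels: Seq[(Double, Double)]): RegressionEvaluation = {
  val n = predictionsAndLabels.size.toDouble
  val (sqErrSum, absErrSum) = predictionsAndLabels.foldLeft((0.0, 0.0)) {
    case ((sq, ab), (pred, label)) =>
      val err = pred - label
      (sq + err * err, ab + math.abs(err))
  }
  val mse = sqErrSum / n
  RegressionEvaluation(mse = mse, rmse = math.sqrt(mse), mae = absErrSum / n)
}

val eval = evaluateMetrics(Seq((1.0, 1.5), (2.0, 1.8), (3.0, 3.2)))
println(s"RMSE=${eval.rmse}, MSE=${eval.mse}, MAE=${eval.mae}")

A Dataset-based version would follow the same shape, with the sums computed
by a single aggregation instead of a local fold.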

On Thu, Dec 29, 2016 at 8:36 PM, Ilya Matiach <il...@microsoft.com> wrote:

> Hi ML/MLLib developers,
>
> I'm trying to add a weights column to the ml Spark evaluators
> (RegressionEvaluator, BinaryClassificationEvaluator,
> MulticlassClassificationEvaluator) that use mllib metrics, and I have a
> few questions (JIRA:
> SPARK-18693 <https://issues.apache.org/jira/browse/SPARK-18693>).
> I didn't see any similar question on the forums or Stack Overflow.
>
> Moving forward, will we keep mllib metrics (RegressionMetrics,
> MulticlassMetrics, BinaryClassificationMetrics) as something separate from
> the evaluators, or will we remove them when mllib is removed in Spark 3.0?
>
> The mllib metrics seem very useful because they are able to compute/expose
> many metrics on one dataset, whereas with the evaluators it is not
> performant to re-evaluate the entire dataset for a single different metric.
>
> For example, if I calculate the RMSE and then MSE using the ML
> RegressionEvaluator, I will be redoing most of the work twice, so the ML
> api doesn’t make sense to use in this scenario.
>
> Also, the ml evaluators expose a lot fewer metrics than the mllib metrics
> classes, so it seems like the ml evaluators are not at parity with the
> mllib metrics classes.
>
> I can see how the ml evaluators are useful in CrossValidator, but for
> exploring all metrics from a scored dataset it doesn’t really make sense to
> use them.
>
> From the viewpoint of exploring all metrics for a scored model, does this
> mean that the mllib metrics classes should be moved to ml?
>
> That would solve my issue if that is what is planned in the future.
> However, that doesn’t make sense to me, because it may cause some confusion
> for ml users to see metrics and evaluators classes.
>
>
>
> Instead, it seems like the ml evaluators need to be changed at the api
> layer to:
>
>1. Allow the user to either retrieve a single value
>2. Allow the user to retrieve all metrics or a set of metrics
>
> One possibility would be to overload evaluate so that we would have
> something like:
>
>
>
> override def evaluate(dataset: Dataset[_]): Double
>
> override def evaluate(dataset: Dataset[_], metrics:Array[String]):
> Dataset[_]
>
>
>
> But for some metrics like confusion matrix you couldn’t really fit the
> data into the result of the second api in addition to the single-value
> metrics.
>
> The format of the mllib metrics classes was much more convenient, as you
> could retrieve them directly.
>
> Following this line of thought, maybe the APIs could be:
>
>
>
> override def evaluate(dataset: Dataset[_]): Double
>
> def evaluateMetrics(dataset: Dataset[_]): RegressionEvaluation (or
> classification/multiclass etc)
>
>
>
> where the evaluation class returned will have very similar fields to the
> corresponding mllib RegressionMetrics class that can be called by the user.
>
>
>
> Any thoughts/ideas about spark ml evaluators/mllib metrics apis, coding
> suggestions for the api proposed, or a general roadmap would be really
> appreciated.
>
>
>
> Thank you, Ilya
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Joseph Bradley
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1
>
> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> +1
>>
>> Xiao Li
>>
>> 2016-12-16 12:19 GMT-08:00 Felix Cheung <felixcheun...@hotmail.com>:
>>
>>> For R we have a license field in the DESCRIPTION, and this is standard
>>> practice (and requirement) for R packages.
>>>
>>> https://cran.r-project.org/doc/manuals/R-exts.html#Licensing
>>>
>>> --
>>> *From:* Sean Owen <so...@cloudera.com>
>>> *Sent:* Friday, December 16, 2016 9:57:15 AM
>>> *To:* Reynold Xin; dev@spark.apache.org
>>> *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>>>
>>> (If you have a template for these emails, maybe update it to use https
>>> links. They work for apache.org domains. After all we are asking people
>>> to verify the integrity of release artifacts, so it might as well be
>>> secure.)
>>>
>>> (Also the new archives use .tar.gz instead of .tgz like the others. No
>>> big deal, my OCD eye just noticed it.)
>>>
>>> I don't see an Apache license / notice for the Pyspark or SparkR
>>> artifacts. It would be good practice to include this in a convenience
>>> binary. I'm not sure if it's strictly mandatory, but something to adjust in
>>> any event. I think that's all there is to do for SparkR. For Pyspark, which
>>> packages a bunch of dependencies, it does include the licenses (good) but I
>>> think it should include the NOTICE file.
>>>
>>> This is the first time I recall getting 0 test failures off the bat!
>>> I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.
>>>
>>> I think I'd +1 this therefore unless someone knows that the license
>>> issue above is real and a blocker.
>>>
>>> On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT
>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.1.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.1.0-rc5 (cd0a08361e2526519e7c131c42116
>>>> bf56fa62c76)
>>>>
>>>> List of JIRA tickets resolved are:  https://issues.apache.org/jir
>>>> a/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1223/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.1.0?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>>>
>>>> *What happened to RC3/RC4?*
>>>>
>>>> They had issues with the release packaging and as a result were skipped.
>>>>
>>>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
> [image: http://databricks.com] <http://databricks.com/>
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Please limit commits for branch-2.1

2016-11-21 Thread Joseph Bradley
To committers and contributors active in MLlib,

Thanks everyone who has started helping with the QA tasks in SPARK-18316!
I'd like to request that we stop committing non-critical changes to MLlib,
including the Python and R APIs, since still-changing public APIs make it
hard to QA.  We have already started to sign off on some QA tasks, but
we may need to re-open them if changes are committed, especially if those
changes are to public APIs.  There's no need to push Python and R wrappers
into 2.1 at the last minute.

Let's focus on completing QA, after which we can resume committing API
changes to master (not branch-2.1).

Thanks everyone!
Joseph


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg,

It's true we need better documentation for this.  I'd recommend checking
out simple algorithms within Spark for examples:
ml.feature.Tokenizer
ml.regression.IsotonicRegression

You should not need to put your library in Spark's namespace.  The shared
Params in SPARK-7146 are not necessary to create a custom algorithm; they
are just niceties.

Though there aren't great docs yet, you should be able to follow existing
examples.  And I'd like to add more docs in the future!

Good luck,
Joseph
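
For anyone looking for a starting point, a minimal custom Transformer living
entirely outside Spark's namespace can look roughly like this (a sketch built
on the public UnaryTransformer base class; the class and column names are
made up, and the exact set of overrides can differ slightly across Spark
versions):

package com.example.ml  // your own namespace, not org.apache.spark

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Upper-cases a string column; input/output columns are chosen by the caller.
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // The per-row transformation applied to the input column.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // Output column type, used for up-front schema validation.
  override protected def outputDataType: DataType = StringType
}

// usage: new UpperCaser().setInputCol("text").setOutputCol("textUpper").transform(df)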

On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
wrote:

> HI,
>
> I want to develop a library with custom Estimator / Transformers for
> spark. So far not a lot of documentation could be found but
> http://stackoverflow.com/questions/37270446/how-to-
> roll-a-custom-estimator-in-pyspark-mllib
>
> Suggest that:
> Generally speaking, there is no documentation because, as of Spark 1.6 /
> 2.0, most of the related API is not intended to be public. It should change
> in Spark 2.1.0 (see SPARK-7146
> ).
>
> Where can I already find documentation today?
> Is it true that my library would require residing in Spark's namespace
> similar to https://github.com/collectivemedia/spark-ext to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>


Re: Reduce the memory usage if we do same first in GradientBoostedTrees if subsamplingRate< 1.0

2016-11-15 Thread Joseph Bradley
Thanks for the suggestion.  That would be faster, but less accurate in most
cases.  It's generally better to use a new random sample on each iteration,
based on literature and results I've seen.
Joseph
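
For context, "a new random sample on each iteration" means roughly the
following inside the boosting loop (an illustrative sketch only; data,
subsamplingRate, seed, and m refer to the values in the quoted loop below,
and the actual MLlib implementation handles subsampling inside the
tree-training code rather than like this):

// Draw a fresh Bernoulli subsample for iteration m, instead of materializing
// one fixed subsample up front and reusing it for every iteration.
val sampledData =
  if (subsamplingRate < 1.0) {
    data.sample(withReplacement = false, fraction = subsamplingRate, seed = seed + m)
  } else {
    data
  }
// sampledData is then what the m-th DecisionTreeRegressor would be fit on.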

On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <
wangjianfe...@otcaix.iscas.ac.cn> wrote:

> when we train the model, we will use the data with a subSampleRate, so if
> the subSampleRate < 1.0, we can do a sample first to reduce the memory usage.
> See the code below in GradientBoostedTrees.boost():
>
> while (m < numIterations && !doneLearning) {
>   // Update data with pseudo-residuals (residual errors)
>   val data = predError.zip(input).map { case ((pred, _), point) =>
>     LabeledPoint(-loss.gradient(pred, point.label), point.features)
>   }
>
>   timer.start(s"building tree $m")
>   logDebug("###")
>   logDebug("Gradient boosting tree iteration " + m)
>   logDebug("###")
>   val dt = new DecisionTreeRegressor().setSeed(seed + m)
>   val model = dt.train(data, treeStrategy)
>
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Reduce-the-memory-
> usage-if-we-do-same-first-in-GradientBoostedTrees-if-
> subsamplingRate-1-0-tp19826.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Joseph Bradley
+1

On Fri, Nov 4, 2016 at 11:20 AM, Michael Armbrust 
wrote:

> +1
>
> On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc2 (a6abe1ee22141931614bf27a4f371
>> c46d8379e33)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0
>> .2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1210/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC3) is cut, I will change the fix version of those patches to 2.0.2.
>>
>
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Joseph Bradley
+1

On Thu, Nov 3, 2016 at 9:51 PM, Kousuke Saruta 
wrote:

> +1 (non-binding)
>
> - Kousuke
>
> On 2016/11/03 9:40, Reynold Xin wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.3
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a05
>> 42a2367b14b)
>>
>> This release candidate addresses 52 JIRA tickets:
>> https://s.apache.org/spark-1.6.3-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/ <
>> http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.3-rc2-bin/>
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1212/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/ <
>> http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.3-rc2-docs/
>> >
>>
>>
>> ===
>> == How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.6.2.
>>
>> 
>> == What justifies a -1 vote for this release?
>> 
>> This is a maintenance release in the 1.6.x series.  Bugs already present
>> in 1.6.2, missing features, or bugs related to new features will not
>> necessarily block this release.
>>
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ML]Random Forest Error : Size exceeds Integer.MAX_VALUE

2016-10-05 Thread Joseph Bradley
Could you please file a bug report JIRA and also include more info about
what you ran?
* Random forest Param settings
* dataset dimensionality, partitions, etc.
Thanks!

On Tue, Oct 4, 2016 at 10:44 PM, Samkit Shah  wrote:

> Hello folks,
> I am running Random Forest from ml in Spark 1.6.1 on the bimbo [1] dataset
> with the following configurations:
>
> "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" 
> "-Dspark.driver.extraJavaOptions=-Xss10240k
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2
> -XX:-UseAdaptiveSizePolicy -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit  -XX:
> CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g
> -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster"
> "-Dspark.speculation=true" "-Dspark.speculation.multiplier=2"
> "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms"
>  "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768"
> "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6"
> "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
> -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC
> -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g
> -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps"
> "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g"
> "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails"
> "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution"
> "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2"
> "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2"
> "-XX:-UseGCOverheadLimit" "-XX:CMSInitiatingOccupancyFraction=75"
> "-XX:NewSize=8g" "-XX:MaxNewSize=8g" "-XX:SurvivorRatio=3"
> "-DnumPartitions=36" "org.apache.spark.deploy.worker.DriverWrapper"
> "spark://Worker@11.0.0.106:56419"
>
>
> I get following error:
> 16/10/04 06:55:05 WARN TaskSetManager: Lost task 8.0 in stage 19.0 (TID
> 194, 11.0.0.106): java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.
> apply(DiskStore.scala:127)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.
> apply(DiskStore.scala:115)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at org.apache.spark.storage.BlockManager.doGetLocal(
> BlockManager.scala:503)
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(
> ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> I have varied the number of partitions from 24 to 48. I still get the same
> error. How can this problem be tackled?
>
>
> Thanks,
> Samkit
>
>
>
>
> [1]: https://www.kaggle.com/c/grupo-bimbo-inventory-demand
>


Re: welcoming Xiao Li as a committer

2016-10-05 Thread Joseph Bradley
Congrats!

On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta 
wrote:

> Congratulations Xiao!
>
> - Kousuke
> On 2016/10/05 7:44, Bryan Cutler wrote:
>
> Congrats Xiao!
>
> On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau 
> wrote:
>
>> Congratulations :D :) Yay!
>>
>> On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> Congratulations, Xiao!
>>>
>>>
>>>
>>> > On Oct 3, 2016, at 10:46 PM, Reynold Xin  wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
>>> committer. Xiao has been a super active contributor to Spark SQL. Congrats
>>> and welcome, Xiao!
>>> >
>>> > - Reynold
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>


Re: Nominal Attribute

2016-10-03 Thread Joseph Bradley
There are plans...but not concrete ones yet:
https://issues.apache.org/jira/browse/SPARK-8515
I agree categorical data handling is a pain point and that we need to
improve it!
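
For reference, the existing (sparsely documented) pattern for tagging a column
as categorical looks roughly like this (a sketch assuming a DataFrame df with
a numeric column "colorIndex"; the column and level names are illustrative):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Attach nominal (categorical) metadata so downstream algorithms such as
// tree-based models can read the number of categories from the schema.
val colorAttr = NominalAttribute.defaultAttr
  .withName("colorIndex")
  .withValues("red", "green", "blue")

val tagged = df.select(col("colorIndex").as("colorIndex", colorAttr.toMetadata()))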

On Tue, Sep 13, 2016 at 4:45 PM, Danil Kirsanov 
wrote:

> NominalAttribute in MLlib is used to represent categorical data internally.
> It is barely documented though and has a number of limitations: for
> example,
> it supports only integer and string data.
> Is there any current effort to expose it (and categorical data handling in
> general) to the users, or is it intended to be an internal MLlib data
> representation only?
>
> Thank you,
> Danil
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Nominal-Attribute-tp18935.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Joseph Bradley
+1

On Thu, Sep 29, 2016 at 2:11 PM, Dongjoon Hyun  wrote:

> +1 (non-binding)
>
> At this time, I tested RC4 on the followings.
>
> - CentOS 6.8 (Final)
> - OpenJDK 1.8.0_101
> - Python 2.7.12
>
> /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr -DskipTests clean package
> /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr test
> python2.7 python/run-tests.py --python-executables python2.7
>
> All tests are passed.
>
> Cheers,
> Dongjoon.
>
> On 2016-09-29 12:20 (-0700), Sameer Agarwal  wrote:
> > +1
> >
> > On Thu, Sep 29, 2016 at 12:04 PM, Sean Owen  wrote:
> >
> > > +1 from me too, same result as my RC3 vote/testing.
> > >
> > > On Wed, Sep 28, 2016 at 10:14 PM, Reynold Xin 
> wrote:
> > > > Please vote on releasing the following candidate as Apache Spark
> version
> > > > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and
> passes
> > > if a
> > > > majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 2.0.1
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > >
> > > > The tag to be voted on is v2.0.1-rc4
> > > > (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)
> > > >
> > > > This release candidate resolves 301 issues:
> > > > https://s.apache.org/spark-2.0.1-jira
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-releases/spark-
> 2.0.1-rc4-bin/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > > https://repository.apache.org/content/repositories/
> orgapachespark-1203/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-releases/spark-
> 2.0.1-rc4-docs/
> > > >
> > > >
> > > > Q: How can I help test this release?
> > > > A: If you are a Spark user, you can help us test this release by
> taking
> > > an
> > > > existing Spark workload and running on this release candidate, then
> > > > reporting any regressions from 2.0.0.
> > > >
> > > > Q: What justifies a -1 vote for this release?
> > > > A: This is a maintenance release in the 2.0.x series.  Bugs already
> > > present
> > > > in 2.0.0, missing features, or bugs related to new features will not
> > > > necessarily block this release.
> > > >
> > > > Q: What fix version should I use for patches merging into branch-2.0
> from
> > > > now on?
> > > > A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
> RC
> > > > (i.e. RC5) is cut, I will change the fix version of those patches to
> > > 2.0.1.
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
> >
> > --
> > Sameer Agarwal
> > Software Engineer | Databricks Inc.
> > http://cs.berkeley.edu/~sameerag
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months.  With QA taking about a month, that's very reasonable.

My main ask (especially for MLlib) is for contributors and committers to
take extra care not to delay on updating the Programming Guide for new
APIs.  Documentation debt often collects and has to be paid off during QA,
and a longer cycle will exacerbate this problem.

On Wed, Sep 28, 2016 at 7:30 AM, Tom Graves 
wrote:

> +1 to 4 months.
>
> Tom
>
>
> On Tuesday, September 27, 2016 2:07 PM, Reynold Xin 
> wrote:
>
>
> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated (it took 6 months) from the regular release
> cadence we had for the 1.x line, and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we make a new 1.x feature
> release (e.g. 1.5.0 comes out three months after 1.4.0). Development
> happened primarily in the first two months, and then a release branch was
> cut at the end of month 2, and the last month was reserved for QA and
> release preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle
> because there were a lot of major changes happening and the longer time was
> critical for thinking through architectural changes as well as API design.
> While I don't expect the same degree of drastic changes in a 2.x feature
> release, I do think it'd make sense to increase the length of release cycle
> so we can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did
> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>
>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Joseph Bradley
+1

On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee  wrote:

> +1 (non-binding)
> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang  wrote:
>
>> +1
>>
>> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee 
>>> wrote:
>>>
 +1


 On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier <
 hvanhov...@databricks.com> wrote:

> +1 (non-binding)
>
> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida <
> ricardo.alme...@actnowib.com> wrote:
>
>> +1 (non-binding)
>>
>> Built and tested on
>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
>> - CentOS / Oracle Java 1.7.0_55
>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver
>> -Pyarn)
>>
>>
>> On 25 September 2016 at 22:35, Matei Zaharia > > wrote:
>>
>>> +1
>>>
>>> Matei
>>>
>>> On Sep 25, 2016, at 1:25 PM, Josh Rosen 
>>> wrote:
>>>
>>> +1
>>>
>>> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai 
>>> wrote:
>>>
 +1

 On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun <
 dongj...@apache.org> wrote:

> +1 (non binding)
>
> RC3 is compiled and tested on the following two systems, too. All
> tests passed.
>
> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver -Dsparkr
> * CentOS 7.2 / Open JDK 1.8.0_102
>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver
>
> Cheers,
> Dongjoon
>
>
>
> On Saturday, September 24, 2016, Reynold Xin 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT 
>> and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc3 (
>> 9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>
>> This release candidate resolves 290 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be
>> found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-
>> 2.0.1-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/
>> orgapachespark-1201/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-
>> 2.0.1-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by
>> taking an existing Spark workload and running on this release 
>> candidate,
>> then reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs
>> already present in 2.0.0, missing features, or bugs related to new 
>> features
>> will not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into
>> branch-2.0 from now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a
>> new RC (i.e. RC4) is cut, I will change the fix version of those 
>> patches to
>> 2.0.1.
>>
>>
>>

>>>
>>
>

>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it:
https://github.com/graphframes/graphframes/releases/tag/release-0.2.0
Thanks for the reminder!
Joseph
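
For a quick taste of the DataFrame-based API and the motif finding mentioned
in the announcement quoted below, a small sketch (assuming Spark 2.0's
SparkSession as spark and the graphframes package on the classpath; the toy
data is made up):

import org.graphframes.GraphFrame

// Vertices need an "id" column; edges need "src" and "dst" columns.
val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val edges = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Motif finding: two-hop "follows" chains, expressed as a DataFrame query.
g.find("(x)-[e1]->(y); (y)-[e2]->(z)").show()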

On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński  wrote:

> Hi,
> Do you plan to add a tag for this release on GitHub?
> https://github.com/graphframes/graphframes/releases
>
> Regards,
> Maciek
>
> 2016-08-17 3:18 GMT+02:00 Jacek Laskowski :
>
>> Hi Tim,
>>
>> AWESOME. Thanks a lot for releasing it. That makes me even more eager
>> to see it in Spark's codebase (and replacing the current RDD-based
>> API)!
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter 
>> wrote:
>> > Hello all,
>> > I have released version 0.2.0 of the GraphFrames package. Apart from a
>> few
>> > bug fixes, it is the first release published for Spark 2.0 and both
>> scala
>> > 2.10 and 2.11. Please let us know if you have any comment or questions.
>> >
>> > It is available as a Spark package:
>> > https://spark-packages.org/package/graphframes/graphframes
>> >
>> > The source code is available as always at
>> > https://github.com/graphframes/graphframes
>> >
>> >
>> > What is GraphFrames?
>> >
>> > GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
>> > algorithms available in GraphX, users can write highly expressive
>> queries by
>> > leveraging the DataFrame API, combined with a new API for motif
>> finding. The
>> > user also benefits from DataFrame performance optimizations within the
>> Spark
>> > SQL engine.
>> >
>> > Cheers
>> >
>> > Tim
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Maciek Bryński
>


Re: Welcoming Felix Cheung as a committer

2016-08-16 Thread Joseph Bradley
Welcome Felix!

On Mon, Aug 15, 2016 at 6:16 AM, mayur bhole 
wrote:

> Congrats Felix!
>
> On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy  wrote:
>
>> Congrats Felix
>>
>> Paul Roy.
>>
>> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC recently voted to add Felix Cheung as a committer. Felix has
>>> been a major contributor to SparkR and we're excited to have him join
>>> officially. Congrats and welcome, Felix!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> "Change is slow and gradual. It requires hardwork, a bit of
>> luck, a fair amount of self-sacrifice and a lot of patience."
>>
>> Roy.
>>
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Joseph Bradley
+1

Mainly tested ML/Graph/R.  Perf tests from Tim Hunter showed minor speedups
from 1.6 for common ML algorithms.

On Thu, Jul 21, 2016 at 9:41 AM, Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non binding)
>
> Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a standalone
> cluster
>
> On 21 July 2016 at 05:24, Reynold Xin  wrote:
>
>> +1
>>
>>
>> On Wednesday, July 20, 2016, Krishna Sankar  wrote:
>>
>>> +1 (non-binding, of course)
>>>
>>> 1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
>>>  mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
>>> 2. Tested pyspark, mllib (iPython 4.0)
>>> 2.0 Spark version is 2.0.0
>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 2.2. Linear/Ridge/Lasso Regression OK
>>> 2.3. Classification : Decision Tree, Naive Bayes OK
>>> 2.4. Clustering : KMeans OK
>>>Center And Scale OK
>>> 2.5. RDD operations OK
>>>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>Model evaluation/optimization (rank, numIter, lambda) with
>>> itertools OK
>>> 3. Scala - MLlib
>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 3.2. LinearRegressionWithSGD OK
>>> 3.3. Decision Tree OK
>>> 3.4. KMeans OK
>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>> 3.6. saveAsParquetFile OK
>>> 3.7. Read and verify the 3.6 save(above) - sqlContext.parquetFile,
>>> registerTempTable, sql OK
>>> 3.8. result = sqlContext.sql("SELECT
>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>> 4.0. Spark SQL from Python OK
>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>> OK
>>> 5.0. Packages
>>> 5.1. com.databricks.spark.csv - read/write OK (--packages
>>> com.databricks:spark-csv_2.10:1.4.0)
>>> 6.0. DataFrames
>>> 6.1. cast,dtypes OK
>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>> 6.3. All joins,sql,set operations,udf OK
>>> [Dataframe Operations very fast from 11 secs to 3 secs, to 1.8 secs, to
>>> 1.5 secs! Good work !!!]
>>> 7.0. GraphX/Scala
>>> 7.1. Create Graph (small and bigger dataset) OK
>>> 7.2. Structure APIs - OK
>>> 7.3. Social Network/Community APIs - OK
>>> 7.4. Algorithms : PageRank of 2 datasets, aggregateMessages() - OK
>>>
>>> Cheers
>>> 
>>>
>>> On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT
 and passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.0
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.0-rc5
 (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).

 This release candidate resolves ~2500 issues:
 https://s.apache.org/spark-2.0.0-jira

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1195/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/


 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions from 1.x.

 ==
 What justifies a -1 vote for this release?
 ==
 Critical bugs impacting major functionalities.

 Bugs already present in 1.x, missing features, or bugs related to new
 features will not necessarily block this release. Note that historically
 Spark documentation has been published on the website separately from the
 main release so we do not need to block the release due to documentation
 errors either.


>>>
>


Re: Hello

2016-06-20 Thread Joseph Bradley
Hi Harmeet,

I'll add one more item to the other advice: The community is in the process
of putting together a roadmap JIRA for 2.1 for ML:
https://issues.apache.org/jira/browse/SPARK-15581

This JIRA lists some of the major items and links to a few umbrella JIRAs
with subtasks.  I'd expect this roadmap to change a little more as it is
still being formed, but I hope it provides some guidance.  Feel free to
ping on specific JIRAs to ask about their current importance and to see who
else is working on them.

Thanks!
Joseph

On Fri, Jun 17, 2016 at 3:32 PM, Michael Armbrust 
wrote:

> Another good signal is the "target version" (which by convention is only
> set by committers).  When I set this for the upcoming version it means I
> think its important enough that I will prioritize reviewing a patch for it.
>
> On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez 
> wrote:
>
>> What is the best way to determine what the library maintainers believe is
>> important work to be done?
>>
>> I have looked through JIRA and it's unclear which items are priorities
>> that one could work on. I am guessing this is in part because things are a
>> little hectic with final work for 2.0, but it would be helpful to know what
>> to look for, or whether it's better to ask library maintainers directly.
>>
>> Thanks,
>> Pedro Rodriguez
>>
>> On Fri, Jun 17, 2016 at 10:46 AM, Xinh Huynh 
>> wrote:
>>
>>> Here are some guidelines about contributing to Spark:
>>>
>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>>
>>> There is also a section specific to MLlib:
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>
>>> -Xinh
>>>
>>> On Thu, Jun 16, 2016 at 9:30 AM,  wrote:
>>>
 Dear All,


 Looking for guidance.

 I am interested in contributing to Spark MLlib. Could you please
 take a few minutes to guide me as to what you would consider an ideal path
 / skill set an individual should possess.

 I know R / Python / Java / C and C++

 I have a firm understanding of algorithms and machine learning. I do
 know Spark at a "workable knowledge level".

 Where should I start, and what should I try to do first (at the Spark
 internals level) before picking up items on JIRA or new specifications for
 Spark?

 R has a great set of packages - would it be difficult to migrate them
 to the SparkR set? I could try it with your support, if that's desired.


 I wouldn't mind testing some defects, etc., as an initial
 learning exercise if that would assist the community.

 Please, guide.

 Regards,
 Harmeet



 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
One more note: When you specify the stages in the Pipeline, they need to be
in topological order according to the DAG.

On Sun, Jun 12, 2016 at 10:47 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> Hi Pranay,
>
> Yes, you can do this.  The DAG structure should be specified via the
> various Transformers' input and output columns, where a Transformer can
> have multiple input and/or output columns.  Most of the classification and
> regression Models are good examples of Transformers with multiple input and
> output columns.
>
> Hope this helps!
> Joseph
>
> On Wed, Jun 8, 2016 at 9:59 PM, Pranay Tonpay <pton...@gmail.com> wrote:
>
>> Hi,
>> Pipeline, as of now, seems to consist of a series of transformers and
>> estimators run in a serial fashion.
>> Is it possible to create a DAG sort of thing -
>> Eg -
>> Two transformers running in parallel to cleanse data (a custom built
>> Transformer)  in some way and then their outputs ( two outputs ) used for
>> some sort of correlation ( another custom built Transformer )
>>
>> Let me know -
>>
>> thx
>> pranay
>>
>
>


Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
Hi Pranay,

Yes, you can do this.  The DAG structure should be specified via the
various Transformers' input and output columns, where a Transformer can
have multiple input and/or output columns.  Most of the classification and
regression Models are good examples of Transformers with multiple input and
output columns.

Hope this helps!
Joseph
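
To make the column-based DAG concrete, a minimal sketch (assuming a DataFrame
df with a string column "text" and a numeric column "numbers"; all names are
illustrative): two independent branches feed a VectorAssembler, and the
stages are simply listed in topological order.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("textFeatures")

// VectorAssembler merges the two branches of the DAG: the hashed text
// features and the raw numeric column.
val assembler = new VectorAssembler()
  .setInputCols(Array("textFeatures", "numbers"))
  .setOutputCol("features")

// Stages listed in topological order of the column dependencies.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, assembler))
val model = pipeline.fit(df)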

On Wed, Jun 8, 2016 at 9:59 PM, Pranay Tonpay  wrote:

> Hi,
> Pipeline, as of now, seems to consist of a series of transformers and
> estimators run in a serial fashion.
> Is it possible to create a DAG sort of thing -
> Eg -
> Two transformers running in parallel to cleanse data (a custom built
> Transformer)  in some way and then their outputs ( two outputs ) used for
> some sort of correlation ( another custom built Transformer )
>
> Let me know -
>
> thx
> pranay
>


Re: Welcoming Yanbo Liang as a committer

2016-06-12 Thread Joseph Bradley
Congrats & welcome!

On Tue, Jun 7, 2016 at 7:15 AM, Xiangrui Meng  wrote:

> Congrats!!
>
> On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali 
> wrote:
>
>> Congratulations Yanbo Liang! Well deserved.
>>
>>
>> On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Congrats, Yanbo!
>>>
>>> On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin  wrote:
>>>
 Congratulations Yanbo!

 On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler  wrote:

> Congratulations Yanbo!
> On Jun 5, 2016 4:03 AM, "Kousuke Saruta" 
> wrote:
>
>> Congratulations Yanbo!
>>
>>
>> - Kousuke
>>
>> On 2016/06/04 11:48, Matei Zaharia wrote:
>>
>>> Hi all,
>>>
>>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has
>>> been a super active contributor in many areas of MLlib. Please join me 
>>> in
>>> welcoming Yanbo!
>>>
>>> Matei
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

>>>
>>


Re: Shrinking the DataFrame lineage

2016-06-12 Thread Joseph Bradley
Sorry for the slow response.  I agree with Hamel on #1.
GraphFrames are mostly wrappers for GraphX algorithms.  There are a few
which are not:
* BFS: This is an iterative DataFrame alg.  Though it has unit tests, I
have not pushed it at scale to see how far it can go.
* Belief Propagation example: This uses the conversion to and from an RDD.
Not great, but it's really just an example for now.

I definitely want to get this issue fixed ASAP!

On Sun, May 15, 2016 at 7:15 AM, Hamel Kothari <hamelkoth...@gmail.com>
wrote:

> I don't know about the second one but for question #1:
> When you convert from a cached DF to an RDD (via a map function or the
> "rdd" value) the types are converted from the off-heap types to on-heap
> types. If your rows are fairly large/complex this can have a pretty big
> performance impact so I would watch out for that.
>
> On Fri, May 13, 2016 at 5:29 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Hi Joseph,
>>
>>
>>
>> Thank you for the link! Two follow-up questions:
>>
>> 1) Suppose I have the original DataFrame in Tungsten, i.e. Catalyst types,
>> cached in the off-heap store. It might be quite useful for iterative
>> workloads due to lower GC overhead. Then I convert it to an RDD and then
>> back to a DF. Will the resulting DF remain off-heap, or will it be on-heap
>> like a regular RDD?
>>
>> 2) How is the mentioned problem handled in GraphFrames? Suppose I want to
>> use aggregateMessages in an iterative loop, for implementing PageRank.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Joseph Bradley [mailto:jos...@databricks.com]
>> *Sent:* Friday, May 13, 2016 12:38 PM
>> *To:* Ulanov, Alexander <alexander.ula...@hpe.com>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: Shrinking the DataFrame lineage
>>
>>
>>
>> Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346
>>
>>
>>
>> I don't have a great method currently, but hacks can get around it:
>> convert the DataFrame to an RDD and back to truncate the query plan lineage.
>>
>>
>>
>> Joseph
>>
>>
>>
>> On Wed, May 11, 2016 at 12:46 PM, Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Dear Spark developers,
>>
>>
>>
>> Recently, I was trying to switch my code from RDDs to DataFrames in order
>> to compare the performance. The code computes RDD in a loop. I use
>> RDD.persist followed by RDD.count to force Spark compute the RDD and cache
>> it, so that it does not need to re-compute it on each iteration. However,
>> it does not seem to work for DataFrame:
>>
>>
>>
>> import scala.util.Random
>>
>> val rdd = sc.parallelize(1 to 10, 2).map(x => (Random.nextInt(5), Random.nextInt(5)))
>>
>> val edges = sqlContext.createDataFrame(rdd).toDF("from", "to")
>>
>> val vertices =
>> edges.select("from").unionAll(edges.select("to")).distinct().cache()
>>
>> vertices.count
>>
>> [Stage 34:=> (65 + 4)
>> / 200]
>>
>> [Stage 34:>  (90 + 5)
>> / 200]
>>
>> [Stage 34:==>   (114 + 4)
>> / 200]
>>
>> [Stage 34:> (137 + 4)
>> / 200]
>>
>> [Stage 34:==>   (157 + 4)
>> / 200]
>>
>> [Stage 34:=>(182 + 4)
>> / 200]
>>
>>
>>
>> res25: Long = 5
>>
>> If I run count again, it recomputes it again instead of using the cached
>> result:
>>
>> scala> vertices.count
>>
>> [Stage 37:=> (49 + 4)
>> / 200]
>>
>> [Stage 37:==>(66 + 4)
>> / 200]
>>
>> [Stage 37:>  (90 + 4)
>> / 200]
>>
>> [Stage 37:=>(110 + 4)
>> / 200]
>>
>> [Stage 37:===>  (133 + 4)
>> / 200]
>>
>> [Stage 37:==>   (157 + 4)
>> / 200]
>>
>> [Stage 37:> (178 + 5)
>> / 200]
>>
>> res26: Long = 5
>>
>>
>>
>> Could you suggest how to shrink the DataFrame lineage?
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>


Re: Implementing linear albegra operations in the distributed linalg package

2016-06-10 Thread Joseph Bradley
I agree that more distributed matrix ops would be good to have, but I think
there are a few things which need to happen first:
* Now that the spark.ml package has local linear algebra separate from the
spark.mllib package, we should migrate the distributed linear algebra
implementations over to spark.ml.
* This migration will require a bit of thinking about what the API should
look like.  Should it use Datasets?  If so, are there missing requirements
to fix within Datasets or local linear algebra?

I just created a JIRA; let's discuss more there:
https://issues.apache.org/jira/browse/SPARK-15882

Thanks for bringing this up!
Joseph
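
As a concrete example of the kind of operation being proposed, a distributed
matrix-vector multiply over an IndexedRowMatrix can be sketched as follows
(an illustrative sketch using the existing spark.mllib types; the dot product
is written out by hand to avoid private helpers, sparse rows are densified
for simplicity, and v is assumed to have the same dimension as the rows):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.rdd.RDD

// y = A * v, computed row by row: each entry of the result is the dot product
// of one distributed row with the broadcast local vector v.
def multiply(mat: IndexedRowMatrix, v: Vector): RDD[(Long, Double)] = {
  val bv = mat.rows.sparkContext.broadcast(v.toArray)
  mat.rows.map { row =>
    val arr = row.vector.toArray
    var dot = 0.0
    var i = 0
    while (i < arr.length) { dot += arr(i) * bv.value(i); i += 1 }
    (row.index, dot)
  }
}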

On Fri, Jun 3, 2016 at 4:02 AM, José Manuel Abuín Mosquera <
abui...@gmail.com> wrote:

> Hello,
>
> I would like to add some linear algebra operations to all the
> DistributedMatrix classes that Spark actually handles (CoordinateMatrix,
> BlockMatrix, IndexedRowMatrix and RowMatrix), but first I would like to ask
> if you consider this useful. (For me, it is)
>
> Of course, these operations will be distributed, but they will rely on the
> local implementation of mllib linalg. For example, when multiplying an
> IndexedRowMatrix by a DenseVector, the multiplication of one of the matrix
> rows by the vector will be performed by using the local implementation
>
> What is your opinion about it?
>
> Thank you
>
> --
> José Manuel Abuín Mosquera
> Pre-doctoral researcher
> Centro de Investigación en Tecnoloxías da Información (CiTIUS)
> University of Santiago de Compostela
> 15782 Santiago de Compostela, Spain
>
> http://citius.usc.es/equipo/investigadores-en-formacion/josemanuel.abuin
> http://jmabuin.github.io
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1

On Wed, May 18, 2016 at 10:49 AM, Reynold Xin  wrote:

> Hi Ovidiu-Cristian ,
>
> The best source of truth is to change the filter to target version
> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
> get closer to 2.0 release, more will be retargeted at 2.1.0.
>
>
>
> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Yes, I can filter..
>> Did that and for example:
>>
>> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
>> 
>>
>> To rephrase: for 2.0 do you have specific issues that are not a priority
>> and will be released maybe with 2.1, for example?
>>
>> Keep up the good work!
>>
>> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>>
>> You can find that by changing the filter to target version = 2.0.0.
>> Cheers.
>>
>> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> +1 Great, I see the list of resolved issues, do you have a list of known
>>> issues that will remain in this release?
>>>
>>> with
>>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
>>> -Phive-thriftserver -DskipTests clean package
>>>
>>> mvn -version
>>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>>> 2015-11-10T17:41:47+01:00)
>>> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
>>> Java version: 1.7.0_80, vendor: Oracle Corporation
>>> Java home:
>>> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
>>> Default locale: en_US, platform encoding: UTF-8
>>> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: "mac"
>>>
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS [
>>> 2.635 s]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 1.896 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 2.560 s]
>>> [INFO] Spark Project Networking ... SUCCESS [
>>> 6.533 s]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>> 4.176 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>> 4.809 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 6.242 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [01:20 min]
>>> [INFO] Spark Project GraphX ... SUCCESS [
>>> 9.148 s]
>>> [INFO] Spark Project Streaming  SUCCESS [
>>> 22.760 s]
>>> [INFO] Spark Project Catalyst . SUCCESS [
>>> 50.783 s]
>>> [INFO] Spark Project SQL .. SUCCESS
>>> [01:05 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 4.281 s]
>>> [INFO] Spark Project ML Library ... SUCCESS [
>>> 54.537 s]
>>> [INFO] Spark Project Tools  SUCCESS [
>>> 0.747 s]
>>> [INFO] Spark Project Hive . SUCCESS [
>>> 33.032 s]
>>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>>> 3.198 s]
>>> [INFO] Spark Project REPL . SUCCESS [
>>> 3.573 s]
>>> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
>>> 4.617 s]
>>> [INFO] Spark Project YARN . SUCCESS [
>>> 7.321 s]
>>> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
>>> 16.496 s]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>> 2.300 s]
>>> [INFO] Spark Project External Flume Sink .. SUCCESS [
>>> 4.219 s]
>>> [INFO] Spark Project External Flume ... SUCCESS [
>>> 6.987 s]
>>> [INFO] Spark Project External Flume Assembly .. SUCCESS [
>>> 1.465 s]
>>> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
>>> 6.891 s]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 13.465 s]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 2.815 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 07:04 min
>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00
>>> [INFO] Final Memory: 90M/824M
>>> [INFO]
>>> 
>>>
>>> On 18 May 2016, at 16:28, Sean Owen  wrote:
>>>
>>> I think it's a good idea. Although releases have been preceded before
>>> by release candidates for developers, it would be 

Re: Shrinking the DataFrame lineage

2016-05-13 Thread Joseph Bradley
Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346

I don't have a great method currently, but hacks can get around it: convert
the DataFrame to an RDD and back to truncate the query plan lineage.

Joseph
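
In code, the hack amounts to something like this (a sketch against the
1.6-era API; sqlContext and df are assumed to be in scope, and the helper
name is made up):

import org.apache.spark.sql.DataFrame

// Rebuild the DataFrame from its RDD[Row] representation so the new
// DataFrame's query plan no longer carries the chain of transformations
// that produced it.
def truncateLineage(df: DataFrame): DataFrame = {
  val truncated = sqlContext.createDataFrame(df.rdd, df.schema)
  truncated.cache()
  truncated.count()  // force materialization so later iterations hit the cache
  truncated
}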

On Wed, May 11, 2016 at 12:46 PM, Ulanov, Alexander <
alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
>
>
> Recently, I was trying to switch my code from RDDs to DataFrames in order
> to compare the performance. The code computes RDD in a loop. I use
> RDD.persist followed by RDD.count to force Spark compute the RDD and cache
> it, so that it does not need to re-compute it on each iteration. However,
> it does not seem to work for DataFrame:
>
>
>
> import scala.util.Random
>
> val rdd = sc.parallelize(1 to 10, 2).map(x => (Random.nextInt(5), Random.nextInt(5)))
>
> val edges = sqlContext.createDataFrame(rdd).toDF("from", "to")
>
> val vertices =
> edges.select("from").unionAll(edges.select("to")).distinct().cache()
>
> vertices.count
>
> [Stage 34:=> (65 + 4)
> / 200]
>
> [Stage 34:>  (90 + 5)
> / 200]
>
> [Stage 34:==>   (114 + 4)
> / 200]
>
> [Stage 34:> (137 + 4)
> / 200]
>
> [Stage 34:==>   (157 + 4)
> / 200]
>
> [Stage 34:=>(182 + 4)
> / 200]
>
>
>
> res25: Long = 5
>
> If I run count again, it recomputes it again instead of using the cached
> result:
>
> scala> vertices.count
>
> [Stage 37:=> (49 + 4)
> / 200]
>
> [Stage 37:==>(66 + 4)
> / 200]
>
> [Stage 37:>  (90 + 4)
> / 200]
>
> [Stage 37:=>(110 + 4)
> / 200]
>
> [Stage 37:===>  (133 + 4)
> / 200]
>
> [Stage 37:==>   (157 + 4)
> / 200]
>
> [Stage 37:> (178 + 5)
> / 200]
>
> res26: Long = 5
>
>
>
> Could you suggest how to shrink the DataFrame lineage?
>
>
>
> Best regards, Alexander
>


Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Joseph Bradley
Do you have code which can reproduce this performance drop in treeReduce?
It would be helpful to debug.  In the 1.6 release, we profiled it via the
various MLlib algorithms and did not see performance drops.

It's not just renumbering the partitions; it is reducing the number of
partitions by a factor of 1.0/scale (where scale > 1).  This creates a
"tree"-structured aggregation so that more of the work of merging during
aggregation is done on the workers, not the driver.
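
For context, the depth parameter is what controls how many of those
intermediate levels are used; a usage sketch (the numbers are arbitrary):

// Sum of squares over a large RDD with a 2-level aggregation tree, so the
// driver only merges roughly sqrt(numPartitions) partial results instead of
// one result per partition.
val values = sc.parallelize(1 to 1000000, numSlices = 400).map(_.toDouble)
val sumSq = values.treeAggregate(0.0)(
  seqOp = (acc, x) => acc + x * x,
  combOp = (a, b) => a + b,
  depth = 2)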

On Wed, Apr 27, 2016 at 4:46 AM, Guillaume Pitel  wrote:

> Hi,
>
> I've been looking at the code of RDD.treeAggregate, because we've seen a
> huge performance drop between 1.5.2 and 1.6.1 on a treeReduce. I think the
> treeAggregate code hasn't changed, so my message is not about the
> performance drop, but a more general remark about treeAggregate.
>
> In treeAggregate, after the aggregate is applied inside original
> partitions, we enter the tree :
>
>
> while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
>   numPartitions /= scale
>   val curNumPartitions = numPartitions
>   partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
>     (i, iter) => iter.map((i % curNumPartitions, _))
>   }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
> }
>
>
> The two lines where the partitions are numbered, then renumbered, then
> reduced by key seem suboptimal to me. There is a huge shuffle cost, whereas
> a simple coalesce followed by a partition-level aggregation would probably
> do the job perfectly.
>
> Have I missed something that requires this reshuffle?
>
> Best regards
> Guillaume Pitel
>


Re: net.razorvine.pickle.PickleException in Pyspark

2016-04-25 Thread Joseph Bradley
Thanks for your work on this.  Can we continue discussing on the JIRA?

On Sun, Apr 24, 2016 at 9:39 AM, Caique Marques 
wrote:

> Hello, everyone!
>
> I'm trying to implement the association rules in Python. I managed to
> implement association rules from a frequent itemset, and it works as
> expected (example can be seen here).
>
>
> Now, my challenge is to implement it over a custom RDD. I studied the
> structure of Spark and how it implements the Python functions of the machine
> learning algorithms. The implementations can be seen in the fork.
>
> The example of a custom RDD for association rules can be seen here;
> at line 33 the output is:
>
> MapPartitionsRDD[10] at mapPartitions at PythonMLLibAPI.scala:1533
>
> That is OK. Testing the Scala example, the structure returned is a
> MapPartitionsRDD. But when I try to use a *foreach* on this collection, I get:
>
> net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
> at
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1547)
> at
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1546)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
> at org.apache.spark.scheduler.Task.run(Task.scala:81)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> What is this? What does it mean? Any help or tips are welcome.
>
> Thanks,
> Caique.
>


Re: Organizing Spark ML example packages

2016-04-20 Thread Joseph Bradley
Sounds good to me.  I'd request we be strict during this process about
requiring *no* changes to the example itself, which will make review easier.

On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler  wrote:

> +1, adding some organization would make it easier for people to find a
> specific example
>
> On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang  wrote:
>
>> This sounds good to me, and it will make the ML examples more neatly organized.
>>
>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath :
>>
>>> Hey Spark devs
>>>
>>> I noticed that we now have a large number of examples for ML & MLlib in
>>> the examples project - 57 for ML and 67 for MLLIB to be precise. This is
>>> bound to get larger as we add features (though I know there are some PRs to
>>> clean up duplicated examples).
>>>
>>> What do you think about organizing them into packages to match the use
>>> case and the structure of the code base? e.g.
>>>
>>> org.apache.spark.examples.ml.recommendation
>>>
>>> org.apache.spark.examples.ml.feature
>>>
>>> and so on...
>>>
>>> Is it worth doing? The doc pages with include_example would need
>>> updating, and the run_example script input would just need to change the
>>> package slightly. Did I miss any potential issue?
>>>
>>> N
>>>
>>
>>
>


Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-12 Thread Joseph Bradley
That sounds useful.  Would you mind creating a JIRA for it?  Thanks!
Joseph
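
For context, a minimal sketch of where the single maxBins value bites in the
mllib RDD API (the data, cardinality, and parameter values are illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

def train(data: RDD[LabeledPoint]): RandomForestModel = {
  // Feature 0 is a categorical column with 1000 distinct values, so maxBins
  // must be at least 1000 -- and that same value is then used when binning
  // every continuous feature as well, which is the limitation discussed here.
  val categoricalFeaturesInfo = Map(0 -> 1000)
  RandomForest.trainClassifier(
    data, 2, categoricalFeaturesInfo, 100, "auto", "gini",
    10,    // maxDepth
    1000,  // maxBins, driven up by the single categorical column
    42)    // seed
}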

On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani 
wrote:

> Hi,
>
> Currently the RandomForest algo takes a single maxBins value to decide the
> number of splits to take. This sometimes causes training time to go very
> high when there is a single categorical column having sufficiently large
> number of unique values. This single column impacts all the numeric
> (continuous) columns even though such a high number of splits are not
> required.
>
> Encoding the categorical column into features makes the data very wide, and
> this requires us to increase maxMemoryInMB and puts more pressure on the
> GC as well.
>
> Keeping separate maxBins values for categorical and continuous features
> should be useful in this regard.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Different-maxBins-value-for-categorical-and-continuous-features-in-RandomForest-implementation-tp17099.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1  By the way, the JIRA for tracking (Scala) API parity is:
https://issues.apache.org/jira/browse/SPARK-4591

On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
wrote:

> This sounds good to me as well. The one thing we should pay attention to
> is how we update the docs so that people know to start with the spark.ml
> classes. Right now the docs list spark.mllib first and also seem more
> comprehensive in that area than in spark.ml, so maybe people naturally
> move towards that.
>
> Matei
>
> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>
> Yes, DB (cc'ed) is working on porting the local linear algebra library
> over (SPARK-13944). There are also frequent pattern mining algorithms we
> need to port over in order to reach feature parity. -Xiangrui
>
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Overall this sounds good to me. One question I have is that in
>> addition to the ML algorithms we have a number of linear algebra
>> (various distributed matrices) and statistical methods in the
>> spark.mllib package. Is the plan to port or move these to the spark.ml
>> namespace in the 2.x series ?
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>> > certainly better than two.
>> >
>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> >> Hi all,
>> >>
>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>> built
>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>> API has
>> >> been developed under the spark.ml package, while the old RDD-based
>> API has
>> >> been developed in parallel under the spark.mllib package. While it was
>> >> easier to implement and experiment with new APIs under a new package,
>> it
>> >> became harder and harder to maintain as both packages grew bigger and
>> >> bigger. And new users are often confused by having two sets of APIs
>> with
>> >> overlapped functions.
>> >>
>> >> We started to recommend the DataFrame-based API over the RDD-based API
>> in
>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>> development
>> >> and the usage gradually shifting to the DataFrame-based API. Just
>> counting
>> >> the lines of Scala code, from 1.5 to the current master we added ~1
>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>> to
>> >> gather more resources on the development of the DataFrame-based API
>> and to
>> >> help users migrate over sooner, I want to propose switching RDD-based
>> MLlib
>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>> >>
>> >> * We do not accept new features in the RDD-based spark.mllib package,
>> unless
>> >> they block implementing new features in the DataFrame-based spark.ml
>> >> package.
>> >> * We still accept bug fixes in the RDD-based API.
>> >> * We will add more features to the DataFrame-based API in the 2.x
>> series to
>> >> reach feature parity with the RDD-based API.
>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>> deprecate
>> >> the RDD-based API.
>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>> 3.0.
>> >>
>> >> Though the RDD-based API is already in de facto maintenance mode, this
>> >> announcement will make it clear and hence important to both MLlib
>> developers
>> >> and users. So we’d greatly appreciate your feedback!
>> >>
>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>> >> DataFrame-based API or even the entire MLlib component. This also
>> causes
>> >> confusion. To be clear, “Spark ML” is not an official name and there
>> are no
>> >> plans to rename MLlib to “Spark ML” at this time.)
>> >>
>> >> Best,
>> >> Xiangrui
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
>


Re: running lda in spark throws exception

2016-04-04 Thread Joseph Bradley
> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
> at com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:89)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
>
> ==here is my codes==
>
> SparkConf conf = new SparkConf().setAppName(ReviewLDA.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> // Load and parse the data
> JavaRDD<String> data = sc.textFile(inputDir + "/*");
> JavaRDD<VectorUrl> parsedData = data.map(new Function<String, VectorUrl>() {
>   public VectorUrl call(String s) {
>     JsonParser parser = new JsonParser();
>     JsonObject jo = parser.parse(s).getAsJsonObject();
>     if (!jo.has("word_vec") || !jo.has("webpageUrl")) {
>       return null;
>     }
>     JsonArray word_vec = jo.get("word_vec").getAsJsonArray();
>     String url = jo.get("webpageUrl").getAsString();
>     double[] values = new double[word_vec.size()];
>     for (int i = 0; i < values.length; i++)
>       values[i] = word_vec.get(i).getAsInt();
>     return new VectorUrl(Vectors.dense(values), url);
>   }
> });
>
> // Index documents with unique IDs
> JavaPairRDD<Long, VectorUrl> id2doc =
>   JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
This is great feedback to hear.  I think there was discussion about moving
Pipelines outside of ML at some point, but I'll have to spend more time to
dig it up.

In the meantime, I thought I'd mention this JIRA here in case people have
feedback:
https://issues.apache.org/jira/browse/SPARK-14033
--> It's about merging the concepts of Estimator and Model.  It would be a
breaking change in 2.0, but it would help to simplify the API and reduce
code duplication.

Regarding making shared params public:
https://issues.apache.org/jira/browse/SPARK-7146
--> I'd like to do this for 2.0, though maybe not for all shared params

Joseph
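
For context, a minimal sketch of the copying workaround described in the
quoted mail below (the trait names here are made up; Spark's own
HasInputCol/HasOutputCol are private to spark.ml):

import org.apache.spark.ml.param.{Param, Params}

// Re-declared copies of the private shared-param traits, so that external
// Transformers can mix them in.
trait MyHasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait MyHasOutputCol extends Params {
  final val outputCol: Param[String] =
    new Param[String](this, "outputCol", "output column name")
  final def getOutputCol: String = $(outputCol)
}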

On Mon, Mar 28, 2016 at 12:49 AM, Michał Zieliński <
zielinski.mich...@gmail.com> wrote:

> Hi Maciej,
>
> Absolutely. We had to copy HasInputCol/s, HasOutputCol/s (along with a
> couple of others like HasProbabilityCol) to our repo. Which for most
> use-cases is good enough, but for some (e.g. operating on any Transformer
> that accepts either our or Sparks HasInputCol) makes the code clunky.
> Opening those traits to the public would be a big gain.
>
> Thanks,
> Michal
>
> On 28 March 2016 at 07:44, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> Never develop any custom Transformer (or UnaryTransformer in particular),
>> but I'd be for it if that's the case.
>>
>> Jacek
>> 28.03.2016 6:54 AM "Maciej Szymkiewicz" <mszymkiew...@gmail.com>
>> napisał(a):
>>
>>> Hi Jacek,
>>>
>>> In this context, don't you think it would be useful if at least some
>>> traits from org.apache.spark.ml.param.shared.sharedParams were
>>> public? HasInputCol(s) and HasOutputCol, for example. These are useful
>>> pretty much every time you create a custom Transformer.
>>>
>>> --
>>> Pozdrawiam,
>>> Maciej Szymkiewicz
>>>
>>>
>>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>>> > Hi Joseph,
>>> >
>>> > Thanks for the response. I'm one who doesn't understand all the
>>> > hype/need for Machine Learning...yet and through Spark ML(lib) glasses
>>> > I'm looking at ML space. In the meantime I've got few assignments (in
>>> > a project with Spark and Scala) that have required quite extensive
>>> > dataset manipulation.
>>> >
>>> > It was when I sank into using DataFrame/Dataset for data
>>> > manipulation, not RDD (I remember talking to Brian about how RDD is an
>>> > "assembly" language compared to the higher-level concept of
>>> > DataFrames with Catalyst and other optimizations). After a few days
>>> > with DataFrame I learnt he was so right! (sorry Brian, it took me
>>> > longer to understand your point).
>>> >
>>> > I started using DataFrames in far more places than one could ever
>>> > accept :-) I was so...carried away with DataFrames (esp. show vs
>>> > foreach(println) and UDFs via udf() function)
>>> >
>>> > And then, when I moved to Pipeline API and discovered Transformers.
>>> > And PipelineStage that can create pipelines of DataFrame manipulation.
>>> > They read so well that I'm pretty sure people would love using them
>>> > more often, but...they belong to MLlib so they are part of ML space
>>> > (not many devs tackled yet). I applied the approach to using
>>> > withColumn to have better debugging experience (if I ever need it). I
>>> > learnt it after having watched your presentation about Pipeline API.
>>> > It was so helpful in my RDD/DataFrame space.
>>> >
>>> > So, to promote a more extensive use of Pipelines, PipelineStages, and
>>> > Transformers, I was thinking about moving that part to SQL/DataFrame
>>> > API where they really belong. If not, I think people might miss the
>>> > beauty of the very fine and so helpful Transformers.
>>> >
>>> > Transformers are *not* a ML thing -- they are DataFrame thing and
>>> > should be where they really belong (for their greater adoption).
>>> >
>>> > What do you think?
>>> >
>>> >
>>> > Pozdrawiam,
>>> > Jacek Laskowski
>>> > 
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> >
>>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com>
>>> wrote:
>>> >> Th

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-25 Thread Joseph Bradley
There have been some comments about using Pipelines outside of ML, but I
have not yet seen a real need for it.  If a user does want to use Pipelines
for non-ML tasks, they still can use Transformers + PipelineModels.  Will
that work?

On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski  wrote:

> Hi,
>
> After few weeks with spark.ml now, I came to conclusion that
> Transformer concept from Pipeline API (spark.ml/MLlib) should be part
> of DataFrame (SQL) where they fit better. Are there any plans to
> migrate Transformer API (ML) to DataFrame (SQL)?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
The indexing I mentioned is more restrictive than that: each index
corresponds to a unique position in a binary tree.  (I.e., the first index
of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC)

You're correct that this restriction could be removed; with some careful
thought, we could probably avoid using indices altogether.  I just created
https://issues.apache.org/jira/browse/SPARK-14043  to track this.
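
A small sketch of that indexing arithmetic (not Spark's actual code, just the
scheme described above):

// Nodes are numbered 1-based in breadth-first order, so a node at depth d has
// an index in [2^d, 2^(d+1)).  At depth 31 the left-most index (2^31) no
// longer fits in a signed Int, which is where the maxDepth <= 30 limit
// comes from.
def leftChildIndex(index: Int): Int = 2 * index
def rightChildIndex(index: Int): Int = 2 * index + 1
def indexToDepth(index: Int): Int = 31 - Integer.numberOfLeadingZeros(index)

// indexToDepth(1) == 0, indexToDepth(4) == 2, indexToDepth(1 << 30) == 30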

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov <evgeny.a.moro...@gmail.com
> wrote:

> Hi, Joseph,
>
> I thought I understood, why it has a limit of 30 levels for decision tree,
> but now I'm not that sure. I thought that's because the decision tree
> stored in the array, which has length of type int, which cannot be more,
> than 2^31-1.
> But here are my new discoveries. I've trained two different random forest
> models of 50 trees and different maxDepth (20 and 30) and specified node
> size = 5. Here are couple of those trees
>
> Model with maxDepth = 20:
> depth=20, numNodes=471
> depth=19, numNodes=497
>
> Model with maxDepth = 30:
> depth=30, numNodes=11347
> depth=30, numNodes=10963
>
> It looks like the tree is not very well balanced, and I understand why that
> happens, but I'm surprised that the actual number of nodes is way less than
> 2^31 - 1. And now I'm not sure why the limitation actually exists. A tree
> consisting of 2^31 nodes would require 8 GB of memory just to store those
> indexes, so I'd say that depth isn't the biggest issue in such a case.
>
> Is it possible to work around or simply bypass the maxDepth limitation
> (without modifying the codebase) and train the tree until I hit the max
> number of nodes? I'd assume that in most cases I simply won't hit it, but
> the depth of the tree would be much more than 30.
>
>
> --
> Be well!
> Jean Morozov
>
> On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Hi Eugene,
>>
>> The maxDepth parameter exists because the implementation uses Integer
>> node IDs which correspond to positions in the binary tree.  This simplified
>> the implementation.  I'd like to eventually modify it to avoid depending on
>> tree node IDs, but that is not yet on the roadmap.
>>
>> There is not an analogous limit for the GLMs you listed, but I'm not very
>> familiar with the perceptron implementation.
>>
>> Joseph
>>
>> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I'm currently working on POC and try to use Random Forest
>>> (classification and regression). I also have to check SVM and Multiclass
>>> perceptron (other algos are less important at the moment). So far I've
>>> discovered that Random Forest has a limitation of maxDepth for trees and
>>> just out of curiosity I wonder why such a limitation has been introduced?
>>>
>>> An actual question is that I'm going to use Spark ML in production next
>>> year and would like to know if there are other limitations like maxDepth in
>>> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>>
>>> Thanks in advance for your time.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>
>>
>


Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users,

I want to bring attention to a proposal to merge the MLlib (spark.ml)
concepts of Estimator and Model in Spark 2.0.  Please comment & discuss on
SPARK-14033  (not in
this email thread).

*TL;DR:*
*Proposal*: Merge Estimator and Model under a single abstraction
(Estimator).
*Goals*: Simplify API by combining the tightly coupled concepts of
Estimator & Model.  Match other ML libraries like scikit-learn.  Simplify
mutability semantics.

*Details*: See https://issues.apache.org/jira/browse/SPARK-14033 for a
design document (Google doc & PDF).

Thanks in advance for feedback!
Joseph


Re: pull request template

2016-03-15 Thread Joseph Bradley
+1 for keeping the template

I figure any template will require conscientiousness & enforcement.

On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen  wrote:

> The template is a great thing as it gets instructions even more right
> in front of people.
>
> Another idea is to just write a checklist of items, like "did you
> describe your changes? did you test? etc." with instructions to delete
> the text and replace with a description. This keeps the boilerplate
> titles out of the commit message.
>
> The special character and post processing just takes that a step further.
>
> On Sat, Mar 12, 2016 at 1:31 AM, Marcelo Vanzin 
> wrote:
> > Hey all,
> >
> > Just wanted to ask: how do people like this new template?
> >
> > While I think it's great to have instructions for people to write
> > proper commit messages, I think the current template has a few
> > downsides.
> >
> > - I tend to write verbose commit messages already when I'm preparing a
> > PR. Now when I open the PR I have to edit the summary field to remove
> > all the boilerplate.
> > - The template ends up in the commit messages, and sometimes people
> > forget to remove even the instructions.
> >
> > Instead, what about changing the template a bit so that it just has
> > instructions prepended with some character, and have those lines
> > removed by the merge_spark_pr.py script? We could then even throw in a
> > link to the wiki as Sean suggested since it won't end up in the final
> > commit messages.
> >
> >
> > On Fri, Feb 19, 2016 at 11:53 AM, Reynold Xin 
> wrote:
> >> We can add that too - just need to figure out a good way so people don't
> >> leave a lot of the unnecessary "guideline" messages in the template.
> >>
> >> The contributing guide is great, but unfortunately it is not as
> noticeable
> >> and is often ignored. It's good to have this full-fledged contributing
> >> guide, and then have a very lightweight version of that in the form of
> >> templates to force contributors to think about all the important aspects
> >> outlined in the contributing guide.
> >>
> >>
> >>
> >>
> >> On Fri, Feb 19, 2016 at 2:36 AM, Sean Owen  wrote:
> >>>
> >>> All that seems fine. All of this is covered in the contributing wiki,
> >>> which is linked from CONTRIBUTING.md (and should be from the
> >>> template), but people don't seem to bother reading it. I don't mind
> >>> duplicating some key points, and even a more explicit exhortation to
> >>> read the whole wiki, before considering opening a PR. We spend way too
> >>> much time asking people to fix things they should have taken 60
> >>> seconds to do correctly in the first place.
> >>>
> >>> On Fri, Feb 19, 2016 at 10:33 AM, Iulian Dragoș
> >>>  wrote:
> >>> > It's a good idea. I would add in there the spec for the PR title. I
> >>> > always
> >>> > get wrong the order between Jira and component.
> >>> >
> >>> > Moreover, CONTRIBUTING.md is also lacking them. Any reason not to
> add it
> >>> > there? I can open PRs for both, but maybe you want to keep that info
> on
> >>> > the
> >>> > wiki instead.
> >>> >
> >>> > iulian
> >>> >
> >>> > On Thu, Feb 18, 2016 at 4:18 AM, Reynold Xin 
> >>> > wrote:
> >>> >>
> >>> >> Github introduced a new feature today that allows projects to define
> >>> >> templates for pull requests. I pushed a very simple template to the
> >>> >> repository:
> >>> >>
> >>> >>
> >>> >>
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
> >>> >>
> >>> >>
> >>> >> Over time I think we can see how this works and perhaps add a small
> >>> >> checklist to the pull request template so contributors are reminded
> >>> >> every
> >>> >> time they submit a pull request the important things to do in a pull
> >>> >> request
> >>> >> (e.g. having proper tests).
> >>> >>
> >>> >>
> >>> >>
> >>> >> ## What changes were proposed in this pull request?
> >>> >>
> >>> >> (Please fill in changes proposed in this fix)
> >>> >>
> >>> >>
> >>> >> ## How was the this patch tested?
> >>> >>
> >>> >> (Please explain how this patch was tested. E.g. unit tests,
> integration
> >>> >> tests, manual tests)
> >>> >>
> >>> >>
> >>> >> (If this patch involves UI changes, please attach a screenshot;
> >>> >> otherwise,
> >>> >> remove this)
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> >
> >>> > --
> >>> > Iulian Dragos
> >>> >
> >>> > --
> >>> > Reactive Apps on the JVM
> >>> > www.typesafe.com
> >>> >
> >>
> >>
> >
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Welcoming two new committers

2016-02-08 Thread Joseph Bradley
Congrats & welcome!

On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha 
wrote:

> great job guys! congrats and welcome!
>
> On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan  wrote:
>
>> Welcome.
>>
>> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> Congratulations Herman and Wenchen!
>>>
>>> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or 
>>> wrote:
>>>
 Welcome!

 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra :

> Congratulations to both. and welcome to group.
>
> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van
>> Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
>> Tungsten, adding new features, optimizations and APIs. Please join me in
>> welcoming Herman and Wenchen.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>

>>>
>>
>
>
> --
> Ram Sriharsha
> Architect, Spark and Data Science
> Hortonworks, 2550 Great America Way, 2nd Floor
> Santa Clara, CA 95054
> Ph: 408-510-8635
> email: har...@apache.org
>
> [image: https://www.linkedin.com/in/harsha340]
>  
> 
>
>


Re: Adding Naive Bayes sample code in Documentation

2016-01-29 Thread Joseph Bradley
JIRA created!  https://issues.apache.org/jira/browse/SPARK-13089
Feel free to pick it up if you're interested.  : )
Joseph
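
For reference, a minimal sketch of the kind of spark.ml example the page is
missing ("training" and "test" are hypothetical DataFrames with "label" and
"features" columns):

import org.apache.spark.ml.classification.NaiveBayes

val nb = new NaiveBayes()
  .setSmoothing(1.0)
  .setModelType("multinomial")

// val model = nb.fit(training)
// model.transform(test).select("features", "label", "prediction").show()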

On Wed, Jan 27, 2016 at 8:43 AM, Vinayak Agrawal  wrote:

> Hi,
> I was reading through Spark ML package and I couldn't find Naive Bayes
> examples documented on the spark documentation page.
> http://spark.apache.org/docs/latest/ml-classification-regression.html
>
> However, the API exists and can be used.
>
> https://spark.apache.org/docs/1.5.2/api/python/pyspark.ml.html#module-pyspark.ml.classification
>
> Can the examples be added in the latest documentation?
>
> --
> Vinayak Agrawal
>
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>


Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi,

This is more a question for the user list, not the dev list, so I'll CC
user.

If you're using mllib.clustering.LDAModel (RDD API), then can you make sure
you're using a LocalLDAModel (or convert to it from DistributedLDAModel)?
You can then call topicDistributions() on the new data.

If you're using ml.clustering.LDAModel (DataFrame API), then you can call
transform() on new data.

Does that work?

Joseph
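
A minimal sketch of the RDD-API path, assuming a model trained with the EM
optimizer (the helper and its arguments are made up; the load, toLocal, and
topicDistributions calls are the mllib API mentioned above):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def topicsForNewDocs(sc: SparkContext, modelPath: String,
                     newDocs: RDD[(Long, Vector)]): RDD[(Long, Vector)] = {
  // An EM-trained model loads as a DistributedLDAModel; convert it to a
  // LocalLDAModel before scoring documents it has not seen.
  val local: LocalLDAModel = DistributedLDAModel.load(sc, modelPath).toLocal
  local.topicDistributions(newDocs)   // per-document topic mixture
}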

On Tue, Jan 19, 2016 at 6:21 AM, doruchiulan  wrote:

> Hi,
>
> Just so you know, I am new to Spark, and also very new to ML (this is my
> first contact with ML).
>
> Ok, I am trying to write a DSL where you can run some commands.
>
> I did a command that trains the Spark LDA and it produces the topics I want
> and I saved it using the save method provided by the LDAModel.
>
> Now I want to load this LDAModel and use it to predict on a new set of
> data.
> I call the load method, obtain the LDAModel instance but here I am stuck.
>
> Isn't this possible? Am I wrong in the way I understood LDA, and we cannot
> reuse a trained LDA to analyse new data?
>
> If it's possible, can you point me to some documentation, or give me a hint
> on how I should do that.
>
> Thx
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-LDA-model-reuse-with-new-set-of-data-tp16047.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: running lda in spark throws exception

2015-12-29 Thread Joseph Bradley
Hi Li,

I'm wondering if you're running into the same bug reported here:
https://issues.apache.org/jira/browse/SPARK-12488

I haven't figured out yet what is causing it.  Do you have a small corpus
which reproduces this error, and which you can share on the JIRA?  If so,
that would help a lot in debugging this failure.

Thanks!
Joseph

On Sun, Dec 27, 2015 at 7:26 PM, Li Li  wrote:

> I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2.
> It throws an exception at the line:   Matrix topics = ldaModel.topicsMatrix();
> But in the YARN job history UI it shows as successful. What's wrong with it?
> I submit the job with
> .bin/spark-submit --class Myclass \
> --master yarn-client \
> --num-executors 2 \
> --driver-memory 4g \
> --executor-memory 4g \
> --executor-cores 1 \
>
>
> My codes:
>
>corpus.cache();
>
>
> // Cluster the documents into three topics using LDA
>
> DistributedLDAModel ldaModel = (DistributedLDAModel) new
>
> LDA().setOptimizer("em").setMaxIterations(iterNumber).setK(topicNumber).run(corpus);
>
>
> // Output topics. Each is a distribution over words (matching word
> count vectors)
>
> System.out.println("Learned topics (as distributions over vocab of
> " + ldaModel.vocabSize()
>
> + " words):");
>
>//Line81, exception here:Matrix topics = ldaModel.topicsMatrix();
>
> for (int topic = 0; topic < topicNumber; topic++) {
>
>   System.out.print("Topic " + topic + ":");
>
>   for (int word = 0; word < ldaModel.vocabSize(); word++) {
>
> System.out.print(" " + topics.apply(word, topic));
>
>   }
>
>   System.out.println();
>
> }
>
>
> ldaModel.save(sc.sc(), modelPath);
>
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException:
> (1025,0) not in [-58,58) x [-100,100)
>
> at
> breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
>
> at
> com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:81)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> 15/12/23 00:01:16 INFO spark.SparkContext: Invoking stop() from shutdown
> hook
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a
problem with the Parquet dependency.  What version of Parquet are you
building Spark 1.5 off of?  (I'm not that familiar with Parquet issues
myself, but hopefully a SQL person can chime in.)

On Tue, Dec 15, 2015 at 3:23 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I have recently upgraded the Spark version, but when I try to save a random
> forest model using the model save command, I am getting a NoSuchMethodError.
> My code works fine with the 1.3.x version.
>
>
>
> model.save(sc.sc(), "modelsavedir");
>
>
>
>
>
> ERROR:
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation -
> Aborting job.
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 22.0 (TID 230, localhost): java.lang.NoSuchMethodError:
> parquet.schema.Types$GroupBuilder.addField(Lparquet/schema/Type;)Lparquet/schema/Types$BaseGroupBuilder;
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>
> at
> org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:261)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
>
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Joseph Bradley
+1

On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin  wrote:

> +1
>
>
> On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra 
> wrote:
>
>> +1
>>
>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust > > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc3
>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>> *
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>
>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>
>>> ===
>>> == How can I help test this release? ==
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> == What justifies a -1 vote for this release? ==
>>> 
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==
>>> == Major changes to help you focus your testing ==
>>> ==
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>>- SPARK-2629  
>>>trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>- SPARK-12165 
>>>SPARK-12189  Fix
>>>bugs in eviction of storage memory by execution.
>>>- SPARK-12258  correct
>>>passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5Spark SQL
>>>
>>>- SPARK-11787  Parquet
>>>Performance - Improve Parquet scan performance when using flat
>>>schemas.
>>>- SPARK-10810 
>>>Session Management - Isolated default database (i.e. USE mydb) even
>>>on shared clusters.
>>>- SPARK-   Dataset
>>>API - A type-safe API (similar to RDDs) that performs many
>>>operations on serialized binary data and code generation (i.e. Project
>>>Tungsten).
>>>- SPARK-1  Unified
>>>Memory Management - Shared memory for execution and caching instead
>>>of exclusive division of the regions.
>>>- SPARK-11197  SQL
>>>Queries on Files - Concise syntax for running SQL queries over files
>>>of any supported format without registering a table.
>>>- SPARK-11745  Reading
>>>non-standard JSON files - Added options to read non-standard JSON
>>>files (e.g. single-quotes, unquoted attributes)
>>>- SPARK-10412 

Re: BIRCH clustering algorithm

2015-12-15 Thread Joseph Bradley
Hi Dzeno,

I'm not familiar with the algorithm myself, but if you have an important
use case for it, you could open a JIRA to discuss it.  However, if it is a
less common algorithm, I'd recommend first submitting it as a Spark package
(but publicizing the package on the user list).  If it gains traction, then
it could become a higher priority item for MLlib.

Thanks,
Joseph

On Mon, Dec 14, 2015 at 7:56 AM, Dženan Softić  wrote:

> Hi,
>
> As a part of the project, we are trying to create parallel implementation
> of BIRCH clustering algorithm [1]. We are mostly getting idea how to do it
> from this paper, which used CUDA to make BIRCH parallel [2]. ([2] is short
> paper, just section 4. is relevant).
>
> We would like to implement BIRCH on Spark. Would this be an interesting
> contribution for MLlib? Is there anyone already who tried to implement
> BIRCH on Spark?
>
> Any suggestions for implementation itself would be very much appreciated!
>
>
> [1] http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
> [2] http://boyuan.global-optimization.com/Mypaper/IDEAL2013-88.pdf
>
>
> Best,
> Dzeno
>
>


Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node
IDs which correspond to positions in the binary tree.  This simplified the
implementation.  I'd like to eventually modify it to avoid depending on
tree node IDs, but that is not yet on the roadmap.

There is not an analogous limit for the GLMs you listed, but I'm not very
familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov  wrote:

> Hello!
>
> I'm currently working on POC and try to use Random Forest (classification
> and regression). I also have to check SVM and Multiclass perceptron (other
> algos are less important at the moment). So far I've discovered that Random
> Forest has a limitation of maxDepth for trees and just out of curiosity I
> wonder why such a limitation has been introduced?
>
> An actual question is that I'm going to use Spark ML in production next
> year and would like to know if there are other limitations like maxDepth in
> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>
> Thanks in advance for your time.
> --
> Be well!
> Jean Morozov
>


Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Joseph Bradley
Thanks for reporting this!  I just added a JIRA:
https://issues.apache.org/jira/browse/SPARK-12159
That would be great if you could send a PR for it; thanks!
Joseph
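
For reference, a minimal sketch of the transformer (column names are
illustrative; the labels usually come from a fitted StringIndexerModel):

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
// .setLabels(indexerModel.labels)  // optional: reuse the fitted indexer's labels

// val indexerModel = indexer.fit(df)
// converter.transform(predictions).show()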

On Sat, Dec 5, 2015 at 5:02 AM, Benjamin Fradet 
wrote:

> Hi,
>
> I was wondering why the IndexToString
> 
>  label
> transformer was not documented in ml-features.md
> .
>
> If it's not intentional, having used it a few times, I'd be happy to
> submit a JIRA and the associated PR.
>
> Best,
> Ben.
>
> --
> Ben Fradet.
>


Re: Python API for Association Rules

2015-12-02 Thread Joseph Bradley
If you're working on a feature, please comment on the JIRA first (to avoid
conflicts / duplicate work).  Could you please copy what your wrote to the
JIRA to discuss there?
Thanks,
Joseph

On Wed, Dec 2, 2015 at 4:51 AM, caiquermarques95  wrote:

> Hello everyone!
> I'm developing to the Python API for association rules (
> https://issues.apache.org/jira/browse/SPARK-8855), but I found a doubt.
>
> Following the description of the issue, it says that an important method is
> "*FPGrowthModel.generateAssociationRules()*", of course. However, it is not
> clear whether a wrapper for the association rules will live in
> "*FPGrowthModelWrapper.scala*", and this is the problem.
>
> My idea is the following:
> 1) In the fpm.py file: a class "AssociationRules" with one method and a
> nested class:
> 1.1) A method train(data, minConfidence) that will generate the association
> rules for the data with the specified minConfidence (0.6 by default). This
> method will call "trainAssociationRules" from the *PythonMLLibAPI* with the
> parameters data and minConfidence, and will later return an FPGrowthModel.
> 1.2) A class Rule, which will be a namedtuple representing an (antecedent,
> consequent) pair.
>
> 2) Still in fpm.py, in the class FPGrowthModel, a new method will be
> added, called generateAssociationRules, that will map the Rules generated by
> calling the method "getAssociationRules" from FPGrowthModelWrapper to the
> namedtuple.
>
> Now here is my doubt: how do I make trainAssociationRules return an
> FPGrowthModel so that the wrapper just maps each rule received to the
> antecedent/consequent? I could not make the method trainAssociationRules
> return an FPGrowthModel. The wrapper for association rules is in
> FPGrowthModelWrapper, right?
>
> For illustration, I think something like this in *PythonMLLibAPI*:
>
>   def trainAssociationRules(
>       data: JavaRDD[FPGrowth.FreqItemset[Any]],
>       minConfidence: Double): [return type] = {
>     val model = new FPGrowthModel(data.rdd)
>       .generateAssociationRules(minConfidence)
>     new FPGrowthModelWrapper(model)
>   }
>
> And in FPGrowthModelWrapper, something like:
>
>  def getAssociationRules: [return type] = {
> SerDe.fromTuple2RDD(rule.map(x => (x.javaAntecedent,
> x.javaConsequent)))
>  }
>
> I know that will fail, but what is wrong with my idea?
> Any suggestions?
>
> Thanks for the help and the tips.
> Caique.
>
> --
> View this message in context: Python API for Association Rules
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: Problem in running MLlib SVM

2015-12-01 Thread Joseph Bradley
Oh, sorry about that.  I forgot that's the behavior when the threshold is
not set.  My guess would be that you need more iterations, or that the
regParam needs to be tuned.

I'd recommend testing on some of the LibSVM datasets.  They have a lot, and
you can find existing examples (and results) for many of them.
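
A minimal sketch combining those suggestions (mllib RDD API; the iteration
count and regParam values are illustrative, not tuned):

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainAndScore(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  val numIterations = 200   // far more than the 3 used in the original example
  val stepSize = 1.0
  val regParam = 0.01
  val model: SVMModel = SVMWithSGD.train(training, numIterations, stepSize, regParam)
  model.setThreshold(0.0)   // predict() returns 0.0/1.0 labels, not raw margins

  // Fraction of examples whose predicted label matches the true label.
  test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
}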

On Tue, Dec 1, 2015 at 12:02 PM, Tarek Elgamal <tarek.elga...@gmail.com>
wrote:

> Thanks. Actually, model.predict() gives a number between 0 and 1. However,
> model.predictPoint gives me a 0/1 value, but the accuracy is still
> very low. I am using the training data just to make sure that I am using it
> right, but it still seems not to work for me.
> @Joseph, do you have any benchmark data that you tried SVM on. I am
> attaching my toy data with just 100 examples. I tried it with different
> data and bigger data and still getting accuracy around 57% on training set.
>
> On Mon, Nov 30, 2015 at 6:33 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> model.predict should return a 0/1 predicted label.  The example code is
>> misleading when it calls the prediction a "score."
>>
>> On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem <fazl...@wso2.com> wrote:
>>
>>> You should never use the training data to measure your prediction
>>> accuracy. Always use a fresh dataset (test data) for this purpose.
>>>
>>> On Sun, Nov 29, 2015 at 8:36 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> I think this should represent the label of LabledPoint (0 means
>>>> negative 1 means positive)
>>>> http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
>>>>
>>>> The document you mention is for the mathematical formula, not the
>>>> implementation.
>>>>
>>>> On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal <tarek.elga...@gmail.com
>>>> > wrote:
>>>>
>>>>> According to the documentation
>>>>> <http://spark.apache.org/docs/latest/mllib-linear-methods.html>, by
>>>>> default, if wTx≥0 then the outcome is positive, and negative otherwise. I
>>>>> suppose that wTx is the "score" in my case. If score is more than 0 and 
>>>>> the
>>>>> label is positive, then I return 1 which is correct classification and I
>>>>> return zero otherwise. Do you have any idea how to classify a point as
>>>>> positive or negative using this score or another function ?
>>>>>
>>>>> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> if((score >=0 && label == 1) || (score <0 && label == 0))
>>>>>>  {
>>>>>>   return 1; //correct classiciation
>>>>>>  }
>>>>>>  else
>>>>>>   return 0;
>>>>>>
>>>>>>
>>>>>>
>>>>>> I suspect score is always between 0 and 1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal <
>>>>>> tarek.elga...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to run the straightforward example of SVm but I am
>>>>>>> getting low accuracy (around 50%) when I predict using the same data I 
>>>>>>> used
>>>>>>> for training. I am probably doing the prediction in a wrong way. My 
>>>>>>> code is
>>>>>>> below. I would appreciate any help.
>>>>>>>
>>>>>>>
>>>>>>> import java.util.List;
>>>>>>>
>>>>>>> import org.apache.spark.SparkConf;
>>>>>>> import org.apache.spark.SparkContext;
>>>>>>> import org.apache.spark.api.java.JavaRDD;
>>>>>>> import org.apache.spark.api.java.function.Function;
>>>>>>> import org.apache.spark.api.java.function.Function2;
>>>>>>> import org.apache.spark.mllib.classification.SVMModel;
>>>>>>> import org.apache.spark.mllib.classification.SVMWithSGD;
>>>>>>> import org.apache.spark.mllib.regression.LabeledPoint;
>>>>>>> import org.apache.spark.mllib.util.MLUtils;
>>>>>>>
>>>>>>> import s

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a
MulticlassClassificationEvaluator, which expects a prediction column, not a
rawPrediction column.  There's a JIRA for making
BinaryClassificationEvaluator accept prediction instead of rawPrediction.
Joseph
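
A minimal sketch of that combination (spark.ml 1.5-era API; the grid values
and column names are illustrative):

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(3, 5))
  .addGrid(gbt.maxIter, Array(10, 20))
  .build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("precision")   // 1.5-era name for the accuracy-style metric

val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingDF)   // trainingDF is a user-provided DataFrame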

On Tue, Dec 1, 2015 at 5:10 AM, Benjamin Fradet <benjamin.fra...@gmail.com>
wrote:

> Someone correct me if I'm wrong but no there isn't one that I am aware of.
>
> Unless someone is willing to explain how to obtain the raw prediction
> column with the GBTClassifier. In this case I'd be happy to work on a PR.
> On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>
>> Hi Benjamin,
>>
>> Thanks, the documentation you sent is clear.
>> Is there any other way to perform a Grid Search with GBT?
>>
>>
>> Ndjido
>> On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet <benjamin.fra...@gmail.com>
>> wrote:
>>
>>> Hi Ndjido,
>>>
>>> This is because GBTClassifier doesn't yet have a rawPredictionCol like
>>> the. RandomForestClassifier has.
>>> Cf:
>>> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
>>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>>>
>>>> Hi Joseph,
>>>>
>>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting
>>>> a "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
>>>> Gradient Boosting Trees classifier.
>>>>
>>>>
>>>> Ardo
>>>> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley <jos...@databricks.com>
>>>> wrote:
>>>>
>>>>> It should work with 1.5+.
>>>>>
>>>>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar <ndj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Does anyone know whether the Grid Search capability is enabled since
>>>>>> the issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>>>>>> column doesn't exist" when trying to perform a grid search with Spark 
>>>>>> 1.4.0.
>>>>>>
>>>>>> Cheers,
>>>>>> Ardo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>


Re: Problem in running MLlib SVM

2015-11-30 Thread Joseph Bradley
model.predict should return a 0/1 predicted label.  The example code is
misleading when it calls the prediction a "score."

On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem  wrote:

> You should never use the training data to measure your prediction
> accuracy. Always use a fresh dataset (test data) for this purpose.
>
> On Sun, Nov 29, 2015 at 8:36 AM, Jeff Zhang  wrote:
>
>> I think this should represent the label of LabledPoint (0 means negative
>> 1 means positive)
>> http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
>>
>> The document you mention is for the mathematical formula, not the
>> implementation.
>>
>> On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal 
>> wrote:
>>
>>> According to the documentation
>>> , by
>>> default, if wTx≥0 then the outcome is positive, and negative otherwise. I
>>> suppose that wTx is the "score" in my case. If score is more than 0 and the
>>> label is positive, then I return 1 which is correct classification and I
>>> return zero otherwise. Do you have any idea how to classify a point as
>>> positive or negative using this score or another function ?
>>>
>>> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang  wrote:
>>>
 if((score >=0 && label == 1) || (score <0 && label == 0))
  {
   return 1; //correct classiciation
  }
  else
   return 0;



 I suspect score is always between 0 and 1



 On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal <
 tarek.elga...@gmail.com> wrote:

> Hi,
>
> I am trying to run the straightforward example of SVm but I am getting
> low accuracy (around 50%) when I predict using the same data I used for
> training. I am probably doing the prediction in a wrong way. My code is
> below. I would appreciate any help.
>
>
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.api.java.function.Function2;
> import org.apache.spark.mllib.classification.SVMModel;
> import org.apache.spark.mllib.classification.SVMWithSGD;
> import org.apache.spark.mllib.regression.LabeledPoint;
> import org.apache.spark.mllib.util.MLUtils;
>
> import scala.Tuple2;
> import edu.illinois.biglbjava.readers.LabeledPointReader;
>
> public class SimpleDistSVM {
>   public static void main(String[] args) {
> SparkConf conf = new SparkConf().setAppName("SVM Classifier
> Example");
> SparkContext sc = new SparkContext(conf);
> String inputPath=args[0];
>
> // Read training data
> JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();
>
> // Run training algorithm to build the model.
> int numIterations = 3;
> final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);
>
> // Clear the default threshold.
> model.clearThreshold();
>
>
> // Predict points in test set and map to an RDD of 0/1 values,
> // where 0 is misclassification and 1 is correct classification
> JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
>  public Integer call(LabeledPoint p) {
>int label = (int) p.label();
>Double score = model.predict(p.features());
>if((score >=0 && label == 1) || (score <0 && label == 0))
>{
>return 1; //correct classiciation
>}
>else
> return 0;
>
>  }
>}
>  );
> // sum up all values in the rdd to get the number of correctly
> classified examples
>  int sum = classification.reduce(new Function2<Integer, Integer, Integer>()
> {
> public Integer call(Integer arg0, Integer arg1)
> throws Exception {
> return arg0+arg1;
> }});
>
>  //compute accuracy as the percentage of the correctly classified
> examples
>  double accuracy=((double)sum)/((double)classification.count());
>  System.out.println("Accuracy = " + accuracy);
>
> }
>   }
> );
>   }
> }
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Thanks & Regards,
>
> Fazlan Nazeem
>
> *Software Engineer*
>
> *WSO2 Inc*
> Mobile : +94772338839
> fazl...@wso2.com
>

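For reference, a minimal Scala sketch of the evaluation suggested above: train on one split, clear the threshold, and score a held-out split by comparing the sign of the raw margin with the 0/1 label. The input path, split ratios, and iteration count are placeholders, not values from the original code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

object SVMHoldoutEval {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SVM Holdout Eval"))

    // Placeholder path; labels in the libsvm file are expected to be 0/1.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val model = SVMWithSGD.train(training, 100)
    model.clearThreshold() // predict() now returns the raw margin wTx

    // A margin >= 0 is treated as the positive class (label 1.0).
    val accuracy = test.map { p =>
      val score = model.predict(p.features)
      if ((score >= 0 && p.label == 1.0) || (score < 0 && p.label == 0.0)) 1.0 else 0.0
    }.mean()

    println(s"Held-out accuracy = $accuracy")
    sc.stop()
  }
}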

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+.

On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar  wrote:

>
> Hi folks,
>
> Does anyone know whether the Grid Search capability has been enabled since
> issue SPARK-9011 in version 1.4.0? I'm getting the "rawPredictionCol
> column doesn't exist" error when trying to perform a grid search with Spark 1.4.0.
>
> Cheers,
> Ardo
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

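For anyone hitting the same error, a minimal sketch of a grid search over a RandomForestClassifier with the spark.ml CrossValidator, which is the 1.5+ path referred to above; the column names, parameter values, and the `training` DataFrame are assumptions, not part of the original report.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assumes a DataFrame `training` with a "label" column and a "features" vector column.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new BinaryClassificationEvaluator()) // scores the rawPrediction column
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)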

Re: Unhandled case in VectorAssembler

2015-11-20 Thread Joseph Bradley
Yes, please, could you send a JIRA (and PR)?  A custom error message would
be better.
Thank you!
Joseph

On Fri, Nov 20, 2015 at 2:39 PM, BenFradet 
wrote:

> Hey there,
>
> I noticed that there is an unhandled case in the transform method of
> VectorAssembler if one of the input columns doesn't have one of the
> supported types DoubleType, NumericType, BooleanType, or VectorUDT.
>
> So, if you try to transform a column of StringType you get a cryptic
> "scala.MatchError: StringType".
> I was wondering if we shouldn't throw a custom exception indicating that
> this is not a supported type.
>
> I can submit a JIRA and PR if needed.
>
> Best regards,
> Ben.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Unhandled-case-in-VectorAssembler-tp15302.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

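Until such an error message exists, a minimal workaround sketch is to index any string columns with StringIndexer before they reach VectorAssembler; the DataFrame `df` and the column names below are placeholders.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Assumes a DataFrame `df` with a numeric column "age" and a string column "country".
val indexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "countryIndex")) // only numeric, boolean, or vector columns
  .setOutputCol("features")

val assembled = new Pipeline()
  .setStages(Array(indexer, assembler))
  .fit(df)
  .transform(df)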

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi,
Could you please submit this via JIRA as a bug report?  It will be very
helpful if you include the Spark version, system details, and other info
too.
Thanks!
Joseph

On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> *Issue:*
>
> I have a random forest model that I am trying to load during streaming using
> the following code. The code works fine when I run it from Eclipse, but I get
> an NPE when running it using spark-submit.
>
>
>
> JavaStreamingContext jssc = new JavaStreamingContext(jsc,
> Durations.seconds(duration));
>
> System.out.println("& trying to get the context
> &&& ");
>
> final RandomForestModel model =
> RandomForestModel.load(jssc.sparkContext().sc(),
> MODEL_DIRECTORY); // line 116 causing the issue.
>
> System.out.println("& model debug
> &&& " + model.toDebugString());
>
>
>
>
>
> *Exception Details:*
>
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0,
> whose tasks have all completed, from pool
>
> Exception in thread "main" java.lang.NullPointerException
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData.toSplit(DecisionTreeModel.scala:144)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at scala.Option.map(Option.scala:145)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:287)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTree(DecisionTreeModel.scala:268)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:251)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:250)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:250)
>
> at
> org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.loadTrees(treeEnsembleModels.scala:340)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel$.load(treeEnsembleModels.scala:72)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel.load(treeEnsembleModels.scala)
>
> at
> com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:116)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Nov 19, 2015 1:10:56 PM WARNING: parquet.hadoop.ParquetRecordReader: Can
> not initialize counter due 

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're
willing)?
Thanks,
Joseph

On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier 
wrote:

> Hi,
>
> MLUtils.loadLibSVMFile verifies that indices are 1-based and
> increasing, and otherwise triggers an error. I'd like to suggest that
> the error message be a little more informative. I ran into this when
> loading a malformed file. Exactly what gets printed isn't too crucial;
> maybe you would want to print something else. All that matters is to
> give some context so that the user can find the problem more quickly.
>
> Hope this helps in some way.
>
> Robert Dodier
>
> PS.
>
> diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> index 81c2f0c..6f5f680 100644
> --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> @@ -91,7 +91,7 @@ object MLUtils {
>  val indicesLength = indices.length
>  while (i < indicesLength) {
>val current = indices(i)
> -  require(current > previous, "indices should be one-based
> and in ascending order" )
> +  require(current > previous, "indices should be one-based
> and in ascending order; found current=" + current + ", previous=" +
> previous + "; line=\"" + line + "\"" )
>previous = current
>i += 1
>  }
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

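As a small variation on the patch above, the same check written with string interpolation; this is only a formatting sketch (single quotes around the offending line to keep it simple), not part of the submitted diff.

require(current > previous,
  s"indices should be one-based and in ascending order; " +
  s"found current=$current, previous=$previous; line='$line'")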

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about
"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?
"""

--> MLlib trees do this currently, so you could check out that code as an
example.
I'm not sure how this would work as a generic transformer, though; it seems
more like an internal part of space-partitioning algorithms.



On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu 
wrote:

> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree based
> methods. How about I open a jira for this?
>
> 2) For L2/L1 regularization vs learning rate (I use this name instead of
> shrinkage to avoid confusion), I have the following observations:
>
> Suppose G and H are the sum (over the data assigned to a leaf node) of
> the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
> Then for this leaf node,
>
> * With a learning rate eta, f_{m+1} = f_m - G/H*eta
>
> * With a L2 regularization coefficient lambda, f_{m+1} =f_m - G/(H+lambda)
>
> If H>0 (convex loss), both approaches lead to "shrinkage":
>
> * For the learning rate approach, the percentage of shrinkage is
> uniform for any leaf node.
>
> * For L2 regularization, the percentage of shrinkage would adapt to
> the number of instances assigned to a leaf node: more instances =>
> larger G and H => less shrinkage. This behavior is intuitive to me. If
> the value estimated from this node is based on a large amount of data,
> the value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
>
> I am not aware of theoretical results to conclude which method is
> better. Likely to be dependent on the data at hand. Implementing
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
>
> On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> > Hi Meihua,
> >
> > For categorical features, the ordinal issue can be solved by trying
> > all 2^(q-1) - 1 different partitions of the q values into two
> > groups. However, it's computationally expensive. In Hastie's book, in
> > 9.2.4, the trees can be trained by sorting the residuals and being
> > learnt as if they are ordered. It can be proven that it will give the
> > optimal solution. I have a proof that this works for learning
> > regression trees through variance reduction.
> >
> > I'm also interested in understanding how the L1 and L2 regularization
> > within the boosting works (and if it helps with overfitting more than
> > shrinkage).
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 0xAF08DF8D
> >
> >
> > On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu 
> wrote:
> >> Hi DB Tsai,
> >>
> >> Thank you very much for your interest and comment.
> >>
> >> 1) feature sub-sample is per-node, like random forest.
> >>
> >> 2) The current code heavily exploits the tree structure to speed up
> >> the learning (such as processing multiple learning nodes in one pass of
> >> the training data). So a generic GBM is likely to be a different
> >> codebase. Do you have any nice reference of efficient GBM? I am more
> >> than happy to look into that.
> >>
> >> 3) The algorithm accepts training data as a DataFrame with the
> >> featureCol indexed by VectorIndexer. You can specify which variable is
> >> categorical in the VectorIndexer. Please note that currently all
> >> categorical variables are treated as ordered. If you want some
> >> categorical variables as unordered, you can pass the data through
> >> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> >> unordered categorical variable using the approach in RF in Spark ML
> >> (Please see roadmap in the README.md)
> >>
> >> Thanks,
> >>
> >> Meihua
> >>
> >>
> >>
> >> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> >>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> >>> you think you can implement a generic GBM and have it merged as part of
> >>> the Spark codebase?
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> --
> >>> Web: https://www.dbtsai.com
> >>> PGP Key ID: 0xAF08DF8D
> >>>
> >>>
> >>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> >>>  wrote:
>  Hi Spark User/Dev,
> 
>  Inspired by the success 

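To make the comparison above concrete, a small Scala sketch of the two leaf updates; G, H, eta, and lambda below are illustrative numbers, not values from any real run.

// Leaf-value updates discussed above, for a single leaf:
//   learning rate:      f_{m+1} = f_m - (G / H) * eta
//   L2 regularization:  f_{m+1} = f_m - G / (H + lambda)
// With more instances in a leaf, G and H grow roughly in proportion, so the
// L2 step approaches the unregularized step (less shrinkage), while the
// learning-rate step is always shrunk by the same fixed fraction.
val eta = 0.1
val lambda = 10.0
for ((g, h) <- Seq((2.0, 4.0), (200.0, 400.0))) {
  val unregularized = g / h
  val etaStep = unregularized * eta
  val l2Step = g / (h + lambda)
  println(s"G=$g H=$h unregularized=$unregularized etaStep=$etaStep l2Step=$l2Step")
}
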
Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio,

Apart from apologies about limited review bandwidth (from me too!), I
wanted to add: It would be interesting to hear what feedback you've gotten
from users of your package.  Perhaps you could collect feedback by (a)
emailing the user list and (b) adding a note in the Spark Packages pointing
to the JIRA, and encouraging users to add their comments directly to the
JIRA.  That'd be a nice way to get a sense of use cases and priority.

Thanks for your patience,
Joseph

On Wed, Nov 4, 2015 at 7:23 AM, Sergio Ramírez  wrote:

> OK, for me, time is not a problem. I was just worried that there was no
> movement on those issues. I think they are good contributions. For example,
> I have found no complex discretization algorithm in MLlib, which is rare.
> My algorithm, a Spark implementation of the well-known discretizer developed
> by Fayyad and Irani, could be considered a good starting point for the
> discretization part. Furthermore, this is also supported by two scientific
> articles.
>
> Anyway, I uploaded these two algorithms as two different packages to
> spark-packages.org, but I would like to contribute directly to MLlib. I
> understand you have a lot of requests, and it is not possible to include
> all the contributions made by the Spark community.
>
> I'll be patient and ready to collaborate.
>
> Thanks again
>
>
> On 03/11/15 16:30, Jerry Lam wrote:
>
> Sergio, you are not alone for sure. Check the RowSimilarity implementation
> [SPARK-4823]. It has been there for 6 months. It is very likely that
> contributions which don't get merged into the version of Spark they were
> developed against will never be merged, because Spark changes quite
> significantly from version to version if the algorithm depends a lot on
> internal APIs.
>
> On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:
>
>> Sergio,
>>
>> Usually it takes a lot of effort to get something merged into Spark
>> itself, especially for relatively new algorithms that might not have
>> established themselves yet. I will leave it to the MLlib maintainers to comment on
>> the specifics of the individual algorithms proposed here.
>>
>> Just another general comment: we have been working on making packages be
>> as easy to use as possible for Spark users. Right now it only requires a
>> simple flag to pass to the spark-submit script to include a package.
>>
>>
>> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez < 
>> sramire...@ugr.es> wrote:
>>
>>> Hello all:
>>>
>>> I developed two packages for MLlib in March. These have also been uploaded
>>> to the spark-packages repository. Associated with these packages, I created
>>> two JIRA issues and the corresponding pull requests, which are listed
>>> below:
>>>
>>> https://github.com/apache/spark/pull/5184
>>> https://github.com/apache/spark/pull/5170
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6531
>>> https://issues.apache.org/jira/browse/SPARK-6509
>>>
>>> These remain unassigned in JIRA and unverified in GitHub.
>>>
>>> Could anyone explain why are they in this state yet? Is it normal?
>>>
>>> Thanks!
>>>
>>> Sergio R.
>>>
>>> --
>>>
>>> Sergio Ramírez Gallego
>>> Research group on Soft Computing and Intelligent Information Systems,
>>> Dept. Computer Science and Artificial Intelligence,
>>> University of Granada, Granada, Spain.
>>> Email: srami...@decsai.ugr.es
>>> Research Group URL: http://sci2s.ugr.es/
>>>
>>> -
>>>
>>> This email and any file attached to it (when applicable) contain(s)
>>> confidential information that is exclusively addressed to its
>>> recipient(s). If you are not the indicated recipient, you are informed
>>> that reading, using, disseminating and/or copying it without
>>> authorisation is forbidden in accordance with the legislation in effect.
>>> If you have received this email by mistake, please immediately notify
>>> the sender of the situation by resending it to their email address.
>>> Avoid printing this message if it is not absolutely necessary.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: 
>>> dev-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
>
> Sergio Ramírez Gallego
> Research group 

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Joseph Bradley
+1 tested on OS X

On Sat, Nov 7, 2015 at 10:25 AM, Reynold Xin  wrote:

> +1 myself too
>
> On Sat, Nov 7, 2015 at 12:01 AM, Robin East 
> wrote:
>
>> +1
>> Mac OS X 10.10.5 Yosemite
>>
>> mvn clean package -DskipTests (13min)
>>
>> Basic graph tests
>>   Load graph using edgeListFile...SUCCESS
>>   Run PageRank...SUCCESS
>> Connected Components tests
>>   Kaggle social circles competition...SUCCESS
>> Minimum Spanning Tree Algorithm
>>   Run basic Minimum Spanning Tree algorithm...SUCCESS
>>   Run Minimum Spanning Tree taxonomy creation...SUCCESS
>>
>>
>> ---
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 6 Nov 2015, at 17:27, Chester Chen  wrote:
>>
>> +1
>> Tested against CDH 5.4.2 with Hadoop 2.6.0 using yesterday's code,
>> built locally.
>>
>> Regression tests were run in YARN cluster mode against a few internal ML
>> jobs (logistic regression, linear regression, random forest, and statistics
>> summary) as well as MLlib KMeans. All seem to work fine.
>>
>> Chester
>>
>>
>> On Tue, Nov 3, 2015 at 3:22 PM, Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The release fixes 59 known issues in Spark 1.5.1, listed here:
>>> http://s.apache.org/spark-1.5.2
>>>
>>> The tag to be voted on is v1.5.2-rc2:
>>> https://github.com/apache/spark/releases/tag/v1.5.2-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> - as version 1.5.2-rc2:
>>> https://repository.apache.org/content/repositories/orgapachespark-1153
>>> - as version 1.5.2:
>>> https://repository.apache.org/content/repositories/orgapachespark-1152
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
>>>
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> What justifies a -1 vote for this release?
>>> 
>>> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
>>> present in 1.5.1 will not block this release.
>>>
>>>
>>>
>>
>>
>


Re: Gradient Descent with large model size

2015-10-15 Thread Joseph Bradley
For those numbers of partitions, I don't think you'll actually use tree
aggregation.  The number of partitions needs to be over a certain threshold
(>= 7) before treeAggregate really operates on a tree structure:
https://github.com/apache/spark/blob/9808052b5adfed7dafd6c1b3971b998e45b2799a/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1100

Do you see a slower increase in running time with more partitions?  For 5
partitions, do you find things improve if you tell treeAggregate to use
depth > 2?

Joseph

On Wed, Oct 14, 2015 at 1:18 PM, Ulanov, Alexander  wrote:

> Dear Spark developers,
>
>
>
> I have noticed that Gradient Descent in Spark MLlib takes a long time if the
> model is large. It is implemented with treeAggregate. I’ve extracted the
> code from GradientDescent.scala to perform the benchmark. It allocates an
> Array of a given size and then aggregates it:
>
>
>
> val dataSize = 1200
>
> val n = 5
>
> val maxIterations = 3
>
> val rdd = sc.parallelize(0 until n, n).cache()
>
> rdd.count()
>
> var avgTime = 0.0
>
> for (i <- 1 to maxIterations) {
>
>   val start = System.nanoTime()
>
>   val result = rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
>
> seqOp = (c, v) => {
>
>   // c: (grad, loss, count)
>
>   val l = 0.0
>
>   (c._1, c._2 + l, c._3 + 1)
>
> },
>
> combOp = (c1, c2) => {
>
>   // c: (grad, loss, count)
>
>   (c1._1, c1._2 + c2._2, c1._3 + c2._3)
>
> })
>
>   avgTime += (System.nanoTime() - start) / 1e9
>
>   assert(result._1.length == dataSize)
>
> }
>
> println("Avg time: " + avgTime / maxIterations)
>
>
>
> If I run on my cluster of 1 master and 5 workers, I get the following
> results (given the array size = 12M):
>
> n = 1: Avg time: 4.55570966733
>
> n = 2: Avg time: 7.05972458467
>
> n = 3: Avg time: 9.93711737767
>
> n = 4: Avg time: 12.687526233
>
> n = 5: Avg time: 12.93952612967
>
>
>
> Could you explain why the time becomes so big? The data transfer of a 12M
> array of doubles should take ~1 second on a 1 Gbit network. There might be
> other overheads, but not as big as what I observe.
>
> Best regards, Alexander
>

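A minimal sketch of the depth suggestion above, so the comparison is easy to script: the same aggregation run through treeAggregate with an explicit depth. The partition count, array size, and depths are placeholders, and an existing SparkContext `sc` is assumed.

// Assumes an existing SparkContext `sc`.
val dataSize = 1200000  // placeholder model size
val n = 16              // >= 7 partitions, per the threshold mentioned above
val rdd = sc.parallelize(0 until n, n).cache()
rdd.count()

def run(depth: Int): Double = {
  val start = System.nanoTime()
  rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
    seqOp = (c, v) => (c._1, c._2, c._3 + 1),
    combOp = (c1, c2) => (c1._1, c1._2 + c2._2, c1._3 + c2._3),
    depth = depth)
  (System.nanoTime() - start) / 1e9
}

println(s"depth=2: ${run(2)} s, depth=3: ${run(3)} s")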

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu,

The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames.  When creating this API, we decided to separate it
from the old API to avoid confusion.  You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html

For (3): We use Breeze, but we have to modify it in order to do distributed
optimization based on Spark.

Joseph

On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:

> Hi everyone,
>
> I'm curious about the difference between
> ml.classification.LogisticRegression and
> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> optimized using LBFGS; the only difference I see is that LogisticRegression
> takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD.
>
> So I wonder,
> 1. Why not simply add a DataFrame training interface to
> LogisticRegressionWithLBFGS?
> 2. What's the difference between the ml.classification and
> mllib.classification packages?
> 3. Why doesn't ml.classification.LogisticRegression call
> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> in mllib.optimization.{LBFGS,OWLQN}.
>
> Thank you.
>
> Best,
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

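For reference, a short sketch contrasting the two entry points discussed above: the RDD-based spark.mllib trainer and the DataFrame-based spark.ml estimator. The data path is a placeholder, and an existing SparkContext `sc` and SQLContext `sqlContext` are assumed.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// RDD-based API (spark.mllib)
val points = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val mllibModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(points)

// DataFrame-based Pipelines API (spark.ml); the reflected DataFrame has
// "label" and "features" columns.
val df = sqlContext.createDataFrame(points)
val mlModel = new LogisticRegression()
  .setMaxIter(100)
  .fit(df)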
