Re: Spark JIRA Report

2014-12-13 Thread Andrew Ash
The goal of increasing visibility on open issues is a good one.  How is
this different from just a link to Jira though?  Some might say this adds
noise to the mailing list and doesn't contain any information not already
available in Jira.

The idea seems good but the formatting leaves a little to be desired.  If
you aren't opposed to using HTML, I might suggest this more compact format:

SPARK-2044  Pluggable interface for shuffles
SPARK-2365  Add IndexedRDD, an efficient updatable key-value store
SPARK-3561  Allow for pluggable execution contexts in Spark
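
For anyone scripting the report, a minimal sketch of emitting rows in that
compact format (the issue list is hard-coded for illustration; a real report
would pull it from JIRA):

    # Minimal sketch of emitting the compact HTML rows suggested above.
    # The issue list is hard-coded for illustration only.
    issues = [
        ("SPARK-2044", "Pluggable interface for shuffles"),
        ("SPARK-2365", "Add IndexedRDD, an efficient updatable key-value store"),
        ("SPARK-3561", "Allow for pluggable execution contexts in Spark"),
    ]

    for key, summary in issues:
        url = "https://issues.apache.org/jira/browse/%s" % key
        print('<a href="%s">%s</a> %s<br/>' % (url, key, summary))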

Andrew

On Sat, Dec 13, 2014 at 11:31 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> What do y’all think of a report like this emailed out to the dev list on a
> monthly basis?
>
> The goal would be to increase visibility into our open issues and encourage
> developers to tend to our issue tracker more frequently.
>
> Nick
>
> There are 1,236 unresolved issues
> <https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC>
> in the Spark project on JIRA.
>
> Recently Updated Issues
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC>
>
> Type         Key         Priority  Summary                                                           Last Updated
> Bug          SPARK-4841  Major     Batch serializer bug in PySpark’s RDD.zip                         Dec 14, 2014
> Question     SPARK-4810  Major     Failed to run collect                                             Dec 14, 2014
> Bug          SPARK-785   Major     ClosureCleaner not invoked on most PairRDDFunctions               Dec 14, 2014
> New Feature  SPARK-3405  Minor     EC2 cluster creation on VPC                                       Dec 13, 2014
> Improvement  SPARK-1555  Minor     enable ec2/spark_ec2.py to stop/delete cluster non-interactively  Dec 13, 2014
>
> Stale Issues
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20updated%20%3C%3D%20-90d%20ORDER%20BY%20updated%20ASC>
>
> Type         Key         Priority  Summary                                                           Last Updated
> Bug          SPARK-560   None      Specialize RDDs / iterators                                       Oct 22, 2012
> New Feature  SPARK-540   None      Add API to customize in-memory representation of RDDs             Oct 22, 2012
> Improvement  SPARK-573   None      Clarify semantics of the parallelized closures                    Oct 22, 2012
> New Feature  SPARK-609   Minor     Add instructions for enabling Akka debug logging                  Nov 06, 2012
> New Feature  SPARK-636   Major     Add mechanism to run system management/configuration tasks on all workers  Dec 17, 2012
>
> Most Watched Issues
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20watchers%20DESC>
>
> Type         Key         Priority  Summary                                                             Watchers
> New Feature  SPARK-3561  Major     Allow for pluggable execution contexts in Spark                     75
> New Feature  SPARK-2365  Major     Add IndexedRDD, an efficient updatable key-value store              33
> Improvement  SPARK-2044  Major     Pluggable interface for shuffles                                    30
> New Feature  SPARK-1405  Critical  parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib   26
> New Feature  SPARK-1406  Major     PMML model evaluation support via MLib                              21
>
> Most Voted Issues
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20votes%20DESC>
>
> Type         Key         Priority  Summary                                                             Votes
> Bug          SPARK-2541  Major     Standalone mode can’t access secure HDFS anymore                    12
> New Feature  SPARK-2365  Major     Add IndexedRDD, an efficient updatable key-value store              9
> Improvement  SPARK-3533  Major     Add saveAsTextFileByKey() method to RDDs                            8
> Bug          SPARK-2883  Blocker   Spark Support for ORCFile format                                    6
> New Feature  SPARK-1442  Major     Add Window function support                                         6
>


Governance of the Jenkins whitelist

2014-12-13 Thread Andrew Ash
Jenkins is a really valuable tool for increasing the quality of incoming
patches to Spark, but I've noticed that a lot of patches are often left
waiting because they haven't been approved for testing.

Certain users can instruct Jenkins to run on a PR, or add other users to a
whitelist. How does governance work for that list of admins?  Meaning who
is currently on it, and what are the requirements to be on that list?

Can I be given permission to allow Jenkins to run on certain PRs? I've often
come across well-intentioned PRs that are languishing because Jenkins has
yet to run on them.

Andrew


Spark JIRA Report

2014-12-13 Thread Nicholas Chammas
What do y’all think of a report like this emailed out to the dev list on a
monthly basis?

The goal would be to increase visibility into our open issues and encourage
developers to tend to our issue tracker more frequently.

Nick

There are 1,236 unresolved issues
<https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC>
in the Spark project on JIRA.

Recently Updated Issues
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC>

Type         Key         Priority  Summary                                                           Last Updated
Bug          SPARK-4841  Major     Batch serializer bug in PySpark’s RDD.zip                         Dec 14, 2014
Question     SPARK-4810  Major     Failed to run collect                                             Dec 14, 2014
Bug          SPARK-785   Major     ClosureCleaner not invoked on most PairRDDFunctions               Dec 14, 2014
New Feature  SPARK-3405  Minor     EC2 cluster creation on VPC                                       Dec 13, 2014
Improvement  SPARK-1555  Minor     enable ec2/spark_ec2.py to stop/delete cluster non-interactively  Dec 13, 2014

Stale Issues
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20updated%20%3C%3D%20-90d%20ORDER%20BY%20updated%20ASC>

Type         Key         Priority  Summary                                                           Last Updated
Bug          SPARK-560   None      Specialize RDDs / iterators                                       Oct 22, 2012
New Feature  SPARK-540   None      Add API to customize in-memory representation of RDDs             Oct 22, 2012
Improvement  SPARK-573   None      Clarify semantics of the parallelized closures                    Oct 22, 2012
New Feature  SPARK-609   Minor     Add instructions for enabling Akka debug logging                  Nov 06, 2012
New Feature  SPARK-636   Major     Add mechanism to run system management/configuration tasks on all workers  Dec 17, 2012

Most Watched Issues
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20watchers%20DESC>

Type         Key         Priority  Summary                                                             Watchers
New Feature  SPARK-3561  Major     Allow for pluggable execution contexts in Spark                     75
New Feature  SPARK-2365  Major     Add IndexedRDD, an efficient updatable key-value store              33
Improvement  SPARK-2044  Major     Pluggable interface for shuffles                                    30
New Feature  SPARK-1405  Critical  parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib   26
New Feature  SPARK-1406  Major     PMML model evaluation support via MLib                              21

Most Voted Issues
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20votes%20DESC>

Type         Key         Priority  Summary                                                             Votes
Bug          SPARK-2541  Major     Standalone mode can’t access secure HDFS anymore                    12
New Feature  SPARK-2365  Major     Add IndexedRDD, an efficient updatable key-value store              9
Improvement  SPARK-3533  Major     Add saveAsTextFileByKey() method to RDDs                            8
Bug          SPARK-2883  Blocker   Spark Support for ORCFile format                                    6
New Feature  SPARK-1442  Major     Add Window function support                                         6
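
A report like this boils down to a few JQL queries. Here is a rough sketch of
pulling one section from the JIRA REST search API; the endpoint, JQL, and
field names follow standard JIRA REST conventions, but the script itself is
illustrative, not the tool that produced the report above:

    # Rough sketch: fetch the "Recently Updated Issues" section via JIRA's
    # REST search API. Illustrative only; adapt JQL per section.
    import requests

    SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
    JQL = "project = SPARK AND resolution = Unresolved ORDER BY updated DESC"

    resp = requests.get(SEARCH_URL, params={
        "jql": JQL,
        "maxResults": 5,
        "fields": "issuetype,priority,summary,updated",
    })
    resp.raise_for_status()

    print("%-12s %-11s %-9s %s" % ("Type", "Key", "Priority", "Summary"))
    for issue in resp.json()["issues"]:
        f = issue["fields"]
        priority = f["priority"]["name"] if f.get("priority") else "None"
        print("%-12s %-11s %-9s %s" % (
            f["issuetype"]["name"], issue["key"], priority, f["summary"]))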


Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-13 Thread Yana Kadiyska
Since you mentioned this, I had a related quandary recently -- the forum also
says that it archives "u...@spark.incubator.apache.org" /
"d...@spark.incubator.apache.org" respectively, yet the Community page clearly
says to email the @spark.apache.org lists (but the Nabble archive is linked
right there too). IMO even putting a clear explanation at the top, such as
"Posting here requires that you create an account via the UI. Your message
will be sent to both spark.incubator.apache.org and spark.apache.org" (if that
is the case; I'm not sure which alias Nabble posts get sent to), would make
things a lot clearer.

On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen  wrote:
>
> I've noticed that several users are attempting to post messages to Spark's
> user / dev mailing lists using the Nabble web UI (
> http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
> are many posts in Nabble that are not posted to the Apache lists and are
> flagged with "This post has NOT been accepted by the mailing list yet."
> errors.
>
> I suspect that the issue is that users are not completing the sign-up
> confirmation process (
> http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
> which is preventing their emails from being accepted by the mailing list.
>
> I wanted to mention this issue to the Spark community to see whether there
> are any good solutions to address this.  I have spoken to users who think
> that our mailing list is unresponsive / inactive because their un-posted
> messages haven't received any replies.
>
> - Josh
>


Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-13 Thread Josh Rosen
I've noticed that several users are attempting to post messages to Spark's
user / dev mailing lists using the Nabble web UI (
http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there are
many posts in Nabble that are not posted to the Apache lists and are
flagged with "This post has NOT been accepted by the mailing list yet."
errors.

I suspect that the issue is that users are not completing the sign-up
confirmation process (
http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
which is preventing their emails from being accepted by the mailing list.

I wanted to mention this issue to the Spark community to see whether there
are any good solutions to address this.  I have spoken to users who think
that our mailing list is unresponsive / inactive because their un-posted
messages haven't received any replies.

- Josh


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Sean McNamara
+1 tested on OS X and deployed+tested our apps via YARN into our staging 
cluster.

Sean


> On Dec 11, 2014, at 10:40 AM, Reynold Xin  wrote:
> 
> +1
> 
> Tested on OS X.
> 
> On Wednesday, December 10, 2014, Patrick Wendell  wrote:
> 
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.2.0!
>> 
>> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
>> 
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
>> 
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1055/
>> 
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
>> 
>> Please vote on releasing this package as Apache Spark 1.2.0!
>> 
>> The vote is open until Saturday, December 13, at 21:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Spark 1.2.0
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> 
>> == What justifies a -1 vote for this release? ==
>> This vote is happening relatively late into the QA period, so
>> -1 votes should only occur for significant regressions from
>> 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>> 
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>> 
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>> "hash".
>> 
>> == How does this differ from RC1 ==
>> This has fixes for a handful of issues identified - some of the
>> notable fixes are:
>> 
>> [Core]
>> SPARK-4498: Standalone Master can fail to recognize completed/failed
>> applications
>> 
>> [SQL]
>> SPARK-4552: Query for empty parquet table in spark sql hive get
>> IllegalArgumentException
>> SPARK-4753: Parquet2 does not prune based on OR filters on partition
>> columns
>> SPARK-4761: With JDBC server, set Kryo as default serializer and
>> disable reference tracking
>> SPARK-4785: When called with arguments referring column fields, PMOD
>> throws NPE
>> 
>> - Patrick
>> 





Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread slcclimber
I am building and testing using sbt.
Trying to run the tests, I get a lot of errors like:

  "Job aborted due to stage failure: Master removed our application: FAILED"
  did not contain "cancelled"

and

  "Job aborted due to stage failure: Master removed our application: FAILED"
  did not contain "killed"

(JobCancellationSuite.scala:236). I have never seen this before, so it is
concerning.

I was able to run all the Python examples for Spark and MLlib successfully.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-0-RC2-tp9713p9770.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Nick Pentreath
+1

—
Sent from Mailbox

On Sat, Dec 13, 2014 at 3:12 PM, GuoQiang Li  wrote:

> +1 (non-binding).  Tested on CentOS 6.4
> -- Original --
> From:  "Patrick Wendell";;
> Date:  Thu, Dec 11, 2014 05:08 AM
> To:  "dev@spark.apache.org";
> Subject:  [VOTE] Release Apache Spark 1.2.0 (RC2)
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.0!
> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1055/
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> Please vote on releasing this package as Apache Spark 1.2.0!
> The vote is open until Saturday, December 13, at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> == What justifies a -1 vote for this release? ==
> This vote is happening relatively late into the QA period, so
> -1 votes should only occur for significant regressions from
> 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".
> == How does this differ from RC1 ==
> This has fixes for a handful of issues identified - some of the
> notable fixes are:
> [Core]
> SPARK-4498: Standalone Master can fail to recognize completed/failed
> applications
> [SQL]
> SPARK-4552: Query for empty parquet table in spark sql hive get
> IllegalArgumentException
> SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
> SPARK-4761: With JDBC server, set Kryo as default serializer and
> disable reference tracking
> SPARK-4785: When called with arguments referring column fields, PMOD throws 
> NPE
> - Patrick

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread GuoQiang Li
+1 (non-binding).  Tested on CentOS 6.4


-- Original --
From:  "Patrick Wendell";;
Date:  Thu, Dec 11, 2014 05:08 AM
To:  "dev@spark.apache.org";


Subject:  [VOTE] Release Apache Spark 1.2.0 (RC2)



Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1055/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Saturday, December 13, at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening relatively late into the QA period, so
-1 votes should only occur for significant regressions from
1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of "spark.shuffle.blockTransferService" has been
changed to "netty"
--> Old behavior can be restored by switching to "nio"

2. The default value of "spark.shuffle.manager" has been changed to "sort".
--> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".
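
For anyone re-running benchmarks against the old defaults, a minimal PySpark
sketch follows; the property names come from the notes above, while the app
setup itself is illustrative:

    # Minimal sketch: running 1.2.0 RC2 with the Spark 1.1 shuffle defaults
    # restored. Property names are from the release notes; the app name is
    # illustrative.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("rc2-old-defaults")                      # illustrative
            .set("spark.shuffle.manager", "hash")                # 1.2.0 default: "sort"
            .set("spark.shuffle.blockTransferService", "nio"))   # 1.2.0 default: "netty"
    sc = SparkContext(conf=conf)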

== How does this differ from RC1 ==
This has fixes for a handful of issues identified - some of the
notable fixes are:

[Core]
SPARK-4498: Standalone Master can fail to recognize completed/failed
applications

[SQL]
SPARK-4552: Query for empty parquet table in spark sql hive get
IllegalArgumentException
SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
SPARK-4761: With JDBC server, set Kryo as default serializer and
disable reference tracking
SPARK-4785: When called with arguments referring column fields, PMOD throws NPE

- Patrick


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Tom Graves
+1 Built and tested on YARN on a Hadoop 2.x cluster.
Tom

On Saturday, December 13, 2014 12:48 AM, Denny Lee wrote:

+1 Tested on OS X

Tested Scala 2.10.3, Spark SQL with Hive 0.12 / Hadoop 2.5, Thrift Server,
MLlib SVD


On Fri Dec 12 2014 at 8:57:16 PM Mark Hamstra 
wrote:

> +1
>
> On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen  wrote:
> >
> > +1.  Tested using spark-perf and the Spark EC2 scripts.  I didn’t notice
> > any performance regressions that could not be attributed to changes of
> > default configurations.  To be more specific, when running Spark 1.2.0
> with
> > the Spark 1.1.0 settings of spark.shuffle.manager=hash and
> > spark.shuffle.blockTransferService=nio, there was no performance
> regression
> > and, in fact, there were significant performance improvements for some
> > workloads.
> >
> > In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort
> > and spark.shuffle.blockTransferService=netty.  With these new settings,
> I
> > noticed a performance regression in the scala-sort-by-key-int spark-perf
> > test.  However, Spark 1.1.0 and 1.1.1 exhibit a similar performance
> > regression for that same test when run with spark.shuffle.manager=sort,
> so
> > this regression seems explainable by the change of defaults.  Besides
> this,
> > most of the other tests ran at the same speeds or faster with the new
> 1.2.0
> > defaults.  Also, keep in mind that this is a somewhat artificial micro
> > benchmark; I have heard anecdotal reports from many users that their real
> > workloads have run faster with 1.2.0.
> >
> > Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2.
> >
> > - Josh
> >
> > On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com)
> > wrote:
> >
> > +1 (non-binding). Tested on Ubuntu against YARN.
> >
> > On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin 
> wrote:
> >
> > > +1
> > >
> > > Tested on OS X.
> > >
> > > On Wednesday, December 10, 2014, Patrick Wendell 
> > > wrote:
> > >
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > > 1.2.0!
> > > >
> > > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> > > >
> > > >
> > >
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1055/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 1.2.0!
> > > >
> > > > The vote is open until Saturday, December 13, at 21:00 UTC and passes
> > > > if a majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 1.2.0
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > >
> > > > == What justifies a -1 vote for this release? ==
> > > > This vote is happening relatively late into the QA period, so
> > > > -1 votes should only occur for significant regressions from
> > > > 1.0.2. Bugs already present in 1.1.X, minor
> > > > regressions, or bugs related to new features will not block this
> > > > release.
> > > >
> > > > == What default changes should I be aware of? ==
> > > > 1. The default value of "spark.shuffle.blockTransferService" has
> been
> > > > changed to "netty"
> > > > --> Old behavior can be restored by switching to "nio"
> > > >
> > > > 2. The default value of "spark.shuffle.manager" has been changed to
> > > "sort".
> > > > --> Old behavior can be restored by setting "spark.shuffle.manager"
> to
> > > > "hash".
> > > >
> > > > == How does this differ from RC1 ==
> > > > This has fixes for a handful of issues identified - some of the
> > > > notable fixes are:
> > > >
> > > > [Core]
> > > > SPARK-4498: Standalone Master can fail to recognize completed/failed
> > > > applications
> > > >
> > > > [SQL]
> > > > SPARK-4552: Query for empty parquet table in spark sql hive get
> > > > IllegalArgumentException
> > > > SPARK-4753: Parquet2 does not prune based on OR filters on partition
> > > > columns
> > > > SPARK-4761: With JDBC server, set Kryo as default serializer and
> > > > disable reference tracking
> > > > SPARK-4785: When called with arguments referring column fields, PMOD
> > > > throws NPE
> > > >
> > > > - Patrick
> > > >

Re: one hot encoding

2014-12-13 Thread Sandy Ryza
Hi Lochana,

We haven't yet added this in 1.2.
https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical
feature indexing, which one-hot encoding can be built on.
https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of
this prior to the ML pipelines work.

-Sandy
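
In the meantime it is straightforward to hand-roll. Below is a minimal sketch
against plain RDDs; the one_hot helper is hypothetical, not an MLlib API:

    # Minimal sketch of hand-rolled one-hot encoding for an RDD of category
    # labels, pending SPARK-4081 / SPARK-1216. Illustrative helper only.
    def one_hot(rdd):
        """Map an RDD of category labels to dense 0/1 indicator vectors."""
        categories = sorted(rdd.distinct().collect())      # assumes small cardinality
        index = {c: i for i, c in enumerate(categories)}   # category -> column
        n = len(categories)

        def encode(label):
            vec = [0.0] * n
            vec[index[label]] = 1.0
            return vec

        return rdd.map(encode)

    # Usage, assuming an existing SparkContext `sc`; column order follows the
    # sorted labels (blue, green, red):
    # encoded = one_hot(sc.parallelize(["red", "green", "blue", "red"]))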

On Fri, Dec 12, 2014 at 6:16 PM, Lochana Menikarachchi 
wrote:
>
> Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't
> available in 1.1.0.
> Thanks.
>
>
>