from:"Shannon Quinn"


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1540:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 1.0)
   0.10.1

 Reuters example for spectral clustering
 ---

 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1


 Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
 spectral clustering using the Reuters dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (MAHOUT-1538) Port spectral clustering to Mahout DSL


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1538:
--
Fix Version/s: (was: 0.10.0)
   0.10.1

 Port spectral clustering to Mahout DSL
 --

 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, Spark, scala
 Fix For: 0.10.1


 Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
 (already ported) and K-means (currently in progress, or can use Spark MLlib 
 implementation as a temporary fix).
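
 A rough sketch of where those two dependencies sit in the pipeline, assuming
 the usual normalized formulation of spectral clustering (the ticket does not
 spell it out). It is plain Python rather than Mahout DSL. Power iteration
 stands in for SSVD, a sign split stands in for k-means with k = 2, and every
 function name here is hypothetical.

```python
import math

def rbf_affinity(points, gamma=1.0):
    """Dense affinity matrix A[i][j] = exp(-gamma * ||p_i - p_j||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]

def normalize(affinity):
    """Return M = D^{-1/2} A D^{-1/2} and the degrees (row sums of A)."""
    deg = [sum(row) for row in affinity]
    n = len(affinity)
    m = [[affinity[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    return m, deg

def second_eigenvector(m, deg, iters=100):
    """Power iteration for the second eigenvector of M (stand-in for SSVD)."""
    n = len(m)
    total = math.sqrt(sum(deg))
    v1 = [math.sqrt(d) / total for d in deg]  # known top eigenvector of M
    x = [float(i + 1) for i in range(n)]      # arbitrary deterministic start
    for _ in range(iters):
        # Remove the component along v1 so the iteration finds the next one.
        dot = sum(a * b for a, b in zip(x, v1))
        x = [a - dot * b for a, b in zip(x, v1)]
        x = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in x))
        x = [a / length for a in x]
    return x

def two_way_spectral_clustering(points, gamma=1.0):
    m, deg = normalize(rbf_affinity(points, gamma))
    vec = second_eigenvector(m, deg)
    # Sign split stands in for k-means with k = 2.
    return [1 if value > 0 else 0 for value in vec]

points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
labels = two_way_spectral_clustering(points)
```

 The first three points end up with one label and the last three with the
 other. Which group gets 0 and which gets 1 depends on the sign of the
 eigenvector, so only the grouping is meaningful.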




[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1539:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 0.10.0)
   0.10.1

 Implement affinity matrix computation in Mahout DSL
 ---

 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1

 Attachments: ComputeAffinities.scala


 This has the same goal as MAHOUT-1506, but rather than code the pairwise 
 computations in MapReduce, this will be done in the Mahout DSL.
 An orthogonal issue is the format of the raw input (vectors, text, images, 
 SequenceFiles), and how the user specifies the distance equation and any 
 associated parameters.
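
 One possible shape for the "user specifies the distance equation and its
 parameters" part, sketched in plain Python rather than Mahout DSL. The
 function names (affinity_matrix, gaussian_affinity, within_radius) and their
 parameters are hypothetical and not taken from the attached
 ComputeAffinities.scala. The sketch also assumes the affinity is symmetric.

```python
import math
from functools import partial

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gaussian_affinity(x, y, sigma=1.0):
    """exp(-d^2 / (2 * sigma^2)), where d is the Euclidean distance."""
    d = euclidean(x, y)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def within_radius(x, y, radius=1.0):
    """1.0 if the two points are at most `radius` apart, else 0.0."""
    return 1.0 if euclidean(x, y) <= radius else 0.0

def affinity_matrix(points, affinity):
    """Fill a symmetric matrix by calling `affinity` once per unordered pair."""
    n = len(points)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            a[i][j] = a[j][i] = affinity(points[i], points[j])
    return a

points = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
# The caller picks the equation and binds its parameters up front.
gauss = affinity_matrix(points, partial(gaussian_affinity, sigma=5.0))
near = affinity_matrix(points, partial(within_radius, radius=2.0))
```

 Binding parameters with partial keeps affinity_matrix itself unaware of which
 equation is in use, which is one way to keep the input-format question
 separate from the pairwise computation.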




[jira] [Commented] (MAHOUT-1540) Reuters example for spectral clustering


[ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386041#comment-14386041
 ] 

Shannon Quinn commented on MAHOUT-1540:
---

Given that this issue has explicit dependencies on MAHOUT-1538, and Saikat is 
still working on MAHOUT-1539, I propose bumping this to 0.10.1.

Plus, I'll need some assistance from everyone in familiarizing myself with the 
process of converting the Reuters dataset to something I can compute affinities 
from to construct the similarity matrix.

 Reuters example for spectral clustering
 ---

 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1


 Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
 spectral clustering using the Reuters dataset.




[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy


[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386005#comment-14386005
 ] 

Shannon Quinn commented on MAHOUT-1659:
---

Pull request created: https://github.com/apache/mahout/pull/88

 Remove deprecated Lanczos solver from spectral clustering in mr-legacy
 --

 Key: MAHOUT-1659
 URL: https://issues.apache.org/jira/browse/MAHOUT-1659
 Project: Mahout
  Issue Type: Task
  Components: Clustering, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.10.0


 Spectral clustering still has the option of using either SSVD or the Lanczos 
 solver for dimensionality reduction. Remove the latter entirely.




Re: Mahout 0.10.0 Bug bash

2015-03-28 Thread Shannon Quinn

Wait, I thought all DSL work on spectral clustering was waiting until 0.10.1?

iPhone'd

 On Mar 28, 2015, at 13:49, Suneel Marthi suneel.mar...@gmail.com wrote:
 
 Seems like we are stretched pretty thin given the work load, not to mention
 that Mahout work is completely orthogonal to our paychecks.
 
 Ted, Grant, Shannon - possible you guys could take some of the load??
 
 On Sat, Mar 28, 2015 at 1:25 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Today's:
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1477: Clean up website on Logistic Regression
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Getting an exception when I provide classification labels manually
 for Naive Bayes
 M-1493: Port Naive Bayes to Spark DSL(Patch available)
 M-1559: Documentation and cleanup for Naive Bayes Example
 M-1609: NullPointerException
 M-1607: Spark-shell DAG scheduler
 
 Andrew Musselman
 -
 M-1655: Refactor module dependencies
 M-1522: Handle logging levels via log4j.xml
 M-1563: cleanup Warnings during Build
 M-1470: LDA Topic dump
 M-1462: Cleaning up Random Forests documentation on Mahout website
 
 Dmitriy Lyubimov
 --
 M-1646: Refactor out all legacy MR dependencies from scala code
 
 Frank Scholten
 -
 M-1649: Lucene 5 upgrade
 M-1625: lucene2seq: failure to convert a document that does not contain a
 field (the field is not required)
 
 Pat Ferrel
 -
 M-1589: mahout.cmd has duplicated content(Patch available)
 M-1618: co-occurence recommender example
 
 Suneel Marthi
 -
 M-1586: Collections downloads must have hash signatures
 M-1647: The release build is incomplete
 M-1652: Java 7 update
 M-1512: Hadoop 2 compatibility
 M-1469: Streaming KMeans fails when executed in MR mode and
 REDUCE_STREAMING_KMEANS
 set to true
 M-1443: Update How to Release page(Tagged 0.10.1)
 M-1585: Javadocs not hosted by Mahout-Quality
 M-1612: NPE during JSON outputformatter for clusterdump
 M-1656: Change SNAPSHOT version from 1.0 to 0.10
 M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
 M-1619: HighDFWordsPruner overwrites cache files
 
 Stevo Slavic
 
 M-1650: upgrade 3rd party jars
 M-1602: Euclidean Distance Similarity Math
 M-1278: Improve inheritance of apache parent pom
 M-1562: Publish Scaladocs
 M-1277: Lose dependency on custom commons-cli
 
 Shannon Quinn
 ---
 M-1538: Port spectral clustering to Mahout DSL
 M-1539: Implement affinity matrix computation in Mahout DSL
 M-1540: Reuters Example spectral clustering Also online docs for Spectral
 clustering
 M-1659: Remove deprecated Lanczos solver from spectral clustering in
 mr-legacy
 
 Ted Dunning
 ---
 M-1636: Class dependencies for Spark module are put in job.jar, which is
 inefficient
 
 Sebastian Schelter
 --
 M-1584: Create a detailed example of how to index an arbitrary dataset and
 run LDA on it(Patch available)
 
 Gokhan Capan
 --
 M-1626: Support for required quasi-algebraic operations and starting with
 aggregating rows/blocks
 
 Unassigned
 --
 M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
 available)
 M-1593: cluster-reuters.sh does not work complaining
 java.lang.IllegalStateException(Patch available)
 M-1557: Add support for sparse training vectors in MLP(Patch available)
 M-1516: run classify-20newsgroups.sh failed cause by
 /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
 available)
 M-1643: CLI arguments are not being processed in spark-shell
 M-1637: RecommenderJob of ALS fails in the mapper because it uses the
 instance of other class
 M-1634: ALS don't work when it adds new files in Distributed Cache
 (Patch available)
 M-1633: Failure to execute query when solr index contains documents with
 different fields
 M-1551: Add document to describe how to use mlp with command line(Patch
 available)
 
 On Thu, Mar 26, 2015 at 7:07 PM, Suneel Marthi suneel.mar...@gmail.com
 wrote:
 
 Ok here's the bug bash as of today
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Exception when providing classification Labels
 M-1493: Port Naive Bayes to Spark DSL
 M-1559: Documentation and cleanup for Naive Bayes Example
 M-1609: NullPointerException
 M-1607: Spark-shell DAG scheduler
 
 Andrew Musselman
 -
 M-1655: Refactor module dependencies
 M-1563: cleanup Warnings during Build
 M-1470: LDA Topic dump
 
 Dmitriy Lyubimov
 --
 M-1646: Refactor out all legacy MR dependencies from scala code
 
 Frank Scholten

Re: Mahout 0.10.0 Bug bash

2015-03-28 Thread Shannon Quinn

Ah no worries, just got a bit panicked when I saw that. 

Summer will be better for me but for now these tickets have about maxed me out; 
3 months into the new tenure-track shtick is grueling. 

iPhone'd

 On Mar 28, 2015, at 14:27, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Okay, go ahead and move it; I was just moving things from 1.0 to 0.10.0
 almost indiscriminately.
 
 On Sat, Mar 28, 2015 at 11:22 AM, Shannon Quinn squ...@gatech.edu wrote:
 
 Wait, I thought all DSL work on spectral clustering was waiting until
 0.10.1?
 
 iPhone'd
 

Re: Mahout 0.10.0 Bug bash

2015-03-27 Thread Shannon Quinn


Yes--removing the Lanczos solver from spectral clustering.

On 3/27/15 10:29 AM, Suneel Marthi wrote:

and this is for 0.10.0 ???

On Fri, Mar 27, 2015 at 10:27 AM, Shannon Quinn squ...@gatech.edu wrote:


Created M-1659 and assigned it to myself to reflect current work.

Shannon


On 3/26/15 10:07 PM, Suneel Marthi wrote:


Ok here's the bug bash as of today

Andrew Palumbo
--
M-1648: Update CMS for Mahout 0.10.0
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1564: Naive Bayes classifier for new Text Documents
M-1635: Exception when providing classification Labels
M-1493: Port Naive Bayes to Spark DSL
M-1559: Documentation and cleanup for Naive Bayes Example
M-1609: NullPointerException
M-1607: Spark-shell DAG scheduler

Andrew Musselman
-
M-1655: Refactor module dependencies
M-1563: cleanup Warnings during Build
M-1470: LDA Topic dump

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content
M-1618: co-occurence recommender example

Suneel Marthi
-
M-1586: Collections downloads must have hash signatures
M-1647: Release build
M-1652: Java 7 update
M-1512: Hadoop 2 compatibility
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1443: Update How to Release page
M-1585: Javadocs not hosted by Mahout-Quality
M-1612: NPE during JSON outputformatter for clusterdump

Stevo Slavic

M-1650: upgrade 3rd party jars
M-1602: Euclidean Distance Similarity Math
M-1278: Improve inheritance of apache parent pom

Shannon Quinn
---
M-1540: Reuters Example spectral clustering
Also online docs for Spectral clustering

Ted Dunning
---
M-1636: Class dependencies for Spark module are put in job.jar, which is
inefficient

[jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn (JIRA)

Shannon Quinn created MAHOUT-1659:
-

 Summary: Remove deprecated Lanczos solver from spectral clustering 
in mr-legacy
 Key: MAHOUT-1659
 URL: https://issues.apache.org/jira/browse/MAHOUT-1659
 Project: Mahout
  Issue Type: Task
  Components: Clustering, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the Lanczos 
solver for dimensionality reduction. Remove the latter entirely.




Re: Mahout 0.10.0 Bug bash

2015-03-27 Thread Shannon Quinn


Created M-1659 and assigned it to myself to reflect current work.

Shannon


Re: [jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn


Are the slides from these talks going to be posted somewhere?

On 3/27/15 1:10 PM, Suneel Marthi wrote:

Different Topic: There's a talk this afternoon by Cloudera's Data Scientist
at MLconf NYC about Mahout's LanczosSolver, SSVD and MLlib SSVD.

See http://mlconf.com/mlconf-2015-nyc/

Here we have been talking about purging Mahout's LanczosSolver for 2 years now.
Seems like the talk will be about the old MapReduce based SSVD and
LanczosSolver while we have the new non-MR distributed SSVD stuff. I hope I
am wrong here but we will see.

On Fri, Mar 27, 2015 at 1:02 PM, Shannon Quinn squ...@gatech.edu wrote:


Honestly not sure, as I haven't had a chance to play around with the scala
dsl much yet. Suneel suggested we save that for 0.10.1.


On 3/27/15 12:00 PM, Dmitriy Lyubimov wrote:


Shannon,

How difficult would it be to port spectral clustering to our scala alg and
math? We have ssvd there as well.
On Mar 27, 2015 7:26 AM, Shannon Quinn (JIRA) j...@apache.org wrote:

  Shannon Quinn created MAHOUT-1659:

-

   Summary: Remove deprecated Lanczos solver from spectral
clustering in mr-legacy
   Key: MAHOUT-1659
   URL: https://issues.apache.org/jira/browse/MAHOUT-1659
   Project: Mahout
Issue Type: Task
Components: Clustering, mrlegacy
  Affects Versions: 0.9
  Reporter: Shannon Quinn
  Assignee: Shannon Quinn
  Priority: Minor
   Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the
Lanczos solver for dimensionality reduction. Remove the latter entirely.




Re: [jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn

Honestly not sure, as I haven't had a chance to play around with the 
scala dsl much yet. Suneel suggested we save that for 0.10.1.


On 3/27/15 12:00 PM, Dmitriy Lyubimov wrote:

Shannon,

How difficult would it be to port spectral clustering to our scala alg and
math? We have ssvd there as well.
On Mar 27, 2015 7:26 AM, Shannon Quinn (JIRA) j...@apache.org wrote:


Shannon Quinn created MAHOUT-1659:
-

  Summary: Remove deprecated Lanczos solver from spectral
clustering in mr-legacy
  Key: MAHOUT-1659
  URL: https://issues.apache.org/jira/browse/MAHOUT-1659
  Project: Mahout
   Issue Type: Task
   Components: Clustering, mrlegacy
 Affects Versions: 0.9
 Reporter: Shannon Quinn
 Assignee: Shannon Quinn
 Priority: Minor
  Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the
Lanczos solver for dimensionality reduction. Remove the latter entirely.




Re: 0.10 release Hangout

2015-03-23 Thread Shannon Quinn

Will be teaching until 9:30 PT, at which point I have another meeting 
until 11. Would love to get a summary of the meeting; also happy to help 
with some of the tasks.


Shannon

On 3/23/15 3:56 PM, Andrew Musselman wrote:

We'll be getting on a Google Hangout tomorrow, Tuesday, from 9-11 a.m.
Pacific, to work through open questions for what should be in the release,
go through Jira, and do some delegation of tasks.

Here's the Hangout URL
https://plus.google.com/hangouts/_/calendar/YW5kcmV3Lm11c3NlbG1hbkBnbWFpbC5jb20.glvu1gfv3kvj5241n9bsg3clrc

See you then!

Re: Release

2015-03-17 Thread Shannon Quinn


+1

On 3/17/15 8:19 PM, Andrew Musselman wrote:

How about 0.10 is the first block and 0.10.1 is the second?

On Wed, Mar 18, 2015 at 1:12 AM, Andrew Palumbo ap@outlook.com wrote:


I like this timeline... though mid April is coming up quickly.. Going back
to Pat's list for 0.10.0:

1) refactor mrlegacy out of scala deps.
2) build fixes for release.
3) docs: might be good to guinea-pig the new CMS with git pubsub so we
don’t have to do svn, not sure when that will be ready

I would add:

4) Fix any remaining legacy bugs.
5) docs, docs, docs


along with just some general cleanup.

Is anything else missing?




On 03/17/2015 07:16 PM, Andrew Musselman wrote:


I'm good with that timing pending scope..

On Wed, Mar 18, 2015 at 12:13 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

  i was thinking 0.10.0 mid-april, update 0.10.1 end of spring.

   i would suggest feature extraction topics for 0.11.x. Esp. w.r.t.
SchemaRDD aka DataFrame -- vectorizing, hashing, ML schema support,
imputation of missing data, outlier cleanups etc. There's a lot.

Hardware backs integration -- i will certainly be looking at those,
but perhaps the easiest is to start with automatic detection and
configuration of capabilities via netlib, since it is already in the
path and it seems likely that it will (eventually) support cuda as
well in some form. This is for 0.11 or 0.12.x, depends on
availability.

Higher order methods are somewhat a matter of inspiration. I think i
could offer some stuff there too as I already have implemented a lot
of those on top of Mahout before. I did bayesian optimization (aka
spearmint, GP-EI etc.) on Mahout algebra, line search, (L)bfgs,
stats including Gaussian Process support. BFGS and line search are
fairly simple methods and i will give a reference if anybody is
interested. Breeze also has line search with strong Wolfe
conditions (if a coded reference is needed). All that is up for grabs
as a fairly well understood subject.

(5-6 months out) Once GP-EI is available, it becomes a fairly
interesting topic to resurrect the implicit feedback issue. An important
insight there is that feature encoding can in fact be done by a custom
scheme (not necessarily the encoding scheme from the paper, which
actually has 2 of them, or the way MLlib encodes it).
Once custom encoding schemes are adjusted, using bayesian optimization
is increasingly important, especially if there are more than just 2
parameters there.

Re: Release

2015-03-17 Thread Shannon Quinn

I think we need a better idea of what the release will contain, then we 
can start narrowing the range of possible release dates.


If we take what Pat outlined, an April release might be somewhat 
ambitious but we probably wouldn't miss by much.


On 3/17/15 11:51 AM, David Starina wrote:

Hi guys,

Do you have any specific release date in mind? Guys at Bigtop are planning
an april release, is there any chance there will be a Hadoop 2.x compatible
Mahout release by then to be included with Bigtop?



On Sunday, March 15, 2015, Pat Ferrel p...@occamsmachete.com wrote:


Lots of discussion off the record about doing a release but shouldn’t we
plan this?

What has to be in a release of Mahout 0.10?

Seems like we could release as-is but it would be nice to have some of the
already completed work that isn’t committed yet:
* mrlegacy refactored out of scala, is it possible to get this in Dmitriy?

One question is how to package, with which version of Spark. There is a
bug in Spark 1.2.1 and I think in 1.2 (this is the big distro build) that
requires any class that uses the JavaSerializer to set a specific SparkConf
key/value to point to the guava jar on all workers. This only affects
IndexedDatasets since they use Guava’s BiMap. Rumor has it that 1.3 fixes
this but I haven’t tried it yet.

So we are currently stuck on 1.1.1 but could document how to work around
this to use 1.2 for a user who wants to build Mahout from scratch. A user
source build on 1.3 may not require a workaround. We seem to be good on
hadoop 2.x, which in itself is a good reason to release since 0.9 was not.

What else needs to be done:
* rename module math-scala to core?
* create the distribution build. Currently this does not publish the
scaladocs and does not create artifacts for H2O or Scala.
* is H2O really in a form to publish?

Docs
* IMO we should name the Mahout Spark-Scala DSL and shell. More unique
names are easier to find in searches. Maybe Suneel can polish off his
sanskrit and suggest something.
* we should be ready to do some work here to restructure the CMS since it
is very 0.9 centric with Scala stuff almost an afterthought.

Re: Codebase refactoring proposal

2015-01-23 Thread Shannon Quinn

Also +1

iPhone'd

 On Jan 23, 2015, at 18:38, Andrew Palumbo ap@outlook.com wrote:
 
 +1
 
 
 Sent from my Verizon Wireless 4G LTE smartphone
 
 Original message
 From: Dmitriy Lyubimov dlie...@gmail.com
 Date: 01/23/2015 6:06 PM (GMT-05:00)
 To: dev@mahout.apache.org
 Subject: Codebase refactoring proposal
 
 So right now mahout-spark depends on mr-legacy.
 I did quick refactoring and it turns out it only _irrevocably_ depends on
 the following classes there:
 
 MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
 *sigh* o.a.m.common.Pair
 
 So I just dropped those five classes into a new tiny mahout-hadoop
 module (to signify stuff that is directly relevant to serializing things to
 DFS API) and completely removed mrlegacy and its transients from spark and
 spark-shell dependencies.
 
 So non-cli applications (shell scripts and embedded api use) actually only
 need spark dependencies (which come from SPARK_HOME classpath, of course)
 and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
 optionally mahout-spark-shell (for running shell)).
 
 This of course still doesn't address driver problems that want to throw
 more stuff into front-end classpath (such as cli parser) but at least it
 renders transitive luggage of mr-legacy (and the size of worker-shipped
 jars) much more tolerable.
 
 How does that sound?

Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Shannon Quinn


Saikat,

Spark has the cartesian() method that will align all pairs of points; 
that's the nontrivial part of determining an RBF kernel. After that it's 
a simple matter of performing the equation that's given on the 
scikit-learn doc page.


However, like you said it'll also have to be implemented using the 
Mahout DSL. I can envision that users would like to compute pairwise 
metrics for a lot more than just RBF kernels (pairwise Euclidean 
distance, etc), so my guess would be a DSL implementation of cartesian() 
is what you're looking for. You can build the other methods on top of that.
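
A small sketch of that idea in plain Python (not Spark or the Mahout DSL):
itertools.product plays the role of cartesian() by yielding every pair of
points, and the kernel is then applied pair by pair. The function names are
made up for illustration. The kernel is exp(-gamma * ||x - y||^2), with gamma
defaulting to 1 / n_features as the scikit-learn page documents.

```python
import math
from itertools import product

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def pairwise_rbf(xs, ys, gamma=None):
    """Kernel matrix K with K[i][j] = rbf(xs[i], ys[j], gamma)."""
    if gamma is None:
        gamma = 1.0 / len(xs[0])  # 1 / n_features
    k = [[0.0] * len(ys) for _ in xs]
    # product() yields every (indexed x, indexed y) pair, like cartesian().
    for (i, x), (j, y) in product(enumerate(xs), enumerate(ys)):
        k[i][j] = rbf(x, y, gamma)
    return k

X = [[0.0, 0.0], [1.0, 0.0]]
Y = [[0.0, 0.0], [0.0, 2.0]]
K = pairwise_rbf(X, Y, gamma=0.5)
```

Swapping rbf for another two-argument function (pairwise Euclidean distance,
for example) leaves the pairing step untouched, which is the argument for
building the other metrics on top of a cartesian() primitive.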


Correct me if I'm wrong.

Shannon

On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
I need to implement the above in the scala world and expose a DSL API to call 
the computation when computing the affinity matrix.


From: ted.dunn...@gmail.com
Date: Thu, 18 Sep 2014 10:04:34 -0700
Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
shapes
To: dev@mahout.apache.org

There are a number of non-traditional linear algebra operations like this
that are important to implement.

Can you describe what you intend to do so that we can discuss the shape of
the API and computation?



On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com
wrote:


Dmitriy et al,

As part of the above JIRA I need to calculate the gaussian kernel between 2
shapes. I looked through mahout-math-scala and didn't see anything to do
this. Any objections to me adding some code under scalabindings to do this?

Thanks in advance.

Re: Affinity matrix computation

2014-09-13 Thread Shannon Quinn

Since it's an input processing method--rather than strictly an algorithm 
in the category of SVMs, K-means, etc--and since you're early in the 
development cycle, wherever makes it easiest is probably best for now. 
We can always merge it elsewhere once you're ready to submit a PR.


Of course someone please correct me if I'm mistaken.

On 9/13/14, 1:14 PM, Saikat Kanjilal wrote:

Hi Committers,

I'm beginning some work on the affinity matrix computation in mahout-dsl. I
was wondering where in the directory structure I should put this effort. Are
we placing all our algorithms in mahout-dsl in a specific area?

Thanks in advance.

Re: Upgrade to spark 1.0.x

2014-08-08 Thread Shannon Quinn


+1

On 8/8/14, 3:58 PM, Suneel Marthi wrote:

+1


On Fri, Aug 8, 2014 at 3:48 PM, Ted Dunning ted.dunn...@gmail.com wrote:


+1 to merge




On Fri, Aug 8, 2014 at 12:36 PM, Gokhan Capan gkhn...@gmail.com wrote:


+1 to merging spark-1.0.x to master

Sent from my iPhone


On Aug 8, 2014, at 22:06, Dmitriy Lyubimov dlie...@gmail.com wrote:

Current master is still at Spark 0.9.x. MAHOUT-1603 (PR #40) is making a
number of valuable tweaks to enable Spark 1.0.x (and Spark SQL code, by
extension; I did a quick test, and SQL seems to work for my simple tests in
the Mahout environment).

This squashed PR is pushed to apache/mahout branch spark-1.0.x rather than
master. Whenever (if) folks are ready, I can merge it to master.

An alternative approach would be to maintain both 1.0.x and 0.9.x branches
for some time. I don't see that as valuable, as the costs would likely
outrun any benefit here, but if anyone still clings to the Spark 0.9.x
dependency, please let me know in this thread.

thanks.
-d

Re: Git Migration

2014-05-22 Thread Shannon Quinn


Works for me.

Shannon

On 5/22/14, 3:45 PM, Gokhan Capan wrote:

Works for me as well

Gokhan


On Thu, May 22, 2014 at 9:23 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


Thanks; I just pushed successfully.


On Thu, May 22, 2014 at 10:55 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Did you read Jake's email earlier in the dev/infra discussion? He describes
this and gives references.

It is two-fold: first, we can push whatever commits to master of
https://git-wip-us.apache.org/repos/asf?p=mahout.git

However, the other side of the coin is that significant commits should go
through pull requests directly to (if I understand it correctly) the
apache/mahout mirror on GitHub. Such pull requests are managed through
commits to git-wip-us as well, by specific messages (again, see references
in Jake's email). My understanding is that GitHub integration features are
not yet enabled; only commits to master of git-wip-us.apache.org are at this
point.

At this point I would simply like everyone to verify they can push commits
to the master branch of git-wip-us.apache.org per the instructions in INFRA-
and report back there (I can push).

I guess someone (perhaps me) will have to write the manual for working with
GitHub pull requests (mainly, merging them to git-wip-us.apache.org and
closing them).


On Thu, May 22, 2014 at 10:47 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


What's the workflow to commit a change?  I'm totally in the dark about
that.


On Thu, May 22, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Hi,

(1) git migration of the project is now complete. Any volunteers to verify
per INFRA-? If you do, please report back to the issue.

(2) Anybody knows what to do with jenkins now? i still don't have proper
privileges on it. thanks.



[1] https://issues.apache.org/jira/browse/INFRA-

Re: Proposal for additional features in Mahout (Minkowski Distance, Mahalanobis Distance and K-nearest neighbor classifier)

2014-05-18 Thread Shannon Quinn


Hi Arunav,

Contributions are certainly welcome. If you can post a patch on JIRA ( 
https://issues.apache.org/jira/browse/MAHOUT ), we can have a look at 
it. I don't know if you've been monitoring our mailing lists or have 
otherwise heard, but Mahout is no longer accepting new MapReduce code. 
We're still in discussions regarding the next-generation Mahout 
backends, but we're moving instead towards engine-agnostic (e.g. Mahout 
DSL, see http://mahout.apache.org/users/sparkbindings/home.html ) 
implementations.


As for Minkowski distance, I'm not sure if someone else is working on 
it, but as I mentioned you're welcome to post a patch and we can discuss 
it from there. Thanks!


Shannon

On 5/18/14, 1:29 PM, Arunav Sanyal wrote:

Hi

I am new to apache mahout and would like to contribute in whatever humble
way I can.

I see that the Vector class in Apache Mahout does not have the
functionality of Minkowski distance.

http://en.wikipedia.org/wiki/Minkowski_distance

is a distance metric which generalizes distance measures between any two
vectors. It can represent Manhattan distance (Hamming distance on binary
vectors) or Euclidean distance, depending on its order parameter. I already
have a simple solution ready for review if this is approved. Similarly I am
working on the more generic Mahalanobis distance measure.
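
For readers following the proposal, here is a minimal sketch of the Minkowski
distance of order p, d(a, b) = (sum_i |a_i - b_i|^p)^(1/p). It is plain Python
and is not written against Mahout's Vector class; the function name
`minkowski_distance` and the list-based vectors are assumptions made for
illustration, and this is not the patch mentioned in the message.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if p < 1:
        # For p < 1 the triangle inequality fails, so it is not a metric.
        raise ValueError("p must be >= 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


u = [0.0, 3.0]
v = [4.0, 0.0]
manhattan = minkowski_distance(u, v, 1)   # p = 1: sum of absolute differences
euclidean = minkowski_distance(u, v, 2)   # p = 2: straight-line distance

# On 0/1 vectors, p = 1 counts the positions that differ (Hamming distance).
hamming = minkowski_distance([1, 0, 1, 1], [0, 0, 1, 0], 1)
```

With p = 1 the example gives the Manhattan distance and with p = 2 the
Euclidean distance, which is the sense in which one parameter covers several
familiar measures.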

My primary motive for introducing these distance measures is to come up
with a generic implementation of the K-nearest neighbor classifier (not to
be confused with K-means clustering). I will be working on that as well shortly.

If somebody else is working towards these features, I would like to
collaborate and donate whatever code patches that they deem necessary. If
not, I humbly request that the community approve these for inclusion into
apache mahout.


Yours sincerely
Arunav Sanyal

Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-16 Thread Shannon Quinn

+1

iPhone'd

 On May 16, 2014, at 14:46, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 +1
 
 
 On Fri, May 16, 2014 at 11:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 Hi,
 
 I would like to initiate a procedural vote moving to git as our primary
 commit system, and using github PRs as described in Jake Farrell's email to
 @dev [1]
 
 [1]
 
 https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
 
 If voting succeeds, i will file a ticket with infra to commence necessary
 changes and to move our project to git-wp as primary source for commits as
 well as add github integration features [1]. (I assume pure git commits
 will be required after that's done, with no svn commits allowed).
 
 The motivation is to engage GIT and github PR features as described, and
 avoid git mirror history messes like we've seen associated with authors.txt
 file fluctuations.
 
 PMC and committers have binding votes, so please vote. Lazy consensus with
 minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
 for weekend (i.e. Tuesday afternoon PST) .
 
 here is my +1
 
 -d

Re: consensus statement?

2014-05-06 Thread Shannon Quinn

+1

iPhone'd

 On May 6, 2014, at 12:23, Ted Dunning ted.dunn...@gmail.com wrote:
 
 I have been involved in side conversations to try to build a bit of unity
 among our community and would like to propose this as a statement of what
 we are doing:
 
 
 Apache Mahout is moving immediately to a faster execution model. The first
 of these is Spark. Outside contributions are always encouraged.
 
 
 As a bit of commentary, it is clear that what the committers are working on
 is Spark and it is clear that Spark will be the first new platform for
 Mahout.  It is also clear that there are non-committers (the 0xdata crew
 for one) who are working with the community to extend Mahout beyond just
 Spark.  As a statement of where the community is *right* now, however, I
 don't think we need to say much more than that we encourage contributions.
 
 Sound fair?  Correct?

[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-03 Thread Shannon Quinn (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988666#comment-13988666
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

If no one has any objections in the next couple of days, I can close this 
ticket.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (MAHOUT-1538) Port spectral clustering to Mahout DSL

Shannon Quinn created MAHOUT-1538:
-

 Summary: Port spectral clustering to Mahout DSL
 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
(already ported) and K-means (currently in progress, or can use Spark MLlib 
implementation as a temporary fix).




[jira] [Created] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

Shannon Quinn created MAHOUT-1539:
-

 Summary: Implement affinity matrix computation in Mahout DSL
 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


This has the same goal as MAHOUT-1506 
(https://issues.apache.org/jira/browse/MAHOUT-1506), but rather than code the 
pairwise computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.




[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1539:
--

Description: 
This has the same goal as MAHOUT-1506, but rather than code the pairwise 
computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.

  was:
This has the same goal as MAHOUT-1506 
(https://issues.apache.org/jira/browse/MAHOUT-1506), but rather than code the 
pairwise computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.


 Implement affinity matrix computation in Mahout DSL
 ---

 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 This has the same goal as MAHOUT-1506, but rather than code the pairwise 
 computations in MapReduce, this will be done in the Mahout DSL.
 An orthogonal issue is the format of the raw input (vectors, text, images, 
 SequenceFiles), and how the user specifies the distance equation and any 
 associated parameters.




[jira] [Created] (MAHOUT-1540) Reuters example for spectral clustering

Shannon Quinn created MAHOUT-1540:
-

 Summary: Reuters example for spectral clustering
 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
spectral clustering using the Reuters dataset.




[jira] [Updated] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1441:
--

Attachment: MAHOUT-1441.diff

Update on the documentation. It specifies a brief overview of spectral 
clustering theory (with a link to further reading), a guide for how to run the 
algorithm in Mahout, and a small toy example. Also linked are the outstanding 
issues for improving the algorithm and what those changes will be.

Ready to commit.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.




[jira] [Commented] (MAHOUT-1538) Port spectral clustering to Mahout DSL


[ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988137#comment-13988137
 ] 

Shannon Quinn commented on MAHOUT-1538:
---

That's fine, though until k-means is fully ported in Mahout this will remain 
incomplete. I was thinking of Spark as more of a drop-in temp replacement until 
the former is complete (unless it already is and I missed it?).

 Port spectral clustering to Mahout DSL
 --

 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
 (already ported) and K-means (currently in progress, or can use Spark MLlib 
 implementation as a temporary fix).




[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website


[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988167#comment-13988167
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

Published the new content. All seems well except for the inline latex; what's 
the correct syntax?

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.




[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website


[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988190#comment-13988190
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

Yep, realized that a few moments ago. Thanks, both of you. It should be good 
now.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.




[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-01 Thread Shannon Quinn (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986977#comment-13986977
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

It's in progress. Is there a deadline for this? I was hoping to finish it next 
week.

However I do have a couple of questions. Obviously the Eigencuts docs will be 
stripped out entirely, but there are still other components that need to be 
added for the full pipeline to function: a DSL-based affinity matrix input, and 
a working example on the Reuters dataset. Should these items be completed 
*first*, or should I just leave notes in the documentation to JIRA tickets for 
these issues? If the latter, the documentation just needs some basic cleaning 
up and can be done pretty quickly, albeit without specifics on how aspects of 
it actually work in practice. If the former, I'll need a little more time to 
port the algorithm to Mahout DSL.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.




Re: CMS still not working

2014-04-25 Thread Shannon Quinn

Broken for me on Chrome in OS X. Noticed the mathjax was also broken on 
other pages (e.g. Spark & Scala) on the same environment.


On 4/25/14, 12:06 PM, Andrew Musselman wrote:

Broken for me in Chrome on Ubuntu.


On Fri, Apr 25, 2014 at 9:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:


Hm... mathjax doesn't render for me correctly . Is it just me or this is
also now broken? https://mahout.apache.org/users/dim-reduction/ssvd.html


On Fri, Apr 25, 2014 at 1:40 AM, Sebastian Schelter s...@apache.org
wrote:


Fyi: filed a ticket with infra as our CMS is still not working...

https://issues.apache.org/jira/browse/INFRA-7628

Re: CMS still not working

2014-04-25 Thread Shannon Quinn


Hmm. Here's what I see on Naive Bayes:

https://dl.dropboxusercontent.com/u/1377610/nb.png

Here's what I see on SSVD (under *https*):

https://dl.dropboxusercontent.com/u/1377610/ssvd_https.png

And here's SSVD under *http*. Looks fine! (NB looks the same either way 
for me, though)


https://dl.dropboxusercontent.com/u/1377610/ssvd_http.png

Chrome on OS X.

On 4/25/14, 12:28 PM, Suneel Marthi wrote:

I remember something like that, obviously this issue is only with Naive
Bayes page. You could compare NAive Bayes with SSVD to see what's missing.



On Fri, Apr 25, 2014 at 12:24 PM, ap.dev ap@outlook.com wrote:


@dimitri you said something once about having to double escape Mathjax
formatted lines.  I didn't do this in the markdown editor I was using for
the Naive Bayes page.  Maybe that has something to do with it?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Suneel Marthi smar...@apache.org
Date:04/25/2014  12:20 PM  (GMT-05:00)
To: mahout dev@mahout.apache.org
Subject: Re: CMS still not working

SSVD page renders fine and so do others except for Naive Bayes (on MacOS
with all browsers - Chrome, Safari, Firefox, Opera).

It couldn't be a mathjax issue, some weird tag or something on Naive Bayes
page??


On Fri, Apr 25, 2014 at 12:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

it's strange. ubuntu is all i ever used and I swear it was working just
last week. i wonder if mathjax guys did something that broke it, perhaps in
the light of recent heartbleed bugs. javascript seems to be in place.


On Fri, Apr 25, 2014 at 9:09 AM, ap.dev ap@outlook.com wrote:


Mathjax formatting looks good on Firefox from a windows machine for scala
spark bindings page.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Andrew Musselman andrew.mussel...@gmail.com
Date:04/25/2014  12:06 PM  (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: CMS still not working

Broken for me in Chrome on Ubuntu.


On Fri, Apr 25, 2014 at 9:02 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:


Hm... mathjax doesn't render for me correctly . Is it just me or this is
also now broken?

https://mahout.apache.org/users/dim-reduction/ssvd.html


On Fri, Apr 25, 2014 at 1:40 AM, Sebastian Schelter s...@apache.org
wrote:


Fyi: filed a ticket with infra as our CMS is still not working...

https://issues.apache.org/jira/browse/INFRA-7628

Re: Welcome Pat Ferrel as new committer on Mahout

2014-04-24 Thread Shannon Quinn

Congratulations Pat! Been enjoying your discussions so far. Looking 
forward to working with you.


On 4/24/14, 6:22 AM, Frank Scholten wrote:

Congratulations Pat! :-)

On Apr 24, 2014, at 12:19, Sebastian Schelter s...@apache.org wrote:


Hi,

this is to announce that the Project Management Committee (PMC) for Apache 
Mahout has asked Pat Ferrel to become committer and we are pleased to announce 
that he has accepted.

Being a committer enables easier contribution to the project since in addition 
to posting patches on JIRA it also gives write access to the code repository. 
That also means that now we have yet another person who can commit patches 
submitted by others to our repo *wink*

Pat, we look forward to working with you in the future. Welcome! It would be 
great if you could introduce yourself with a few words.

-s

[jira] [Commented] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-18 Thread Shannon Quinn (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974008#comment-13974008
 ] 

Shannon Quinn commented on MAHOUT-1506:
---

That's fine. This still needs to get done but I'll open up another ticket 
specifying scala DSL instead.

 Creation of affinity matrix for spectral clustering
 ---

 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 I wanted to get this discussion going, since I think this is a critical 
 blocker for any kind of documentation update on spectral clustering (I can't 
 update the documentation until the algorithm is useful, and it won't be 
 useful until there's a built-in method for converting raw data to an affinity 
 matrix).
 Namely, I'm wondering what kind of raw data should this algorithm be 
 expecting (anything that k-means expects, basically?), and what are the data 
 structures associated with this? I've created a proof-of-concept for how 
 pairwise affinity generation could work.
 https://github.com/magsol/Hadoop-Affinity
 It's a two-step job, but if the data structures in the input data format 
 provides 1) the total number of data points, and 2) for each data point to 
 know its index in the overall set, then the first job can be scrapped 
 entirely and affinity generation will consist of 1 MR task.
 (discussions on Spark / H2O pending, of course)
 Mainly this is an engineering problem at this point. Let me know your 
 thoughts and I'll get this done (I'm out of town the next 10 days for my 
 wedding/honeymoon, will get to this on my return).




[jira] [Created] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-03 Thread Shannon Quinn (JIRA)

Shannon Quinn created MAHOUT-1506:
-

 Summary: Creation of affinity matrix for spectral clustering
 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn


I wanted to get this discussion going, since I think this is a critical blocker 
for any kind of documentation update on spectral clustering (I can't update the 
documentation until the algorithm is useful, and it won't be useful until 
there's a built-in method for converting raw data to an affinity matrix).

Namely, I'm wondering what kind of raw data should this algorithm be 
expecting (anything that k-means expects, basically?), and what are the data 
structures associated with this? I've created a proof-of-concept for how 
pairwise affinity generation could work.

https://github.com/magsol/Hadoop-Affinity

It's a two-step job, but if the data structures in the input data format 
provides 1) the total number of data points, and 2) for each data point to know 
its index in the overall set, then the first job can be scrapped entirely and 
affinity generation will consist of 1 MR task.
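
The single-job idea can be sketched as follows, assuming each point already
carries its index and the total count n is known up front. The Gaussian (RBF)
kernel, the `sigma` parameter, the zero diagonal, and all function names are
illustrative assumptions; this is not the code in the linked proof-of-concept,
and the shuffle is simulated in memory rather than run on Hadoop.

```python
import math
from collections import defaultdict


def map_point(index, vector, n):
    """Map step: emit the point once for every pair (i, j) it belongs to."""
    for other in range(n):
        if other != index:
            key = (min(index, other), max(index, other))
            yield key, (index, vector)


def reduce_pair(key, values, sigma=1.0):
    """Reduce step: both endpoints of a pair arrive together."""
    (_, x), (_, y) = values
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return key, math.exp(-sq_dist / (2.0 * sigma ** 2))


def affinity_matrix(points, sigma=1.0):
    n = len(points)
    grouped = defaultdict(list)            # stands in for the shuffle phase
    for i, vec in enumerate(points):
        for key, value in map_point(i, vec, n):
            grouped[key].append(value)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal left at 0 (assumption)
    for key, values in grouped.items():
        (i, j), affinity = reduce_pair(key, values, sigma)
        matrix[i][j] = matrix[j][i] = affinity   # affinity is symmetric
    return matrix


A = affinity_matrix([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

Because every mapper can build the pair keys from its own index and n alone,
no preliminary counting or indexing pass is needed in this sketch. The cost is
that each point is emitted n - 1 times, so the shuffle grows quadratically
with the number of points.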

(discussions on Spark / H2O pending, of course)

Mainly this is an engineering problem at this point. Let me know your thoughts 
and I'll get this done (I'm out of town the next 10 days for my 
wedding/honeymoon, will get to this on my return).




[jira] [Assigned] (MAHOUT-1473) Cleanup website on Spectral Clustering

2014-03-22 Thread Shannon Quinn (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn reassigned MAHOUT-1473:
-

Assignee: Shannon Quinn

 Cleanup website on Spectral Clustering
 --

 Key: MAHOUT-1473
 URL: https://issues.apache.org/jira/browse/MAHOUT-1473
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation
Reporter: Sebastian Schelter
Assignee: Shannon Quinn
 Fix For: 1.0


 The website on spectral clustering needs clean up. We need to go through the 
 text, remove dead links and check whether the information is still consistent 
 with the current code.
 https://mahout.apache.org/users/clustering/spectral-clustering.html




[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-03-09 Thread Shannon Quinn (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925272#comment-13925272
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

The experiment section of that paper would be fairly straightforward to 
reproduce, and I do agree that we should do that. However, the advantage with 
the reuters dataset is that most of the other algorithms use this as well as an 
example of how the algorithm works in the first place, e.g. comparing one to 
another with the same dataset. My impression is that whether or not the 
algorithm is well-suited to the reuters dataset, though certainly important, is 
secondary to being able to compare multiple Mahout algorithms with the same 
dataset. The hard part with spectral clustering is designing the initial 
affinity matrix from the reuters data.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.




Re: Mahout 0.9 Release

2014-01-29 Thread Shannon Quinn


LGTM

On 1/29/14, 4:27 PM, peng wrote:

+1, can't see a bad side.

On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote:

+1 from me





On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter 
s...@apache.org wrote:


+1


On 01/29/2014 05:25 AM, Andrew Musselman wrote:

Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com 
wrote:



a), b), c), d) all passed here.

CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were
within the range [0,1].
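
As a small illustration of why that range is expected: assuming cosine
distance is defined as 1 minus cosine similarity, two vectors with only
non-negative components (as TF-IDF weights typically are) always land in
[0, 1], while vectors with mixed signs can reach 2. The sketch below is plain
Python with a hypothetical `cosine_distance` helper; it is not Mahout's
CosineDistanceMeasure.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; undefined when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # no shared terms
parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])     # same direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])    # needs a negative entry
```

Floating-point rounding can push the result for parallel vectors a hair below
zero, so a range check like the one in the release test is best done with a
small tolerance.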


Date: Tue, 28 Jan 2014 16:45:42 -0800
From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: u...@mahout.apache.org; dev@mahout.apache.org

Fixed the issues that were reported with Clustering code this past week,
upgraded codebase to Lucene 4.6.1 that was released today.


Here's the URL for the 0.9 release in staging:-

https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/ 



The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script.


Need a minimum of 3 '+1' votes from PMC for the release to be finalized.

Re: cluster-reuters.sh broken in trunk

2014-01-24 Thread Shannon Quinn

Does Mahout still support Hadoop 0.20.2x? I know we had some discussions on 
this but I can't find them at the moment. 

iPhone'd

 On Jan 24, 2014, at 16:43, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 I assume u r running this in MR mode??  Could u clear up your 
 /tmp/mahout-work- folder and try again.
 
 
 
 
 On Friday, January 24, 2014 1:56 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Actually, getting the same error with a fresh svn checkout:
 
 14/01/24 09:42:13 INFO driver.MahoutDriver: Program took 291353 ms
 (Minutes: 4.8558834)
 Running on hadoop, using /home/akm/hadoop-0.20.205.0/bin/hadoop and
 HADOOP_CONF_DIR=
 MAHOUT-JOB:
 /home/akm/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
 14/01/24 09:42:16 INFO common.AbstractJob: Command line arguments:
 {--clustering=null,
 --clusters=[/tmp/mahout-work-akm/reuters-kmeans-clusters],
 --convergenceDelta=[0.5],
 --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
 --endPhase=[2147483647],
 --input=[/tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/],
 --maxIter=[10], --method=[mapreduce], --numClusters=[20],
 --output=[/tmp/mahout-work-akm/reuters-kmeans], --overwrite=null,
 --startPhase=[0], --tempDir=[temp]}
 14/01/24 09:42:17 INFO common.HadoopUtil: Deleting
 /tmp/mahout-work-akm/reuters-kmeans-clusters
 14/01/24 09:42:17 WARN util.NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/01/24 09:42:17 INFO compress.CodecPool: Got brand-new compressor
 14/01/24 09:42:17 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to
 /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed
 14/01/24 09:42:17 INFO kmeans.KMeansDriver: Input:
 /tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors
 Clusters In: /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed
 Out: /tmp/mahout-work-akm/reuters-kmeans Distance:
 org.apache.mahout.common.distance.CosineDistanceMeasure
 14/01/24 09:42:17 INFO kmeans.KMeansDriver: convergence: 0.5 max
 Iterations: 10
 14/01/24 09:42:17 INFO compress.CodecPool: Got brand-new decompressor
 Exception in thread "main" java.lang.IllegalStateException: No input
 clusters found in
 /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed. Check your -c
 argument.
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:212)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
 org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 
 
 
 
 On Fri, Jan 24, 2014 at 10:07 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Yeah, disregard, my repo was out of whack.
 
 
 On Fri, Jan 24, 2014 at 10:00 AM, ap.dev ap@outlook.com wrote:
 
 I'm not getting any exceptions there.
 
  Original message 
 From: Andrew Musselman andrew.mussel...@gmail.com
 Date:01/24/2014  11:38 AM  (GMT-05:00)
 To: dev@mahout.apache.org
 Subject: cluster-reuters.sh broken in trunk
 
 Last night I had this issue when testing out cluster-reuters.sh with no
 flags; anyone seen this recently?
 
 14/01/23 22:03:54 INFO driver.MahoutDriver: Program took 286799 ms
 (Minutes: 4.7799833)
 Running on hadoop, using /home/akm/hadoop-0.20.205.0/bin/hadoop and
 HADOOP_CONF_DIR=
 MAHOUT-JOB:
 /home/akm/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
 14/01/23 22:03:57 INFO common.AbstractJob: Command line arguments:
 {--clustering=null,
 --clusters=[/tmp/mahout-work-akm/reuters-kmeans-clusters],
 --convergenceDelta=[0.5],
 
 --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
 --endPhase=[2147483647],
 
 --input=[/tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/],
 --maxIter=[10], --method=[mapreduce], --numClusters=[20],
 --output=[/tmp/mahout-work-akm/reuters-kmeans], --overwrite=null,
 --startPhase=[0], --tempDir=[temp]}
 14/01/23

Re: MAHOUT 0.9 Release - New URL

2014-01-16 Thread Shannon Quinn

a), b), and c) all pass for me. Don't have the setup yet at work to go 
through d), will wait for others to verify.


On 1/16/14, 9:41 AM, Suneel Marthi wrote:

Third time's a Charm!!!


Here's the new URL for Mahout 0.9 Release:
https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/

For those volunteering to test this, some of the things to be verified:

a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through 
all the different options in each script.
  


Committers and PMC members:
---

Need 'at least 3 +1 votes' for the Release to pass.


Thanks and Regards.

Re: MAHOUT 0.9 Release - New URL

2014-01-16 Thread Shannon Quinn


OS X 10.9.1, java version 1.6.0_65.

On 1/16/14, 10:41 AM, Sergey Svinarchuk wrote:

I tested mahout 0.9 on Ubuntu 12.04 64bit, java version 1.6.0_27

a) Verify that u can unpack the release (tar or zip) - passed
b) Verify u r able to compile the distro - passed
c)  Run through the unit tests: mvn clean test -passed
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script. - will update later


On Thu, Jan 16, 2014 at 5:35 PM, Sotiris Salloumis i...@eprice.gr wrote:


Hi Suneel,

Below first round of tests,

Environment: SMP Debian 3.2.51-1 x86_64
Machine: Intel(R) Core(TM) i7 CPU 950  @ 3.07GHz stepping 05 12GB
RAM
OpenJDK: javac 1.6.0_27

a) Verify that u can unpack the release (tar or zip)  [ Passed: tar -zxvf ]
b) Verify u r able to compile the distro  [ Passed: with OpenJDK and the latest
Maven on the latest Debian ]
c)  Run through the unit tests: mvn clean test [ Passed: 370 milliseconds]

d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script. [Ongoing will update
later]

Regards
Sotiris

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Thursday, January 16, 2014 4:41 PM
To: u...@mahout.apache.org; mahout
Subject: MAHOUT 0.9 Release - New URL

Third time's a Charm!!!


Here's the new URL for Mahout 0.9 Release:

https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/

For those volunteering to test this, some of the things to be verified:

a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script.


Committers and PMC members:
---

Need 'at least 3 +1 votes' for the Release to pass.


Thanks and Regards.

Re: Mahout 0.9 release

2013-11-28 Thread Shannon Quinn

I'll aim to get the documentation on spectral clustering done by 0.9, and the 
code fixes and improvements in for 1.0.

iPhone'd

 On Nov 28, 2013, at 12:15, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Yes, let's defer the arbitrary properties to next release.
 
 
 
 
 
 On Thursday, November 28, 2013 11:02 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Was going to open M-1030 this weekend; I think doing the quick fix can be 
 done in time and the more involved job of putting arbitrary properties on 
 vectors should be pushed to 1.0.
 
 Sound reasonable?
 
 
 
 On Thu, Nov 28, 2013 at 7:58 AM, Suneel Marthi suneel_mar...@yahoo.com 
 wrote:
 
 Forgot to add 
 
 
 M-1288 Solr Recommender - Pat Ferrell
 
 to my earlier email.
 
 
 
 
 On Thursday, November 28, 2013 10:38 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 Adding Mahout-1349 to the list of JIRAs .
 
 
 
 
 
 On Thursday, November 28, 2013 10:37 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 Update on Open JIRAs for 0.9:
 
 Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - all 
 related to Wiki updates, please see Isabel's updates.
 
 M-1286 - Peng and Sebastian, we had talked about this during the last hangout. 
 Can this be included in 0.9?
 
 M-1030 - Andrew Musselman, it's critical that we get this into 0.9; it's been 
 deferred for the last 2 Mahout releases.
 
 M-1319, M-1328, M-1347, M-1350 - Suneel
 
 
 M-1265 - Multi Layer Perceptron, Yexi please look at my comments on 
 Reviewboard.
 
 M-1273 - Kun Yung, Ted, defer this to next release ???
 
 
 
 M-1312, M-1256 - Stevo, could u take one of them
 
 
 On Thursday, November 28, 2013 5:01 AM, Isabel Drost-Fromm 
 isa...@apache.org wrote:
 
 On Wed, 27 Nov 2013 14:23:11 -0800 (PST)
 Suneel Marthi suneel_mar...@yahoo.com wrote:
 Below are the Open issues for 0.9:-
 
 This looks like we should be targeting Dec. 9th as code freeze to me.
 What do you all think?
 
 
 Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - All
 related to Wiki updates, missing Wiki documentation and Wiki
 migration to new CMS.  Isabel's working on M-1245 (migrating to new
 CMS). Could some of the others be consolidated with that?
 
 I believe MAHOUT-1245 essentially is ready to be published - all I want
 before notifying INFRA to switch to the new cms based site is one other
 person to take at least a brief look.
 
 For MAHOUT-1304 - Sebastian, can you please check that the cms based
 site actually does fit on 1280px? We can close this issue then.
 
 MAHOUT-1305 - I think this should be turned into a task to actually
 delete most of the pages that have been migrated to the new CMS (almost
 all of them). Once 1245 is shipped, it would be great if a few more
 people could lend a hand in getting this done.
 
 MAHOUT-1307 - Can be closed once switched to CMS
 
 MAHOUT-1326 - This really relates to the old Confluence export plugin
 we once used to generate static pages out of our wiki, which
 is no longer active. Unless anyone on the Mahout dev list knows how to
 fully delete all exported static pages we should file an issue with
 INFRA to ask for help getting those deleted. They definitely are
 confusing to users.
 
 
 
 M-1286 - Peng and ssc, we had talked about this during the last
 hangout. Can this be included in 0.9?
 
 M-1030 - Andrew Musselman? Any updates on this? It's important that we
 fix this for 0.9
 
 M-1319, M-1328, M-1347, M-1364 - Suneel
 
 M-1273 - Kun Yung, remember talking about this in one of the earlier
 hangouts; can't recall what was decided?
 
 M-1312, M-1256 - Dan Filimon (or Stevo??)
 
 M-996  someone could pick this up (if it's still relevant with the present
 codebase, that is)
 
 I think this can move to the next release - according to the
 contributor and Sebastian the patch is rather hacky and there for
 illustration purposes only. I'd rather see some more thought go into
 that instead of pushing to have this in 0.9.
 
 
 M-1265 Yexi had submitted a patch for this, it would be good if this
 could go in as part of 0.9 
 
 M-1288 Solr Recommender - Pat Ferrell
 
 M-1285: Any takers for this?
 
 Would be nice to have - in particular if someone on dev@ (not
 necessarily a committer) wants to get started with the code base.
 Otherwise I'd say fix for next release if time gets short.
 
 
 M-1356: Isabel's started on this, Stevo could u review this?
 
 We definitely can punt that for the next release or even thereafter. It
 would be great if someone who has some knowledge of Java security
 policies would take a look. The implication of not fixing this
 essentially is that in case someone commits test code that writes
 outside of target or to some globally shared directory we might end up
 having randomly failing tests due to the parallel setup again. But as
 these will occur shortly after the commit it should be easy enough to
 find the code change that caused the breakage.
 
 
 
 M-1329: Support for Hadoop 2
 
 Is that truly

Re: Mahout 0.9 release

2013-11-28 Thread Shannon Quinn

Possibly. I'll know more after Monday (got a few big deadlines then). 

iPhone'd

 On Nov 28, 2013, at 13:32, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Shannon,
 
 Would it be possible to add Spectral clustering to 
 examples/bin/cluster-reuters.sh (for 0.9)?
 
 
 
 
 
 
 On Thursday, November 28, 2013 12:59 PM, Shannon Quinn squ...@gatech.edu 
 wrote:
 
 I'll aim to get the documentation on spectral clustering done by 0.9, and the 
 code fixes and improvements in for 1.0.
 
 iPhone'd
 
 

Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-21 Thread Shannon Quinn


Excellent. My todo list, then:

1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create 
a JIRA for it)


I have a question for #3 that can be a separate thread; mainly, what are 
the primary input formats I should be concerned with processing?


On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:

On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
Suneel Marthi suneel_mar...@yahoo.com wrote:


We are missing wiki docs for both Streaming kmeans and Spectral clustering.

I can pull something together for streaming kmeans.

Speaking of which we need to add a wiki page for Ted's t-digest once we figure 
out how it plays into Mahout (maybe as a measure of Streaming kmeans 
clustering, Ted??).

Given that we are in the process of migrating substantial parts of our wiki to 
the main website soon to be hosted in Apache CMS it would be great if you could 
add your content there. See also MAHOUT-1245 and 
http://markmail.org/thread/5ixlclhlh3acgcoq for some details.

Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-21 Thread Shannon Quinn


That also gives me at least one answer for #3 :)

On 11/21/13, 4:03 PM, Suneel Marthi wrote:

On #2, it would be good if you could add Spectral KMeans to 
examples/bin/cluster-reuters.sh to process the Reuters dataset.





On Thursday, November 21, 2013 3:50 PM, Shannon Quinn squ...@gatech.edu wrote:
  
Excellent. My todo list, then:


1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create
a JIRA for it)

I have a question for #3 that can be a separate thread; mainly, what are
the primary input formats I should be concerned with processing?


On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:

On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
Suneel Marthi suneel_mar...@yahoo.com wrote:


We are missing wiki docs for both Streaming kmeans and Spectral clustering.

I can pull something together for streaming kmeans.

Speaking of which we need to add a wiki page for Ted's t-digest once we figure 
out how it plays into Mahout (maybe as a measure of Streaming kmeans 
clustering, Ted??).

Given that we are in the process of migrating substantial parts of our wiki to 
the main website soon to be hosted in Apache CMS it would be great if you could 
add your content there. See also MAHOUT-1245 and 
http://markmail.org/thread/5ixlclhlh3acgcoq for some details.

Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-20 Thread Shannon Quinn

Right; I won't propose its re-integration until I'm confident it works 
as advertised. I'm referring to the vanilla spectral clustering that's 
still in Mahout.


An example sounds good, will do.

On 11/20/13, 4:29 PM, Suneel Marthi wrote:

Shannon,

Eigencuts has been deprecated and removed from the present codebase. Do we need 
to revert that?

On Spectral clustering, please do add an example to 
examples/bin/cluster-reuters.sh.





On Wednesday, November 20, 2013 4:05 PM, Shannon Quinn squ...@gatech.edu 
wrote:
  
On that note, I wanted to ask: what does everyone feel needs to be done
to make the standard spectral clustering robust enough to be considered
a core algorithm? For me the biggest item was to have a job that
computes the pairwise similarities required (I've recently started
this), and I'd love to know what sort of input formats it should support
for conversion to a similarity matrix. Is there anything else?

Eigencuts is another matter; I'm working on streamlining the data
structures to make that more efficient.


 Original Message 
Subject: Re: Mahout 0.9 release
Date: Wed, 20 Nov 2013 21:39:18 +0100
From: Isabel Drost-Fromm isa...@apache.org
Reply-To: dev@mahout.apache.org
To: dev@mahout.apache.org



On Wed, 20 Nov 2013 10:32:42 -0800 (PST)
Suneel Marthi suneel_mar...@yahoo.com wrote:


We are presently targeting 0.9 for Dec 9.

Speaking of which: Any helping hand (be it on fixing issues, reviewing patches, 
adding to the documentation) is highly welcome to make this happen! If you are 
unsure what tasks exactly the project urgently needs help with do not be afraid 
to ask on the mailing list.


Isabel

Re: Eigencuts version of spectral clustering

2013-09-04 Thread Shannon Quinn

Eigencuts was removed from 0.8. The fixed version was never released due to 
the bottleneck you described.

Off the books, it's still a work in progress, but I won't be petitioning the 
PMC to put it back in until it scales properly. 

iPhone'd

On Sep 4, 2013, at 16:10, Andrew Musselman andrew.mussel...@gmail.com wrote:

 Looks like this is finished as of May of this year, but is there still the
 bottleneck performance issue with it?  I.e., is it useful in production?
 
 Thanks
 Andrew

Re: You are invited to Apache Mahout meet-up

2013-08-22 Thread Shannon Quinn


I'm only sorry I'm not in the Bay area. Sounds great!

On 8/22/13 3:38 AM, Stevo Slavić wrote:

Retweeted meetup invite. Have fun!

Kind regards,
Stevo Slavic.


On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com wrote:


Very cool.

Would love to see folks turn out for this.


On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman
b.ellen.fried...@gmail.com wrote:


The Apache Mahout user group has been re-activated. If you are in the Bay
Area in California, join us on Aug 27 (Redwood City).

Sebastian Schelter will be the main speaker, talking about new directions
with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there
to do a short introduction for the meet-up and update on the 0.8 release.

Here's the link to rsvp: http://bit.ly/16K32hg

Hope you can come, and please spread the word.

Ellen

Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn


Meant to add: my vote is also for bi-weekly

On 6/12/13 7:26 AM, Grant Ingersoll wrote:

Hi,

One of the things we kicked around at Buzzwords was having a 
weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with 
good success, I believe).  Since we are so spread out, I thought I would throw 
out a Doodle (scheduling tool for those unfamiliar) to see what times work best 
for the majority of people interested in such a thing.  Anyone is free to 
participate, but this is not a Q and A session, but is instead focused on 
writing code, fixing bugs, triaging JIRA, releasing, etc.

If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8  
(note, all times are Eastern Time Zone since I did the poll!)  I just grabbed a 
sampling of hours throughout the day.  I also picked 1 week as being 
representative of this being on a repeating schedule.  If none of the times 
work for you, but you are still interested, please respond here.  I would 
imagine we would meet for 1-2 hours.

Also, please reply with the frequency at which you would like to meet:

[]  Weekly
[]  Bi-weekly (every 2 weeks)
[]  Monthly

My vote is every two weeks.

-Grant

Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn


Angel and Suneel, you may want to re-fill out the new doodle.

FYI, this week won't be representative of my schedule; I'm in the last 
few weeks of a job at ORNL where I travel every weekend. Normally I'll 
have more flexibility than just 6pm on weeknights.


On 6/12/13 8:26 AM, Grant Ingersoll wrote:

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:


+1, awesome idea

One question: the poll, while set to GMT -5, does say it's in Central Time. Is 
this a daylight savings thing?

I turned on Time Zone support, so not sure how it will look to others, but it 
sounds like it adjusts based on your location...  I see: 8 am, 10, 1, so on.

I also realize, that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv

Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn

We have a good spread of people filling out both versions of the Doodle 
:) Here's the one Grant said is the correct one:


http://doodle.com/ymqaiwbh7khisnyv

On 6/12/13 1:44 PM, Andrew Musselman wrote:

Bi-weekly is good for me; I'm in Seattle and just filled out the poll.

Great idea!


On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:


+1, am in Seattle as well and would love to attend and be involved.

Sent from my iPhone

On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
wrote:


Good idea on recurring meetings. I'm very interested in participating.
Biweekly works for me. I'm in Seattle (Pacific) timezone - GMT-8.

An agenda for the meetings ahead of time will help us get the most of our
time at the meetings.

Thanks.
On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote:


On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote:


Angel and Suneel, you may want to re-fill out the new doodle.

FYI, this week won't be representative of my schedule; I'm in the last
few weeks of a job at ORNL where I travel every weekend. Normally I'll have
more flexibility than just 6pm on weeknights.

Yeah, Doodle makes you pick dates, but I just want it to be representative of
a week-long period of time and not tied to a specific set of dates.  So,
just put in what your ideal times are in general and ignore the fact that
it is set to next week.


On 6/12/13 8:26 AM, Grant Ingersoll wrote:

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:


+1, awesome idea

One question: the poll, while set to GMT -5, does say it's in Central
Time. Is this a daylight savings thing?

I turned on Time Zone support, so not sure how it will look to others,
but it sounds like it adjusts based on your location...  I see: 8 am, 10,
1, so on.

I also realize, that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv


Grant Ingersoll | @gsingers
http://www.lucidworks.com

Re: 0.8 progress

2013-06-09 Thread Shannon Quinn


M-1250 lgtm.

On 6/9/13 4:58 PM, Grant Ingersoll wrote:

7 issues remaining:

M-833 -- Suneel
M-975 -- Ted
M-1030 -- Suneel
M-1067 -- Dmitriy  --  This is an enhancement, should we push?
M-1147 -- Jake
M-1233 -- Yannis (Grant?)
M-1250 -- Sebastian (but all of us should chime in)

In theory, 833 and 1067 can be pushed, but I think all others are blockers.

-Grant


On Jun 9, 2013, at 8:51 AM, Grant Ingersoll gsing...@apache.org wrote:


I'm on M-1211 and 1247 (M-992 is related)  Will be on IRC for a few hours this 
morning.

-Grant

On Jun 9, 2013, at 1:48 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:


Working on M-833.

From: Suneel Marthi suneel_mar...@yahoo.com
To: dev@mahout.apache.org dev@mahout.apache.org
Sent: Saturday, June 8, 2013 6:09 PM
Subject: Re: 0.8 progress

I will be looking at M-833 and M-1030 tonight.

I can get the initial limited functionality for M-884 as part of 0.8 release by 
tomorrow. Thanks to Robin for reviewing.







From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Saturday, June 8, 2013 5:09 PM
Subject: Re: 0.8 progress


I've got 1103 and 1126 close to done.  Should be in by tomorrow.

On Jun 8, 2013, at 4:18 PM, Robin Anil robin.a...@gmail.com wrote:


Down to 15.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sat, Jun 8, 2013 at 12:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:


I am done with M-1026.





From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Saturday, June 8, 2013 10:42 AM
Subject: Re: 0.8 progress


Hmm, JIRA seems to be down...

1084 is in.  I'm pretty close to being done on 1103.

I'm on #mahout on Freenode if anyone wants to coordinate, and will be
there for the next 1 hour or so.

On Jun 8, 2013, at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote:


We are down to 18 issues!  Let's keep cranking.

I'm working on 1103 and 1084 at the moment.

On Jun 6, 2013, at 12:00 PM, Grant Ingersoll gsing...@apache.org wrote:

On Jun 6, 2013, at 12:12 PM, Sebastian Schelter ssc.o...@googlemail.com wrote:

Hi Grant,

Here's my take:

Will/Must be finished:
M-944 [include]

^ Committed.


M-958 [include]
M-975 [include]
M-1084 [include]
M-1098 [include]
M-1103 [include]
M-1126 [push if no one steps up]
M-1147 [include]
M-1211 [push if no one steps up]
M-1233 [push if no one steps up]
M-1241 [include]

Can be pushed if no one steps up:
M-627 [push if no one steps up]
M-833 [push if no one steps up]
M-1163 [push if no one steps up]
M-1164 [push if no one steps up]
M-1243 [include]
M-992 [include]

^ Working on this now.


M-996 [push if no one steps up]
M-1067 [include]

Unsure:
M-974 [push if no one steps up]
M-1026 [push if no one steps up]
M-1030 [unsure]


On 06.06.2013 11:26, Grant Ingersoll wrote:

Working from the link below, we are down to 22 issues.



https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Here's my opinion (and only my opinion, please vote, change as you
see fit) based on a cursory glance of the state of these as to what needs
to be in the release and what can be pushed:

Will/Must be finished:
M-944
M-958
M-975
M-1084
M-1098
M-1103
M-1126
M-1147
M-1211
M-1233
M-1241

Can be pushed if no one steps up:
M-627
M-833
M-1163
M-1164
M-1243
M-992
M-996
M-1067

Unsure:
M-974
M-1026
M-1030



Grant Ingersoll | @gsingers
http://www.lucidworks.com








Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Shannon Quinn




Clustering

- Fuzzy k-Means o.a.m.clustering.fuzzykmeans
- Spectral k-Means in o.a.m.clustering.spectral

-1 on spectral being dropped as that seems to receive decent traction.

Agreed, given recent activity in particular. However I would put forth 
deprecating Eigencuts (o.a.m.clustering.eigencuts) until such time that 
it can be made scalable.

Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Shannon Quinn

Sorry, that's o.a.m.clustering.spectral.eigencuts. Then move the .kmeans 
package to simply be o.a.m.clustering.spectral.


On 6/8/13 1:37 PM, Shannon Quinn wrote:



Clustering

- Fuzzy k-Means o.a.m.clustering.fuzzykmeans
- Spectral k-Means in o.a.m.clustering.spectral

-1 on spectral being dropped as that seems to receive decent traction.
Agreed, given recent activity in particular. However I would put forth 
deprecating Eigencuts (o.a.m.clustering.eigencuts) until such time 
that it can be made scalable.

[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Shannon Quinn (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675836#comment-13675836
]

Shannon Quinn commented on MAHOUT-1214:
---

@Yiqun: I would suggest making this as general as possible. Don't confine it to
just spectral k-means. Submit a patch and we can look it over.

@Grant: Unless the patch came in today, I don't think we could have it ready
for inclusion in 0.8.

Improve the accuracy of the Spectral KMeans Method
--

Key: MAHOUT-1214
URL: https://issues.apache.org/jira/browse/MAHOUT-1214
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.7
Environment: Mahout 0.7
Reporter: Yiqun Hu
Labels: clustering, improvement
Fix For: Backlog

The current implementation of the spectral KMeans algorithm (Andrew Ng et al.,
NIPS 2002) in version 0.7 has two serious issues. These two incorrect
implementations make it fail even for a very obvious trivial dataset. We have
implemented a solution to resolve these two issues and hope to contribute it
back to the community.
# Issue 1:
The EigenVerificationJob in version 0.7 does not check the orthogonality of
eigenvectors, which is necessary to obtain the correct clustering results for
the case of K > 1. We have an idea and implementation to select eigenvectors
based on cosAngle/orthogonality.
# Issue 2:
The random seed initialization of the KMeans algorithm is not optimal, and
sometimes a bad initialization will generate a wrong clustering result. In this
case, the selected K eigenvectors actually provide a better way to initialize
cluster centroids because each selected eigenvector is a relaxed indicator of
the memberships of one cluster. For every selected eigenvector, we use the
data point whose eigen component achieves the maximum absolute value.
We have already verified our improvement on a synthetic dataset, and it shows
that the improved version gets the optimal clustering result while the current
0.7 version obtains the wrong result.
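
The two ideas in this description can be sketched roughly as follows. This is
only an illustration, not the reporter's patch and not Mahout code: the function
names, the greedy selection strategy, and the tolerance value are all
assumptions made for the example, and the eigenvectors are plain Python lists.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors (0.0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_orthogonal(candidates, k, tol=1e-6):
    """Hypothetical helper: greedily keep up to k candidate eigenvectors
    that are mutually orthogonal within tol (assumed strategy)."""
    chosen = []
    for vec in candidates:
        if all(abs(cosine(vec, kept)) <= tol for kept in chosen):
            chosen.append(vec)
        if len(chosen) == k:
            break
    return chosen

def initial_centroid_rows(eigenvectors):
    """Hypothetical helper: for each selected eigenvector, return the index
    of the data point whose component has the largest absolute value."""
    return [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in eigenvectors]

# Toy input: the second candidate is nearly parallel to the first and is skipped.
candidates = [
    [0.8, 0.6, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, -0.6, 0.8],
]
selected = select_orthogonal(candidates, k=2)
seeds = initial_centroid_rows(selected)
print(selected)
print(seeds)
```

Each index in seeds names the data point that would serve as the starting
centroid for one cluster, replacing a random seed.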

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Shannon Quinn (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675902#comment-13675902
]

Shannon Quinn commented on MAHOUT-1214:
---

Developing a better input format for spectral kmeans has been on my to-do list
ever since writing the algorithm. Unfortunately, to handle any sort of raw data
format, it requires n^2 pairwise comparisons which is not trivial in a Hadoop
setting. [1] describes various methods of achieving an efficient MapReduce
implementation for computing the affinity matrix. This is ultimately the route
we should go, ideally creating it as a separate job with tunable parameters
that spectral kmeans will invoke.

In the meantime, we can probably put a check in the job that reads the affinity
matrix to find zeros and ignore them.

[1] http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5444877
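
To make the n^2 cost and the zero check concrete, here is a single-machine
sketch. It is not the MapReduce job discussed above and not Mahout's API: the
Gaussian kernel, the sigma and cutoff parameters, and the function name are
assumptions chosen for illustration.

```python
import math

def affinity_entries(points, sigma=1.0, cutoff=1e-3):
    """Hypothetical helper: yield (i, j, affinity) for every pair i < j,
    skipping entries whose affinity is effectively zero."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist_sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            # Gaussian (RBF) kernel, assumed here as the similarity measure.
            value = math.exp(-dist_sq / (2.0 * sigma ** 2))
            if value > cutoff:
                yield i, j, value

# Three points give 3 pairwise comparisons; only the close pair survives.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
entries = list(affinity_entries(points))
print(entries)
```

The nested loop visits n(n-1)/2 pairs, which is the part that needs care in a
distributed setting; dropping near-zero entries keeps the resulting matrix
sparse.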

Improve the accuracy of the Spectral KMeans Method
--

[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-23 Thread Shannon Quinn (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665190#comment-13665190
 ] 

Shannon Quinn commented on MAHOUT-1214:
---

This all looks great. With the work I did on Eigencuts this semester, there are 
some optimizations in the data structures I'd like to test that might further 
help spectral kmeans' performance, in addition to looking into ball kmeans, 
streaming kmeans, and SSVD.

I still have a question for Yiqun: if you've implemented an orthogonality check 
in EigenVerificationJob, how is this not something that can be applied to 
EigenVerificationJob in general, as opposed to only spectral kmeans?

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
  Labels: clustering, improvement

 The current implementation of the spectral KMeans algorithm (Andrew Ng et al.,
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect
 implementations make it fail even for a very obvious trivial dataset. We have
 implemented a solution to resolve these two issues and hope to contribute
 back to the community.
 # Issue 1:
 The EigenVerificationJob in version 0.7 does not check the orthogonality of
 eigenvectors, which is necessary to obtain the correct clustering results for
 the case of K>1; we have an idea and implementation to select based on
 cosAngle/orthogonality;
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and
 sometimes a bad initialization will generate a wrong clustering result. In this
 case, the selected K eigenvectors actually provide a better way to initialize
 cluster centroids because each selected eigenvector is a relaxed indicator of
 the memberships of one cluster. For every selected eigenvector, we use the
 data point whose eigen component achieves the maximum absolute value.
 We have already verified our improvement on a synthetic dataset, and it shows
 that the improved version gets the optimal clustering result while the current
 0.7 version obtains the wrong result.
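
The seeding idea in Issue 2 can be sketched in a few lines of Python. This is
only an illustration of the rule as stated (one data point per selected
eigenvector, chosen by largest absolute component); the function name and the
toy values are hypothetical, and it is not the reporter's patch.

```python
def seed_indices(eigenvectors):
    """For each eigenvector, return the index of its largest-magnitude entry.

    `eigenvectors` is a list of K vectors, each with one component per
    data point; the returned indices name the data points to use as the
    K initial centroids.
    """
    seeds = []
    for vec in eigenvectors:
        best = max(range(len(vec)), key=lambda i: abs(vec[i]))
        seeds.append(best)
    return seeds


# Two toy "relaxed indicator" vectors over five data points (made-up values).
eigenvectors = [
    [0.9, 0.8, 0.1, 0.0, -0.1],
    [0.0, 0.1, -0.7, -0.95, 0.6],
]
seeds = seed_indices(eigenvectors)
```

Note that this sketch does not guard against two eigenvectors selecting the
same data point; a real implementation would need to decide how to handle that.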


[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-23 Thread Shannon Quinn (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665203#comment-13665203
]

Shannon Quinn commented on MAHOUT-1177:
---

Yu Lee and Yexi: For the time being, I'd be on board with shelving the addition
of any new clustering algorithms, and instead focusing on improving
documentation and unifying the APIs for the existing ones. I think that would
help scope your work a little more effectively, while still providing an
extremely valuable body of work. Plus, it would greatly aid the development of
new algorithms to have a specific interface to build into. Beyond that, I think
your ideas are good and would encourage you to start laying out your specific
plans.

Ravi: I would suggest browsing the open JIRAs for Mahout and submitting a patch
for one you think you can tackle. Please feel free to ping our email list if
you have specific questions, though for general ones please submit them to the
list rather than on JIRA.

GSOC 2013: Reform and simplify the clustering APIs
--

Key: MAHOUT-1177
URL: https://issues.apache.org/jira/browse/MAHOUT-1177
Project: Mahout
Issue Type: Improvement
Reporter: Dan Filimon
Labels: gsoc2013, mentor

Clustering is one of the most used features in Mahout and has many
applications [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
We have lots of clustering algorithms. There's:
- basic k-means
- canopy clustering
- Dirichlet clustering
- Fuzzy k-means
- Spectral k-means
- Streaming k-means [coming soon]
We want to make them easier to use by updating the APIs and making sure they
all work in the same way and have consistent inputs, outputs, diagnostics and
documentation.
This is a great way to gain an in-depth understanding of clustering
algorithms, familiarize yourself with Hadoop, Mahout clustering and good
software engineering principles.

[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Shannon Quinn (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661587#comment-13661587
]

Shannon Quinn commented on MAHOUT-1214:
---

Ted,

I'm not sure I follow. You mean use SSVD exclusively in place of Lanczos?

I'd love to assess performance and accuracy with ball or streaming k-means
instead. That's an excellent idea.

Improve the accuracy of the Spectral KMeans Method
--

[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-16 Thread Shannon Quinn (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659680#comment-13659680
]

Shannon Quinn commented on MAHOUT-1214:
---

1: Examining the orthogonality of eigenvectors has to do with
EigenVerificationJob, a part of the distributed Lanczos pipeline. It's used in
spectral KMeans, but also elsewhere in Mahout (essentially any time the
distributed Lanczos solver is used). Unless you're referring to a check that's
specific to the spectral KMeans domain?

2: This is an excellent point of improvement. I look forward to seeing the
patch.
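
For readers unfamiliar with what an orthogonality check on eigenvectors might
look like, here is a minimal Python sketch based on the cosine of the angle
between vectors. It is an assumption-laden illustration: the greedy selection,
the `tolerance` value, and the function names are hypothetical, vectors are
assumed to be nonzero, and it does not reproduce what EigenVerificationJob or
the proposed patch actually does.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(vectors, tolerance=1e-6):
    """Greedily keep vectors whose |cosine| with every kept vector is small."""
    kept = []
    for v in vectors:
        if all(abs(cos_angle(v, k)) <= tolerance for k in kept):
            kept.append(v)
    return kept


candidates = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],  # not orthogonal to the first two
    [0.0, 0.0, 2.0],
]
kept = select_orthogonal(candidates)
```

Nothing in this check is specific to spectral k-means, which is the point of
the question in item 1: it would apply to any set of candidate eigenvectors.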

Improve the accuracy of the Spectral KMeans Method
--

Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-02 Thread Shannon Quinn

This sounds excellent. I'd be happy to assist in unifying the interfaces 
of the spectral methods in particular.


On 5/2/13 3:54 PM, Yu Lee (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647841#comment-13647841
 ]

Yu Lee commented on MAHOUT-1177:


Hello Robin Anil, Jeff Eastman, Dan Filimon, and Ted Dunning,

Yexi and I (Yu Lee) are new to this Mahout community. We want to contribute to 
the improvement of Mahout by reforming and simplifying the clustering APIs per 
the following link:
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120

We have gone through the code of Mahout clustering. Now we have some ideas 
about improving it:

=
Addressing the problems in the current interface:

Test cases are missing. For example, in spectral kmeans clustering, the run 
methods of SpectralKmeansDriver and EigencutsDriver are not tested.

Documentation is missing for some methods. For example: in the run method of 
DirichletDriver, the description of parameter 'numModels' is missing; in the 
run method of SpectralKmeansDriver, the descriptions of some arguments are 
missing.

Some testing methods do not contain a specific description of some arguments. 
For example: in the run method of FuzzyKmeansDriver, the description of the 
argument m (fuzzification factor) is missing. Although a wiki link regarding 
Clustering Analysis is given, it is not clear enough.

-

Implementing some new clustering algorithms

Agglomerative hierarchical clustering, which will cluster the data points into 
a dendrogram, so that users can choose whatever number of clusters they 
want. (http://en.wikipedia.org/wiki/Hierarchical_clustering)

DBSCAN, which is a density-based clustering method able to identify 
clusters with arbitrary shapes, and is useful in spatial clustering. 
(http://en.wikipedia.org/wiki/DBSCAN)

-

Providing a new unified interface

Currently, each clustering algorithm has its own implemented class with 
different interfaces (i.e., run methods in different Drivers have different 
argument lists). However, it is better to have a unified interface to execute 
all available clustering methods, and an example interface is as follows:

Clustering-run(input, output, methodClass,clusteringConfig)

Here, the methodClass indicates a specific clustering method, while 
clusteringConfig indicates the configuration for this specific clustering method.
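
To make the proposed shape concrete, here is a small Python sketch of a single
entry point that dispatches on a method class and a configuration object. All
names here (`Clustering`, `ClusteringConfig`, `KMeansMethod`, the `k`
parameter) are hypothetical placeholders for illustration; they are not Mahout
classes, and the stand-in method does no clustering.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringConfig:
    """Hypothetical bag of per-method settings (e.g. k, max iterations)."""
    params: dict = field(default_factory=dict)


class KMeansMethod:
    """Stand-in for one clustering method; it only records what it was given."""

    def run(self, input_path, output_path, config):
        k = config.params.get("k", 2)
        return f"kmeans(k={k}): {input_path} -> {output_path}"


class Clustering:
    """Single entry point that dispatches to the chosen method class."""

    @staticmethod
    def run(input_path, output_path, method_class, clustering_config):
        method = method_class()
        return method.run(input_path, output_path, clustering_config)


result = Clustering.run("in/", "out/", KMeansMethod, ClusteringConfig({"k": 3}))
```

The design choice being illustrated is that callers see one argument list
regardless of algorithm, while method-specific options travel inside the
configuration object.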

=

Could you please let us know what you think about our ideas?


 


Re: Gsoc 2013 question

2013-04-09 Thread Shannon Quinn


Hi there.

If you don't have a fully-formed project idea or are otherwise looking 
for suggestions, feel free to post your question here.


Shannon

On 4/9/13 1:38 PM, George Zografos wrote:

Hello mahout dev community.
I have a question regarding a project idea for GSOC 2013.
Should I post it here or to JIRA as a comment?

Re: Welcome Suneel Marthi and Dan Filimon

2013-04-04 Thread Shannon Quinn


Congratulations! :)

On 4/4/13 6:30 AM, Grant Ingersoll wrote:

In recognition of the contributions of Suneel Marthi and Dan Filimon to the 
Mahout project, the PMC is pleased to announce both have accepted our 
invitations to join the Mahout project as committers.

As is customary, I will leave it to Suneel and Dan to provide a little bit of 
background on who they are.

Congratulations!

-Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-04-04 Thread Shannon Quinn

According to the GSoC calendar, accepted organizations aren't posted 
until April 8 (Monday), at which point (assuming Apache is accepted...I 
can't imagine it wouldn't be) slots will be doled out internally. This 
will probably take at least a day or two, so probably by middle of next 
week we'll know how many slots Mahout has.


Speaking of which: how do the various subprojects negotiate for slots? 
Is there a central spreadsheet, or an IRC meeting to attend? Or did I 
miss the email detailing this?


On 4/4/13 2:43 PM, Dan Filimon wrote:

Any news on this front? Did we get approved/assigned a slot/anything?


On Fri, Mar 29, 2013 at 7:44 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:


Ok, updated!


On Fri, Mar 29, 2013 at 7:36 PM, Andy Twigg andy.tw...@gmail.com wrote:


Dan,

I think what you've written is fine (I wanted to edit to remove the
'?' around random forests but couldn't).

ok?



On 29 March 2013 11:14, Dan Filimon dangeorge.fili...@gmail.com wrote:

I added Andy's first suggestion and Ted's suggestion as ideas.

Andy, could you flesh out your second suggestion into a project and make an
issue please?


On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning ted.dunn...@gmail.com wrote:

It should be possible to view a Lucene index as a matrix.  This would
require that we standardize on a way to convert documents to rows.  There
are many choices, the discussion of which should be deferred to the actual
work on the project, but there are a few obvious constraints:

a) it should be possible to get the same result as dumping the term vectors
for each document each to a line and converting that result using standard
Mahout methods.

b) numeric fields ought to work somehow.

c) if there are multiple text fields, that ought to work sensibly as well.
Two options include dumping multiple matrices or converting the fields
into a single row of a single matrix.

d) it should be possible to refer back from a row of the matrix to find the
correct document.  This might be because we remember the Lucene doc number
or because a field is named as holding a unique id.

e) named vectors and matrices should be used if plausible.
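
A toy Python sketch of the document-to-row idea, covering only the term-count
case and the row-to-document mapping from constraint (d). It uses plain
dictionaries rather than any Lucene or Mahout API; the input shape
({doc_id: {term: count}}) and all names are assumptions made for illustration.

```python
def documents_to_rows(documents):
    """Turn {doc_id: {term: count}} into (vocabulary, rows, row_ids).

    The vocabulary is sorted so that column order is deterministic, and
    row_ids[i] gives the document behind rows[i] (constraint d).
    """
    vocabulary = sorted({term for terms in documents.values() for term in terms})
    column = {term: j for j, term in enumerate(vocabulary)}
    rows, row_ids = [], []
    for doc_id in sorted(documents):
        row = [0] * len(vocabulary)
        for term, count in documents[doc_id].items():
            row[column[term]] = count
        rows.append(row)
        row_ids.append(doc_id)
    return vocabulary, rows, row_ids


# Made-up term counts for two documents.
documents = {
    "doc-b": {"mahout": 2, "lucene": 1},
    "doc-a": {"lucene": 3},
}
vocabulary, rows, row_ids = documents_to_rows(documents)
```

Keeping `row_ids` and `vocabulary` alongside the rows is a stand-in for the
named vectors and matrices suggested in (e).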

On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:
...
Ted, could you explain a bit more what you mean by simplify the connection
to Lucene for clustering and classification? It's too vague for an idea
proposal.




--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.tw...@cs.ox.ac.uk | +447799647538

Re: Call to action – Mahout needs your help

2013-03-26 Thread Shannon Quinn

I would love to help in any way I can. I'm fairly busy with my PhD 
studies until early May when I shift to an internship for the summer, so 
if I could have some help setting up tickets on JIRA for things we'd 
like to see done, I could take over the legwork once the summer hits. 
I'd be happy to work with Dan and mentor at least one student.


Shannon

On 3/26/13 10:06 AM, Isabel Drost wrote:

On Tue, Mar 26, 2013 at 12:12 PM, Dan Filimon
dangeorge.fili...@gmail.com wrote:


If you guys decide to participate in GSOC this year, I'd be happy to
spread the word and maybe even have a presentation about Mahout at
school. Also, since I'm squarely on the student side (doing my senior
project with Ted on Mahout) I think I have a good grasp of what the
problems are, especially for a beginner student.

And, if you do pick someone, I could help them part-time (especially
if they're from my school, you know, timezone and language help).
Of course, I wouldn't really want to be the main mentor since I'm
still really new and not a committer yet. :)


That sounds like an awesome proposal to me. What do others think?


Isabel

Re: Call to action – Mahout needs your help

2013-03-25 Thread Shannon Quinn






I think that you mentioned a very good point with stating that it is not
clear whether Mahout is a library or a standalone program to interact with
via the command line. IMO, it's first and foremost a library (similar to
Lucene), and this should also be reflected in the codebase.

That is my view as well and I think we have been moderately successful at it.


+1


As for the complexity issue, I don't know that we ever solve it, we just need 
to identify contributors in those areas quickly, mentor them, and make them 
committers as soon as they are ready.


On that note: GSoC is coming up, and I think it's a great opportunity to 
build some momentum in this direction. I know that when students see 
scalable machine learning their first thought isn't improving testing 
and documentation, but if we pushed hard in those areas specifically, in 
addition to making a broad effort on JIRA to elucidate exactly what 
needs work, we could likely pick up several quality students that could 
make lasting contributions.





I think that Mahout is and should always be more than recommenders, but
that we should be more courageous in throwing out things that are not
used very much or not maintained very much or don't meet the quality
standards which we would like to see.


+1 . On my end of things, while I do think some sort of canonical 
spectral clustering algorithm would be very useful to have, e.g. 
spectral k-means, the Eigencuts algorithm is one example of something 
that is so specialized that it could probably be jettisoned.

[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdpoweriter.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]


[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans

[
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600292#comment-13600292
]

Shannon Quinn commented on MAHOUT-1159:
---

Agreed on all points, not sure why I missed that. I've attached the patch and
will commit it unless you have any problems with it.


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdpoweriter.patch


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159-ssvdpoweriter.patch)


[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans


[ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600346#comment-13600346
 ] 

Shannon Quinn commented on MAHOUT-1159:
---

Committed. Thanks for your input.


[jira] [Resolved] (MAHOUT-794) Eigencuts produces unexpected results, part 2


 [ 
https://issues.apache.org/jira/browse/MAHOUT-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn resolved MAHOUT-794.
--

Resolution: Invalid

The exact problems with Eigencuts are numerous; they should each be their own 
tickets. Those will be forthcoming soon. For that reason, I am closing this one 
as it is too broad.

 Eigencuts produces unexpected results, part 2
 -

 Key: MAHOUT-794
 URL: https://issues.apache.org/jira/browse/MAHOUT-794
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.5
Reporter: Sean Owen
Assignee: Shannon Quinn
 Fix For: 0.8


 See MAHOUT-516, which was closed. Looks like Shannon believes there is a 
 follow-on issue. I'm just opening a new issue to track this for 0.6.
 This is an issue in the workflow of the Eigencuts algorithm; some part of it 
 is not implemented correctly. More details to follow.


Spectral fixes

2013-03-11 Thread Shannon Quinn

I have a load of fixes in the pipeline for the spectral clustering 
algorithms. The work on Eigencuts is extensive and still ongoing, so 
while I will post those tickets, the fixes will likely not make it for 0.8.


SpectralKmeans, however, has numerous fixes that are ready to go. Before 
I post and commit them, I would like some input on the following items:


1: We added the option to use SSVD in place of the Lanczos solver. Would 
it be acceptable to have a command-line flag to specify the solver to use?
2: Lots of temporary files are generated by the numerous MR jobs chained 
together. Is there a rule of thumb for whether or not to delete these 
intermediate files after running the whole job? Right now I have a 
command-line flag to indicate whether they should be removed or not.
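
For discussion purposes, the two flags in question could look something like
the following Python `argparse` sketch. The flag names `--solver` and
`--remove-temp` and the program name are hypothetical and are not the options
in the actual patch; the Lanczos default reflects the current default solver.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="spectralkmeans-sketch")
    # Hypothetical flag: which eigensolver to use.
    parser.add_argument("--solver", choices=["lanczos", "ssvd"], default="lanczos")
    # Hypothetical flag: delete intermediate files once the job finishes.
    parser.add_argument("--remove-temp", action="store_true")
    return parser


args = build_parser().parse_args(["--solver", "ssvd", "--remove-temp"])
defaults = build_parser().parse_args([])
```

In this sketch intermediate files are kept unless the flag is given, which is
the more conservative default when debugging chained jobs.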


Thanks!

Shannon

[jira] [Created] (MAHOUT-1159) Add SSVD option to SpectralKMeans

Shannon Quinn created MAHOUT-1159:
-

 Summary: Add SSVD option to SpectralKMeans
 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8


This adds SSVD as an option for eigensolver, in addition to the [default] 
Lanczos solver. Testing indicated it yielded similar clustering accuracy with a 
possible performance boost.

This patch includes other small fixes, such as using the default tempDir for 
intermediate calculations.

The initialization of the SSVD solver is a bit awkward, with specifying the 
number of reducers. I hard-coded this at 10; is there a better solution? 
Perhaps making it an optional parameter to the SSVD constructor?

[Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
Solanki, and Philip Schinis for working on this.]


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159.patch


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159.patch)


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159.patch


[jira] [Resolved] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn resolved MAHOUT-1159.
---

Resolution: Fixed


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdopts.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch




[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans

[
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599237#comment-13599237
]

Shannon Quinn commented on MAHOUT-1159:
---

Excellent points, thanks. Here's a new patch with the suggested fixes; let me
know if that works.

The only reason I noticed the discrepancy in the documentation is that SSVD in
spectral k-means was originally tested in standalone mode, where the
DistributedCache is not available.

Add SSVD option to SpectralKMeans
-

Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch

[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159-ssvdopts.patch)

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch




[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdopts.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch




[jira] [Comment Edited] (MAHOUT-1159) Add SSVD option to SpectralKMeans