Re: Community/planning call. 11/15

2019-11-15 Thread Shannon Quinn
Switching ISPs today so can't schedule any meetings :/ 

On 11/15/19, 11:34 AM, "Trevor Grant"  wrote:

I can't :(

On Fri, Nov 15, 2019 at 9:04 AM Andrew Palumbo  wrote:

> Hey all,
>
>
> Checking in to see if we were doing a call today?
>
> Is the usual time 10:00pst?
>
> Andy
>



Re: [VOTE] Retire gene...@mahout.apache.org

2019-01-24 Thread Shannon Quinn
+1

Had no idea that list existed.

On 1/24/19, 9:54 AM, "Andrew Musselman"  wrote:

+1

On Thu, Jan 24, 2019 at 06:53 Trevor Grant  wrote:

> A recent JIRA ticket [1] asked we add moderators to
> gene...@mahout.apache.org.
>
> Upon further investigation, the list was set up in 2010, has 9 subscribers,
> has had about 20 emails, and all of them (except an initial 'test' email)
> were general emails to all lists.
>
> I am opening a vote to retire the list gene...@mahout.apache.org (as
> opposed to adding moderators).  The list's content will be saved, future
> emails will bounce back.
>
> The vote will remain open for 72 hours.
>
> I am officially +1 for retiring.
>
> tg
>
> [1] https://issues.apache.org/jira/browse/MAHOUT-2056
>



Re: [NOTICE] Mandatory migration of git repositories to gitbox.apache.org

2019-01-03 Thread Shannon Quinn
+1

On 1/3/19, 2:31 PM, "Andrew Palumbo"  wrote:

I'd like to call a vote on moving to gitbox.  Here's my +1



Re: Friday hangout

2018-09-05 Thread Shannon Quinn

SGTM, moved it back a week on my calendar.

On 9/5/18 5:16 PM, Andrew Musselman wrote:

Looks like Andy and Trevor are both out for Friday; I think it makes sense
to postpone a week.

On Tue, Sep 4, 2018 at 8:30 AM Andrew Musselman wrote:


Chalk it up to trying something new :)

Yes 9-10 am Pacific Friday; the Hangouts link should work, if it doesn't
we'll use Zoom or something, will figure out in the first few minutes.

On Mon, Sep 3, 2018 at 3:04 PM Dmitriy Lyubimov  wrote:


so does mine. 9-10 am PST?

On Mon, Sep 3, 2018 at 12:10 PM Ivan Serdyuk <local.tourist.k...@gmail.com> wrote:


Google calendar reports "Could not find the requested".



On Mon, Sep 3, 2018 at 8:46 PM Andrew Palumbo wrote:

Probably my calendar messed it up.
Thx
--andy

On Sep 3, 2018 10:32 AM, Andrew Musselman wrote:

FYI, @andrew.. the calendar invite reads 9 am Sept 3 for me (android).






Re: Welcome our GSoC Student Aditya Sarma

2017-05-05 Thread Shannon Quinn

Likewise :) Welcome Aditya!

On 5/5/17 7:55 AM, Jim Jagielski wrote:

Wow! I will be lurking over your shoulder trying to learn as much
as I can!


On May 4, 2017, at 4:45 PM, Aditya  wrote:

Hi everyone,

It feels really nice to know that I've been selected to work with Mahout as
part of GSoC 2017. I'm a senior undergraduate student from BITS Pilani,
India, with interests in Data Mining,
Machine Learning (more towards the algorithmic aspects) and I think it's a
really good time to contribute to Mahout. From the time I've subscribed to
the mailing list, I've seen quite a few big developments happening, be it
the logo design, reworking the website or work on the algorithms framework.

I'm looking forward to working with Trevor on the clustering submodule. And
it's good to see that Trevor recently issued a JIRA ticket for adding the
Canopy clustering algorithm to Mahout.

I hope I will be a part of many more algorithms that will be a part of
Mahout in the future.

Thank you all.

PS: Special thanks to you Trevor, for guiding me through the process and
answering all my doubts/queries.

Regards,
Aditya




On Fri, May 5, 2017 at 1:10 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:


All The Best Aditya!




On Thu, May 4, 2017 at 11:46 PM, Andrew Palumbo 
wrote:


Welcome!!





 Original message 
From: Jim Jagielski 
Date: 05/04/2017 10:33 AM (GMT-08:00)
To: dev@mahout.apache.org
Cc: priv...@mahout.apache.org, u...@mahout.apache.org, Aditya <
adityasarma...@gmail.com>
Subject: Re: Welcome our GSoC Student Aditya Sarma

Welcome!!

On May 4, 2017, at 1:24 PM, Trevor Grant wrote:

Hello all,

I want to extend a warm welcome to Aditya Sarma, who has been accepted to
the Mahout Project as Part of the Google Summer of Code program.

Aditya will be working on "DBSCAN Clustering In Mahout", if you go back in
the archives you can see his full proposal.

We're really excited to have him, and looking forward to a great summer.

Aditya, would you like to say a few words to introduce yourself?






Re: Native CUDA support

2017-03-28 Thread Shannon Quinn

Loved this proposal. Excited to see the POC.

On 3/27/17 10:16 PM, Andrew Palumbo wrote:

Thank you, Nikolai.  This is great news!


From: Dmitriy Lyubimov 
Sent: Monday, March 27, 2017 7:55:20 PM
To: dev@mahout.apache.org
Subject: Re: Native CUDA support

thanks.

JCuda sounds good. :)

On Fri, Mar 10, 2017 at 9:06 AM, Nikolai Sakharnykh wrote:


Hello everyone,

We're actively working on adding native CUDA support to Apache Mahout.
Currently, GPU acceleration is enabled through ViennaCL (
http://viennacl.sourceforge.net/). ViennaCL is a linear algebra framework
that provides multiple backends including OpenMP, OpenCL and CUDA. However,
as we recently discovered, the CUDA backend in ViennaCL is composed of
manually written CUDA kernels that are not well tuned for the latest GPU
architectures. Instead, we decided to explore a way to leverage CUDA
libraries for linear algebra: cuBLAS (dense matrices), cuSPARSE (sparse
matrices) and cuSOLVER (dense factorizations and sparse solvers). These
libraries are highly tuned by NVIDIA and provide the best performance for
many linear algebra primitives on the NVIDIA GPU architecture. Moreover,
the libraries are receiving frequent updates with new CUDA toolkit
releases: bug fixes, new functionality and optimizations.

We considered two approaches:

   1.  Direct calls to the CUDA runtime and libraries through the JavaCPP bridge
   2.  Use the JCuda package (http://www.jcuda.org/)

JCuda is a thin Java layer on top of the CUDA runtime and already provides
Java wrappers for all available CUDA libraries, so it makes sense to choose
this path. JCuda also provides a mechanism to call custom CUDA kernels by
compiling them into PTX with the NVIDIA NVCC compiler and then loading them
through CUDA driver API calls in Java code. Here is example code that
allocates a pointer (cudaMalloc) and copies data to the GPU (cudaMemcpy)
using JCuda:

// Allocate memory on the device
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);

// Copy the host data to the device
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
   cudaMemcpyKind.cudaMemcpyHostToDevice);

Alternatively, a pointer can be allocated using cudaMallocManaged and then
it can be accessed on the CPU or on the GPU without explicit copies by
leveraging Unified Memory. This enables a simpler data management model and,
on the newer architectures, enables features like on-demand paging and
transparent GPU memory oversubscription.

All CUDA libraries operate directly on the GPU pointers. Here is an
example of calling a single-precision GEMM with JCuda:

// Allocate memory on the device
Pointer d_A = new Pointer();
Pointer d_B = new Pointer();
Pointer d_C = new Pointer();
cudaMalloc(d_A, n * n * Sizeof.FLOAT);
cudaMalloc(d_B, n * n * Sizeof.FLOAT);
cudaMalloc(d_C, n * n * Sizeof.FLOAT);

// Copy the memory

// Execute sgemm
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
   pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
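
For readers without a GPU at hand, the arithmetic behind the SGEMM call can be sketched on the CPU in plain Java. This is only a reference sketch under stated assumptions: the matrices are n x n, stored column-major with leading dimension n (as the call above passes them), and the class and method names are made up for illustration. It is not JCuda, cuBLAS, or Mahout code.

```java
// CPU reference for C = alpha * A * B + beta * C on n x n column-major matrices.
// Hypothetical helper for illustration; not part of JCuda, cuBLAS or Mahout.
class SgemmReference {

    // Element (row, col) of a column-major matrix with leading dimension n
    // is stored at index row + col * n.
    static void sgemm(int n, float alpha, float[] a, float[] b, float beta, float[] c) {
        for (int col = 0; col < n; col++) {
            for (int row = 0; row < n; row++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[row + k * n] * b[k + col * n];
                }
                c[row + col * n] = alpha * sum + beta * c[row + col * n];
            }
        }
    }

    public static void main(String[] args) {
        int n = 2;
        // Column-major storage: A = [[1, 2], [3, 4]] is laid out as {1, 3, 2, 4}.
        float[] a = {1f, 3f, 2f, 4f};
        float[] identity = {1f, 0f, 0f, 1f};
        float[] c = new float[n * n];
        sgemm(n, 1f, a, identity, 0f, c);
        // Multiplying by the identity returns A unchanged.
        System.out.println(java.util.Arrays.toString(c));
    }
}
```

A reference like this is mainly useful for checking a GPU result on small inputs; it makes no attempt at performance.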

Most of the existing sparse matrix classes and sparse matrix conversion
routines in Mahout can generally maintain their structure as the CSR format
is well-supported in both cuSPARSE and cuSOLVER libraries.
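
As a reminder of what the CSR format mentioned above looks like, here is a minimal Java sketch of the three CSR arrays and a sparse matrix-vector product over them. The class name CsrSketch and its methods are hypothetical and exist only for this illustration; they are not Mahout's sparse matrix classes and not the cuSPARSE API.

```java
// Minimal CSR (compressed sparse row) layout with a matrix-vector product.
// Illustrative only; names are hypothetical, not Mahout or cuSPARSE APIs.
class CsrSketch {
    final int rows;
    final int[] rowPtr;    // length rows + 1; row i occupies [rowPtr[i], rowPtr[i + 1])
    final int[] colIdx;    // column index of each stored value
    final double[] values; // non-zero values in row-major order

    CsrSketch(int rows, int[] rowPtr, int[] colIdx, double[] values) {
        this.rows = rows;
        this.rowPtr = rowPtr;
        this.colIdx = colIdx;
        this.values = values;
    }

    // Build the CSR arrays from a dense matrix, keeping only non-zero entries.
    static CsrSketch fromDense(double[][] dense) {
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) {
                    nnz++;
                }
            }
        }
        int[] rowPtr = new int[dense.length + 1];
        int[] colIdx = new int[nnz];
        double[] values = new double[nnz];
        int next = 0;
        for (int i = 0; i < dense.length; i++) {
            rowPtr[i] = next;
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    colIdx[next] = j;
                    values[next] = dense[i][j];
                    next++;
                }
            }
        }
        rowPtr[dense.length] = next;
        return new CsrSketch(dense.length, rowPtr, colIdx, values);
    }

    // y = A * x, touching only the stored non-zero entries.
    double[] multiply(double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            for (int p = rowPtr[i]; p < rowPtr[i + 1]; p++) {
                y[i] += values[p] * x[colIdx[p]];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        CsrSketch m = fromDense(new double[][] {{1, 0, 2}, {0, 0, 3}, {4, 5, 0}});
        System.out.println(java.util.Arrays.toString(m.rowPtr));
        System.out.println(java.util.Arrays.toString(m.multiply(new double[] {1, 1, 1})));
    }
}
```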

Our plan is to create a proof-of-concept implementation first to
demonstrate matrix-matrix and/or matrix-vector multiplication using CUDA
libraries, then expand functionality by adding more BLAS operations and
advanced algorithms that exist in cuSOLVER. Stay tuned for more updates!

Regards,
Nikolai.








Re: Contributing to Apache Mahout via Google Summer of Code 2017

2017-03-10 Thread Shannon Quinn

Hi Aditya,

Looking forward to working with you! Please feel free to contact me 
directly and we'll get you up and running.


Shannon

On 3/9/17 11:43 AM, Andrew Musselman wrote:

Aditya, Shannon Quinn is a committer who mentioned he'd be happy to mentor
you this season.

Thanks again for reaching out; let him know where you are in the process
and let's get you up and running.

Best
Andrew


On Tue, Mar 7, 2017 at 8:58 AM, Andrew Palumbo <ap@outlook.com> wrote:


Hello Aditya,


Welcome to Mahout!


Here is some information on GSoC requirements [1].  While Mahout has
participated in the GSoC before, please note that you must find a committer
on the Mahout project who will have the bandwidth to be a good mentor to
you.  My suggestion would be to start with this step.


Hope this helps,


Andy


[1] https://community.apache.org/gsoc.html




From: Aditya <adityasarma...@gmail.com>
Sent: Tuesday, March 7, 2017 10:06:54 AM
To: Andrew Musselman
Cc: dev@mahout.apache.org
Subject: Re: Contributing to Apache Mahout via Google Summer of Code 2017

Hello Andrew,

With respect to the application process for Google Summer of Code, once you
guys are sure that I'll be able to contribute to the community on something
meaningful during summer, I'll be drafting a proposal (with guidance from
someone from Mahout) detailing the timeline and the plan of action. I guess
the mentor will be decided based on what my contribution is going to be.
I'm sorry for the delay in replying. It has been a crazy week for me in
terms of my thesis. I'll take some time and look through the framework and
try to get some hands on experience by contributing to a bug fix / an
existing issue in Jira.

I had a question though: do I submit the application proposal to someone
from Mahout, or is there a centralized process for ASF as a whole?

PS: With respect to help from your side, it would be a mentor willing to
mentor me through the summer as I work. I will talk to Saikat and see of
what use I can be in the Generalized Linear Model that has been proposed.

Best Regards,
Aditya




On Thu, Mar 2, 2017 at 4:43 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

Also would probably be good to subscribe to the community list by sending
mail to dev-subscr...@community.apache.org.

On Wed, Mar 1, 2017 at 3:00 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


Aditya, could you go through the page Isabel noted above (
http://community.apache.org/gsoc.html) and let us know what help you
need applying or moving the process forward?

Thanks!

On Wed, Mar 1, 2017 at 1:11 AM, Isabel Drost-Fromm <isa...@apache.org>
wrote:


Disclaimer: Last time I mentored is years ago.

Back then the ASF applied as a whole to be accepted as an organisation for
GSoC. I remember Uli Stärk explaining once to me some of the paperwork
required from the "OSS project side". Upon acceptance, Google granted
several dozen slots for students, those were then distributed across
projects. The exact process should be on the com dev page here:

http://community.apache.org/gsoc.html

Isabel







Re: Future Mahout - Zeppelin work

2016-05-20 Thread Shannon Quinn

Agreed, thoroughly enjoying the blog post.

On 5/19/16 12:01 AM, Andrew Palumbo wrote:

Well done, Trevor!  I've not yet had a chance to try this in zeppelin but I 
just read the blog which is great!

 Original message 
From: Trevor Grant 
Date: 05/18/2016 2:44 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Ah thank you.

Fixing now.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo  wrote:


Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
actually:


/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

rather than:


/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

(In the spark module that is)

From: Trevor Grant 
Sent: Wednesday, May 18, 2016 11:02:43 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

ah yes- I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot (free
hint: the secret was to clear my cache *face-palm*)

I'm going to add that dependency to the readme.md now.

thanks,
tg



On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo wrote:


Trevor this is very cool- I have not been able to look at it closely yet
but just a small point: I believe that you'll also need to add the

mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

For things like the classification stats, confusion matrix, and t-digest.

Andy


From: Trevor Grant 
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2

https://github.com/rawkintrevo/mahout-zeppelin

Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
into Zeppelin:




https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json




https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

So my thoughts on next steps, which I'm positing only as a starting point
for discussion, and are in no particular order of importance:

- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
   - I think the case for making a build-profile is that Zeppelin is first
and foremost a datascience tool for non technical users.
   - If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.

The basic deal here is we are:
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
 - This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do mahout things as you do
3) export table to tsv string, which is passed to a resource pool
- This could be done to a disk if you didn't have zeppelin
4) read the tsv from the resource pool (or disk if you didn't have
zeppelin) in R (python soon) and create a 

To Pat's point- this is a kind of clumsy pipeline, however the Zeppelin
wrapper at least makes it *feel* less so.
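
The "export table to tsv string" step, together with the "sample of a matrix" sanity idea from the list of next steps, could look roughly like the Java sketch below. It assumes the matrix has already been collected into a plain double[][] on the driver; the class name TsvSketch, the method toTsv, and the maxRows parameter are hypothetical and not part of Mahout or Zeppelin. For simplicity it keeps the first maxRows rows, which is a truncation rather than a random sample.

```java
// Turn an in-memory matrix into a TSV string, capped at maxRows rows.
// Hypothetical helper for illustration; not a Mahout or Zeppelin API.
class TsvSketch {

    static String toTsv(double[][] matrix, String[] header, int maxRows) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.join("\t", header)).append('\n');
        // Keep only the first maxRows rows so a large matrix cannot flood the notebook.
        int limit = Math.min(maxRows, matrix.length);
        for (int i = 0; i < limit; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                if (j > 0) {
                    sb.append('\t');
                }
                sb.append(matrix[i][j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}, {5, 6}};
        System.out.print(toTsv(m, new String[] {"x", "y"}, 2));
    }
}
```

The resulting string is the kind of value that could be handed to the resource pool or written to disk, and then read as a tab-separated table on the R side.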




On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel wrote:

Seems like there is plenty to use in ggplot or python but the pipeline is
a 

Re: Congratulations to our new Chair

2016-04-21 Thread Shannon Quinn

Thanks Suneel for your excellent leadership.

Congratulations Andrew!

On 4/21/16 3:38 AM, Alessandro Negro wrote:

Congratulations!

On 21 Apr 2016, at 02:36, khurrum.na...@useitc.com wrote:


Congrats.

Sent from my iPhone


On Apr 20, 2016, at 8:33 PM, Andrew Palumbo  wrote:

Thanks, you guys!

 Original message 
From: Andrew Musselman 
Date: 04/20/2016 8:14 PM (GMT-05:00)
To: dev@mahout.apache.org, u...@mahout.apache.org
Subject: Re: Congratulations to our new Chair

Suneel, thanks for your great work as Chair and thank you Andy for stepping in!


On Wed, Apr 20, 2016 at 5:00 PM, Dmitriy Lyubimov  wrote:

congrats!


On Wed, Apr 20, 2016 at 4:55 PM, Suneel Marthi  wrote:

Please join me in congratulating Andrew Palumbo on becoming our new Project
Chair.

As for me, it was a pleasure to serve as Chair starting with the Mahout
0.10.0 release and ending with the recent 0.12.0 release, and perhaps we
will do it again someday




Re: Welcome Anand Avati

2015-04-22 Thread Shannon Quinn

Welcome to the team, Anand!

On 4/22/15 2:55 PM, Pat Ferrel wrote:

Welcome Anand!

On Apr 22, 2015, at 11:29 AM, Andrew Palumbo ap@outlook.com wrote:

Congratulations Anand, Welcome to the team!

On 04/22/2015 02:18 PM, Gokhan Capan wrote:

Welcome Anand!

Sent from my iPhone


On Apr 22, 2015, at 20:47, Dmitriy Lyubimov dlie...@gmail.com wrote:

congrats and thank you!

-d

On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


Welcome to the team Anand; thanks for your contributions!


On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati av...@gluster.org wrote:

Thank you Suneel, I am thrilled to join the team!

I am a relative newbie to data mining and machine learning. I currently
work at Red Hat, but have joined grad school (in machine learning) starting
this fall.

I look forward to continuing my contributions, and thank you once again for
the opportunity.

Anand


On Wed, Apr 22, 2015, 08:08 Suneel Marthi smar...@apache.org wrote:

In recognition of the contributions of Anand Avati to the Mahout project
over the past year, the PMC is pleased to announce that he has accepted our
invitation to join the Mahout project as a committer.

As is customary, I will leave it to Anand to provide a little bit of
background about himself.

Congratulations and Welcome!

-Suneel Marthi
On Behalf of Mahout PMC






Re: [VOTE] Apache Mahout 0.10.0 Release

2015-04-11 Thread Shannon Quinn
+1 from non-PMC :)

iPhone'd

 On Apr 11, 2015, at 12:25, Suneel Marthi suneel.mar...@gmail.com wrote:
 
 Thanks everyone. We have had 5  +1 votes from the PMC and this release has
 passed and the Voting officially closes.
 Will send a formal release announcement once the release is finalized.
 
 Thanks again.
 
 On Sat, Apr 11, 2015 at 12:20 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Just built an external app using sbt against the staging repo and it looks
 good to me
 
 +1 (binding)
 
 On Apr 11, 2015, at 9:12 AM, Andrew Palumbo ap@outlook.com wrote:
 
 After testing examples locally from .tar and .zip distribution and testing
 the staged mahout-math artifact in a java application, I am happy with this
 release.
 
 +1 (binding)
 On 04/11/2015 11:45 AM, Suneel Marthi wrote:
 After checking the {source} * {tar,zip} and running a few tests locally, I
 am fine with this release.
 
 +1 (binding)
 
 On Sat, Apr 11, 2015 at 11:43 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 After checking the binary tarball and zip, and running through all the
 examples on an EMR cluster, I am good with this release.
 
 +1 (binding)
 
 On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 Ah... forgot this.
 
 +1 (binding)
 
 On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 I downloaded and tested the signatures and check-sums on {binary,source} x
 {zip,tar} + pom.  All were correct.
 
 One thing that I worry a little about is that the name of the artifact
 doesn't include apache.  Not sure that is a hard requirement, but it
 seems a good thing to do.
 
 
 
 On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi 
 suneel.mar...@gmail.com
 wrote:
 
 Here's a new Mahout 0.10.0 Release Candidate at
 https://repository.apache.org/content/repositories/orgapachemahout-1007/
 The Voting for this ends tomorrow.  Need at least 3 PMC +1 for the
 release to pass.
 
 Grant, Ted:  Would appreciate if u guys could verify the signatures.
 
 
 Rest: Please test the artifacts.
 
 Thanks to all the contributors and committers.
 
 Regards,
 Suneel
 
 On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
 Ran well but we have a packaging problem with the binary distro. Will
 require either a pom or code change I think, hold the vote.
 
 
 
 On Apr 9, 2015, at 4:31 PM, Andrew Musselman 
 andrew.mussel...@gmail.com
 wrote:
 
 Running on EMR now.
 
 On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 I can't run it (due to messed up dev machine) but I verified the artifacts
 building an external app with sbt using the staged repo instead of my
 local .m2 cache. This means the Scala classes were resolved correctly from
 the artifacts.
 
 Hope someone can actually run it on a cluster
 
 
 On Apr 9, 2015, at 2:42 PM, Suneel Marthi 
 suneel.mar...@gmail.com
 wrote:
 Please find the Mahout 0.10.0 release candidate at
 https://repository.apache.org/content/repositories/orgapachemahout-1005/
 The Voting runs till Saturday, April 11 2015, need at least 3 PMC +1 votes
 for the candidate release to pass.
 
 Thanks again to all the committers and contributors for their hard work
 over the past few weeks.
 
 Regards,
 Suneel
 On Behalf of Apache Mahout Team
 
 
 


Re: Anyone using eclipse?

2015-03-30 Thread Shannon Quinn
Unsuccessfully thus far, but yes I'm on eclipse. 

iPhone'd

 On Mar 30, 2015, at 18:58, Suneel Marthi suneel.mar...@gmail.com wrote:
 
 I believe it's only Shannon from amongst the committer team who is using
 Eclipse. I am talking him into shifting to IntelliJ.
 
 On Mon, Mar 30, 2015 at 6:54 PM, Stevo Slavić ssla...@gmail.com wrote:
 
 Hello team,
 
 I'm curious, is anyone of you using eclipse IDE?
 If not, then as part of MAHOUT-1278 I could remove a lot from our POMs.
 
 Kind regards,
 Stevo Slavic.
 


[jira] [Commented] (MAHOUT-1662) Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans

2015-03-30 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386917#comment-14386917
 ] 

Shannon Quinn commented on MAHOUT-1662:
---

https://github.com/apache/mahout/pull/89

 Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans
 

 Key: MAHOUT-1662
 URL: https://issues.apache.org/jira/browse/MAHOUT-1662
 Project: Mahout
  Issue Type: Bug
  Components: Examples, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 0.10.0


 Received the following error when attempting to run DisplaySpectralKMeans:
 Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://tmp/calculations/diagonal/part-r-0/tmp/calculations/diagonal/part-r-0, expected: file:///
   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
   at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1750)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
   at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.<init>(SequenceFileValueIterator.java:56)
   at org.apache.mahout.clustering.spectral.VectorCache.load(VectorCache.java:115)
   at org.apache.mahout.clustering.spectral.MatrixDiagonalizeJob.runJob(MatrixDiagonalizeJob.java:77)
   at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:170)
   at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:117)
   at org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:76)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
 Tracked the origin of the bug to line 54 of SequenceFileValueIterator. PR 
 which contains a fix is available; I would ask for independent verification 
 before merging it with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1662) Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans

2015-03-30 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1662:
-

 Summary: Potential Path bug in SequenceFileVaultIterator breaks 
DisplaySpectralKMeans
 Key: MAHOUT-1662
 URL: https://issues.apache.org/jira/browse/MAHOUT-1662
 Project: Mahout
  Issue Type: Bug
  Components: Examples, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 0.10.0


Received the following error when attempting to run DisplaySpectralKMeans:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://tmp/calculations/diagonal/part-r-0/tmp/calculations/diagonal/part-r-0, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1750)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.<init>(SequenceFileValueIterator.java:56)
    at org.apache.mahout.clustering.spectral.VectorCache.load(VectorCache.java:115)
    at org.apache.mahout.clustering.spectral.MatrixDiagonalizeJob.runJob(MatrixDiagonalizeJob.java:77)
    at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:170)
    at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:117)
    at org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:76)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Tracked the origin of the bug to line 54 of SequenceFileValueIterator. PR which 
contains a fix is available; I would ask for independent verification before 
merging it with master.
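
One plausible reading of the "Wrong FS" message, not confirmed anywhere in this report, is that a two-slash file:// URI makes the first path segment parse as the authority (host) instead of as part of the path. The sketch below uses only java.net.URI from the JDK, not Hadoop's Path or FileSystem classes, to show how the two spellings differ; the class name FileUriSketch is made up for illustration.

```java
import java.net.URI;

// Compare how a two-slash and a three-slash file URI are split into parts.
// Illustration with the JDK's java.net.URI only; Hadoop is not involved here.
class FileUriSketch {

    static String describe(String text) {
        URI uri = URI.create(text);
        return "authority=" + uri.getAuthority() + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Two slashes: "tmp" is taken as the authority, not as a directory.
        System.out.println(describe("file://tmp/calculations/diagonal"));
        // Three slashes: no authority, and the path keeps its leading /tmp.
        System.out.println(describe("file:///tmp/calculations/diagonal"));
    }
}
```

If this is what happens at the reported line, building the path with three slashes, or from a plain absolute path with no scheme, would avoid the mismatch; the actual fix is in the linked pull request.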



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1540:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 1.0)
   0.10.1

 Reuters example for spectral clustering
 ---

 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1


 Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
 spectral clustering using the Reuters dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1538:
--
Fix Version/s: (was: 0.10.0)
   0.10.1

 Port spectral clustering to Mahout DSL
 --

 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, Spark, scala
 Fix For: 0.10.1


 Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
 (already ported) and K-means (currently in progress, or can use Spark MLlib 
 implementation as a temporary fix).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1539:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 0.10.0)
   0.10.1

 Implement affinity matrix computation in Mahout DSL
 ---

 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1

 Attachments: ComputeAffinities.scala


 This has the same goal as MAHOUT-1506, but rather than code the pairwise 
 computations in MapReduce, this will be done in the Mahout DSL.
 An orthogonal issue is the format of the raw input (vectors, text, images, 
 SequenceFiles), and how the user specifies the distance equation and any 
 associated parameters.
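
To make the pairwise computation concrete, here is a small single-machine Java sketch of an affinity matrix. It assumes a Gaussian (RBF) kernel on Euclidean distance with a bandwidth parameter sigma, which is only one possible choice of "distance equation"; the issue does not say which one is intended. The class and method names are hypothetical, and this is neither the attached ComputeAffinities.scala nor Mahout DSL code.

```java
// Pairwise Gaussian affinities: a[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
// Single-machine sketch under an assumed kernel; not the Mahout DSL implementation.
class AffinitySketch {

    static double[][] gaussianAffinities(double[][] points, double sigma) {
        int n = points.length;
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++) {
            // The matrix is symmetric, so compute each pair once and mirror it.
            for (int j = i; j < n; j++) {
                double squaredDistance = 0.0;
                for (int d = 0; d < points[i].length; d++) {
                    double diff = points[i][d] - points[j][d];
                    squaredDistance += diff * diff;
                }
                double w = Math.exp(-squaredDistance / (2.0 * sigma * sigma));
                a[i][j] = w;
                a[j][i] = w;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {3, 4}, {0, 1}};
        double[][] a = gaussianAffinities(points, 5.0);
        for (double[] row : a) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Note that this leaves the diagonal at 1.0; some spectral clustering formulations zero it out, which would be a one-line change.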



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-29 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386041#comment-14386041
 ] 

Shannon Quinn commented on MAHOUT-1540:
---

Given that this issue has explicit dependencies on MAHOUT-1538, and Saikat is 
still working on MAHOUT-1539, I propose bumping this to 0.10.1.

Plus, I'll need some assistance from everyone in familiarizing myself with the 
process of converting the Reuters dataset to something I can compute affinities 
from to construct the similarity matrix.

 Reuters example for spectral clustering
 ---

 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
  Labels: DSL, scala, spark
 Fix For: 0.10.1


 Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
 spectral clustering using the Reuters dataset.





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386005#comment-14386005
 ] 

Shannon Quinn commented on MAHOUT-1659:
---

Pull request created: https://github.com/apache/mahout/pull/88

 Remove deprecated Lanczos solver from spectral clustering in mr-legacy
 --

 Key: MAHOUT-1659
 URL: https://issues.apache.org/jira/browse/MAHOUT-1659
 Project: Mahout
  Issue Type: Task
  Components: Clustering, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.10.0


 Spectral clustering still has the option of using either SSVD or the Lanczos 
 solver for dimensionality reduction. Remove the latter entirely.





Re: Mahout 0.10.0 Bug bash

2015-03-28 Thread Shannon Quinn
Wait, I thought all DSL work on spectral clustering was waiting until 0.10.1?

iPhone'd

 On Mar 28, 2015, at 13:49, Suneel Marthi suneel.mar...@gmail.com wrote:
 
 Seems like we are stretched pretty thin given the work load, not to mention
 that Mahout work is completely orthogonal to our paychecks.
 
 Ted, Grant, Shannon - possible you guys could take some of the load??
 
 On Sat, Mar 28, 2015 at 1:25 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Today's:
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1477: Clean up website on Logistic Regression
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Getting an exception when I provide classification labels manually
 for Naive Bayes
 M-1493: Port Naive Bayes to Spark DSL(Patch available)
 M-1559: Documentation and cleanup for Naive Bayes Example
 M-1609: NullPointerException
 M-1607: Spark-shell DAG scheduler
 
 Andrew Musselman
 -
 M-1655: Refactor module dependencies
 M-1522: Handle logging levels via log4j.xml
 M-1563: cleanup Warnings during Build
 M-1470: LDA Topic dump
 M-1462: Cleaning up Random Forests documentation on Mahout website
 
 Dmitriy Lyubimov
 --
 M-1646: Refactor out all legacy MR dependencies from scala code
 
 Frank Scholten
 -
 M-1649: Lucene 5 upgrade
 M-1625: lucene2seq: failure to convert a document that does not contain a
 field (the field is not required)
 
 Pat Ferrel
 -
 M-1589: mahout.cmd has duplicated content(Patch available)
 M-1618: co-occurence recommender example
 
 Suneel Marthi
 -
 M-1586: Collections downloads must have hash signatures
 M-1647: The release build is incomplete
 M-1652: Java 7 update
 M-1512: Hadoop 2 compatibility
 M-1469: Streaming KMeans fails when executed in MR mode and
 REDUCE_STREAMING_KMEANS
 set to true
 M-1443: Update How to Release page(Tagged 0.10.1)
 M-1585: Javadocs not hosted by Mahout-Quality
 M-1612: NPE during JSON outputformatter for clusterdump
 M-1656: Change SNAPSHOT version from 1.0 to 0.10
 M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
 M-1619: HighDFWordsPruner overwrites cache files
 
 Stevo Slavic
 
 M-1650: upgrade 3rd party jars
 M-1602: Euclidean Distance Similarity Math
 M-1278: Improve inheritance of apache parent pom
 M-1562: Publish Scaladocs
 M-1277: Lose dependency on custom commons-cli
 
 Shannon Quinn
 ---
 M-1538: Port spectral clustering to Mahout DSL
 M-1539: Implement affinity matrix computation in Mahout DSL
 M-1540: Reuters Example spectral clustering Also online docs for Spectral
 clustering
 M-1659: Remove deprecated Lanczos solver from spectral clustering in
 mr-legacy
 
 Ted Dunning
 ---
 M-1636: Class dependencies for Spark module are put in job.jar, which is
 inefficient
 
 Sebastian Schelter
 --
 M-1584: Create a detailed example of how to index an arbitrary dataset and
 run LDA on it(Patch available)
 
 Gokhan Capan
 --
 M-1626: Support for required quasi-algebraic operations and starting with
 aggregating rows/blocks
 
 Unassigned
 --
 M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
 available)
 M-1593: cluster-reuters.sh does not work complaining
 java.lang.IllegalStateException(Patch available)
 M-1557: Add support for sparse training vectors in MLP(Patch available)
 M-1516: run classify-20newsgroups.sh failed cause by
 /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
 available)
 M-1643: CLI arguments are not being processed in spark-shell
 M-1637: RecommenderJob of ALS fails in the mapper because it uses the
 instance of other class
 M-1634: ALS don't work when it adds new files in Distributed Cache
 (Patch available)
 M-1633: Failure to execute query when solr index contains documents with
 different fields
 M-1551: Add document to describe how to use mlp with command line(Patch
 available)
 
 On Thu, Mar 26, 2015 at 7:07 PM, Suneel Marthi suneel.mar...@gmail.com
 wrote:
 
 Ok here's the bug bash as of today
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Exception when providing classification Labels
 M-1493: Port Naive Bayes to Spark DSL
 M-1559: Documentation and cleanup for Naive Bayes Example
 M-1609: NullPointerException
 M-1607: Spark-shell DAG scheduler
 
 Andrew Musselman
 -
 M-1655: Refactor module dependencies
 M-1563: cleanup Warnings during Build
 M-1470: LDA Topic dump
 
 Dmitriy Lyubimov
 --
 M-1646: Refactor out all legacy MR dependencies from scala code
 
 Frank Scholten

Re: Mahout 0.10.0 Bug bash

2015-03-28 Thread Shannon Quinn
Ah no worries, just got a bit panicked when I saw that. 

Summer will be better for me but for now these tickets have about maxed me out; 
3 months into the new tenure-track shtick is grueling. 

iPhone'd

 On Mar 28, 2015, at 14:27, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Okay, go ahead and move it; I was just moving things from 1.0 to 0.10.0
 almost indiscriminately.
 
 On Sat, Mar 28, 2015 at 11:22 AM, Shannon Quinn squ...@gatech.edu wrote:
 
 Wait, I thought all DSL work on spectral clustering was waiting until
 0.10.1?
 
 iPhone'd
 
 On Mar 28, 2015, at 13:49, Suneel Marthi suneel.mar...@gmail.com
 wrote:
 
 Seems like we are stretched pretty thin given the work load, not to
 mention
 that Mahout work is completely orthogonal to our paychecks.
 
 Ted, Grant, Shannon - possible you guys could take some of the load??
 
 On Sat, Mar 28, 2015 at 1:25 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Today's:
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1477: Clean up website on Logistic Regression
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Getting an exception when I provide classification labels
 manually
 for Naive Bayes
 M-1493: Port Naive Bayes to Spark DSL(Patch available)
 M-1559: Documentation and cleanup for Naive Bayes Example
 M-1609: NullPointerException
 M-1607: Spark-shell DAG scheduler
 
 Andrew Musselman
 -
 M-1655: Refactor module dependencies
 M-1522: Handle logging levels via log4j.xml
 M-1563: cleanup Warnings during Build
 M-1470: LDA Topic dump
 M-1462: Cleaning up Random Forests documentation on Mahout website
 
 Dmitriy Lyubimov
 --
 M-1646: Refactor out all legacy MR dependencies from scala code
 
 Frank Scholten
 -
 M-1649: Lucene 5 upgrade
 M-1625: lucene2seq: failure to convert a document that does not contain
 a
 field (the field is not required)
 
 Pat Ferrel
 -
 M-1589: mahout.cmd has duplicated content(Patch available)
 M-1618: co-occurence recommender example
 
 Suneel Marthi
 -
 M-1586: Collections downloads must have hash signatures
 M-1647: The release build is incomplete
 M-1652: Java 7 update
 M-1512: Hadoop 2 compatibility
 M-1469: Streaming KMeans fails when executed in MR mode and
 REDUCE_STREAMING_KMEANS
 set to true
 M-1443: Update How to Release page(Tagged 0.10.1)
 M-1585: Javadocs not hosted by Mahout-Quality
 M-1612: NPE during JSON outputformatter for clusterdump
 M-1656: Change SNAPSHOT version from 1.0 to 0.10
 M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
 M-1619: HighDFWordsPruner overwrites cache files
 
 Stevo Slavic
 
 M-1650: upgrade 3rd party jars
 M-1602: Euclidean Distance Similarity Math
 M-1278: Improve inheritance of apache parent pom
 M-1562: Publish Scaladocs
 M-1277: Lose dependency on custom commons-cli
 
 Shannon Quinn
 ---
 M-1538: Port spectral clustering to Mahout DSL
 M-1539: Implement affinity matrix computation in Mahout DSL
 M-1540: Reuters Example spectral clustering Also online docs for
 Spectral
 clustering
 M-1659: Remove deprecated Lanczos solver from spectral clustering in
 mr-legacy
 
 Ted Dunning
 ---
 M-1636: Class dependencies for Spark module are put in job.jar, which is
 inefficient
 
 Sebastian Schelter
 --
 M-1584: Create a detailed example of how to index an arbitrary dataset
 and
 run LDA on it(Patch available)
 
 Gokhan Capan
 --
 M-1626: Support for required quasi-algebraic operations and starting
 with
 aggregating rows/blocks
 
 Unassigned
 --
 M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
 available)
 M-1593: cluster-reuters.sh does not work complaining
 java.lang.IllegalStateException(Patch available)
 M-1557: Add support for sparse training vectors in MLP(Patch
 available)
 M-1516: run classify-20newsgroups.sh failed cause by
 /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
 available)
 M-1643: CLI arguments are not being processed in spark-shell
 M-1637: RecommenderJob of ALS fails in the mapper because it uses the
 instance of other class
 M-1634: ALS don't work when it adds new files in Distributed Cache
 (Patch available)
 M-1633: Failure to execute query when solr index contains documents with
 different fields
 M-1551: Add document to describe how to use mlp with command line
 (Patch
 available)
 
 On Thu, Mar 26, 2015 at 7:07 PM, Suneel Marthi suneel.mar...@gmail.com
 
 wrote:
 
 Ok here's the bug bash as of today
 
 Andrew Palumbo
 --
 M-1648: Update CMS for Mahout 0.10.0
 M-1638: H2O bindings fail at drmParallelizeWithRowLabels
 M-1564: Naive Bayes classifier for new Text Documents
 M-1635: Exception when providing

Re: Mahout 0.10.0 Bug bash

2015-03-27 Thread Shannon Quinn

Yes--removing the Lanczos solver from spectral clustering.

On 3/27/15 10:29 AM, Suneel Marthi wrote:

and this is for 0.10.0 ???

On Fri, Mar 27, 2015 at 10:27 AM, Shannon Quinn squ...@gatech.edu wrote:


Created M-1659 and assigned it to myself to reflect current work.

Shannon


On 3/26/15 10:07 PM, Suneel Marthi wrote:


Ok here's the bug bash as of today

Andrew Palumbo
--
M-1648: Update CMS for Mahout 0.10.0
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1564: Naive Bayes classifier for new Text Documents
M-1635: Exception when providing classification Labels
M-1493: Port Naive Bayes to Spark DSL
M-1559: Documentation and cleanup for Naive Bayes Example
M-1609: NullPointerException
M-1607: Spark-shell DAG scheduler

Andrew Musselman
-
M-1655: Refactor module dependencies
M-1563: cleanup Warnings during Build
M-1470: LDA Topic dump

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content
M-1618: co-occurence recommender example

Suneel Marthi
-
M-1586: Collections downloads must have hash signatures
M-1647: Release build
M-1652: Java 7 update
M-1512: Hadoop 2 compatibility
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1443: Update How to Release page
M-1585: Javadocs not hosted by Mahout-Quality
M-1612: NPE during JSON outputformatter for clusterdump

Stevo Slavic

M-1650: upgrade 3rd party jars
M-1602: Euclidean Distance Similarity Math
M-1278: Improve inheritance of apache parent pom

Shannon Quinn
---
M-1540: Reuters Example spectral clustering
Also online docs for Spectral clustering

Ted Dunning
---
M-1636: Class dependencies for Spark module are put in job.jar, which is
inefficient






[jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1659:
-

 Summary: Remove deprecated Lanczos solver from spectral clustering 
in mr-legacy
 Key: MAHOUT-1659
 URL: https://issues.apache.org/jira/browse/MAHOUT-1659
 Project: Mahout
  Issue Type: Task
  Components: Clustering, mrlegacy
Affects Versions: 0.9
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the Lanczos 
solver for dimensionality reduction. Remove the latter entirely.





Re: Mahout 0.10.0 Bug bash

2015-03-27 Thread Shannon Quinn

Created M-1659 and assigned it to myself to reflect current work.

Shannon

On 3/26/15 10:07 PM, Suneel Marthi wrote:

Ok here's the bug bash as of today

Andrew Palumbo
--
M-1648: Update CMS for Mahout 0.10.0
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1564: Naive Bayes classifier for new Text Documents
M-1635: Exception when providing classification Labels
M-1493: Port Naive Bayes to Spark DSL
M-1559: Documentation and cleanup for Naive Bayes Example
M-1609: NullPointerException
M-1607: Spark-shell DAG scheduler

Andrew Musselman
-
M-1655: Refactor module dependencies
M-1563: cleanup Warnings during Build
M-1470: LDA Topic dump

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content
M-1618: co-occurence recommender example

Suneel Marthi
-
M-1586: Collections downloads must have hash signatures
M-1647: Release build
M-1652: Java 7 update
M-1512: Hadoop 2 compatibility
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1443: Update How to Release page
M-1585: Javadocs not hosted by Mahout-Quality
M-1612: NPE during JSON outputformatter for clusterdump

Stevo Slavic

M-1650: upgrade 3rd party jars
M-1602: Euclidean Distance Similarity Math
M-1278: Improve inheritance of apache parent pom

Shannon Quinn
---
M-1540: Reuters Example spectral clustering
Also online docs for Spectral clustering

Ted Dunning
---
M-1636: Class dependencies for Spark module are put in job.jar, which is
inefficient





Re: [jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn

Are the slides from these talks going to be posted somewhere?

On 3/27/15 1:10 PM, Suneel Marthi wrote:

Different Topic: There's a talk this afternoon by Cloudera's Data Scientist
at MlConf NYC about Mahout's LanczosSolver, SSVD and MlLib SSVD.

See http://mlconf.com/mlconf-2015-nyc/

Here we have been talking about purging Mahout's LanczosSolver for 2 years now.
Seems like the talk will be about the old MapReduce based SSVD and
LanczosSolver while we have
the new non-MR distributed SSVD stuff. I hope I am wrong here but will see.

On Fri, Mar 27, 2015 at 1:02 PM, Shannon Quinn squ...@gatech.edu wrote:


Honestly not sure, as I haven't had a chance to play around with the scala
dsl much yet. Suneel suggested we save that for 0.10.1.


On 3/27/15 12:00 PM, Dmitriy Lyubimov wrote:


Shannon,

How difficult would it be to port spectral clustering to our scala alg and
math? We have ssvd there as well.
On Mar 27, 2015 7:26 AM, Shannon Quinn (JIRA) j...@apache.org wrote:

  Shannon Quinn created MAHOUT-1659:

-

   Summary: Remove deprecated Lanczos solver from spectral
clustering in mr-legacy
   Key: MAHOUT-1659
   URL: https://issues.apache.org/jira/browse/MAHOUT-1659
   Project: Mahout
Issue Type: Task
Components: Clustering, mrlegacy
  Affects Versions: 0.9
  Reporter: Shannon Quinn
  Assignee: Shannon Quinn
  Priority: Minor
   Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the
Lanczos solver for dimensionality reduction. Remove the latter entirely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)






Re: [jira] [Created] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-27 Thread Shannon Quinn
Honestly not sure, as I haven't had a chance to play around with the 
scala dsl much yet. Suneel suggested we save that for 0.10.1.


On 3/27/15 12:00 PM, Dmitriy Lyubimov wrote:

Shannon,

How difficult would it be to port spectral clustering to our scala alg and
math? We have ssvd there as well.
On Mar 27, 2015 7:26 AM, Shannon Quinn (JIRA) j...@apache.org wrote:


Shannon Quinn created MAHOUT-1659:
-

  Summary: Remove deprecated Lanczos solver from spectral
clustering in mr-legacy
  Key: MAHOUT-1659
  URL: https://issues.apache.org/jira/browse/MAHOUT-1659
  Project: Mahout
   Issue Type: Task
   Components: Clustering, mrlegacy
 Affects Versions: 0.9
 Reporter: Shannon Quinn
 Assignee: Shannon Quinn
 Priority: Minor
  Fix For: 0.10.0


Spectral clustering still has the option of using either SSVD or the
Lanczos solver for dimensionality reduction. Remove the latter entirely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)





Re: 0.10 release Hangout

2015-03-23 Thread Shannon Quinn
Will be teaching until 9:30 PT, at which point I have another meeting 
until 11. Would love to get a summary of the meeting; also happy to help 
with some of the tasks.


Shannon

On 3/23/15 3:56 PM, Andrew Musselman wrote:

We'll be getting on a Google Hangout tomorrow, Tuesday, from 9-11 a.m.
Pacific, to work through open questions for what should be in the release,
go through Jira, and do some delegation of tasks.

Here's the Hangout URL
https://plus.google.com/hangouts/_/calendar/YW5kcmV3Lm11c3NlbG1hbkBnbWFpbC5jb20.glvu1gfv3kvj5241n9bsg3clrc

See you then!





Re: Release

2015-03-17 Thread Shannon Quinn

+1

On 3/17/15 8:19 PM, Andrew Musselman wrote:

How about 0.10 is the first block and 0.10.1 is the second?

On Wed, Mar 18, 2015 at 1:12 AM, Andrew Palumbo ap@outlook.com wrote:


I like this timeline... though mid April is coming up quickly.. Going back
to Pat's list for 0.10.0:

  1) refactor mrlegacy out of scala deps.

2) build fixes for release.
3) docs — might be good to guinea-pig the new CMS with git pubsub so we
don’t have to do svn, not sure when that will be ready


I would add:

  4) Fix any remaining legacy bugs.

5) docs, docs, docs


along with just some general cleanup.

Is anything else missing?




On 03/17/2015 07:16 PM, Andrew Musselman wrote:


I'm good with that timing pending scope..

On Wed, Mar 18, 2015 at 12:13 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

  i was thinking 0.10.0 mid-april, update 0.10.1 end of spring.

   i would suggest feature extraction topics for 0.11.x. Esp. w.r.t.
SchemaRDD aka DataFrame -- vectorizing, hashing, ML schema support,
imputation of missing data, outlier cleanups etc. There's a lot.

Hardware backs integration -- i will certainly be looking at those,
but perhaps the easiest is to start with automatic detection and
configuration of capabilities via netlib, since it is already in the
path and it seems likely that it will (eventually) support cuda as
well in some form. This is for 0.11 or 0.12.x, depends on
availability.

Higher order methods are somewhat a matter of inspiration. I think i
could offer some stuff there too as I already have implemented a lot
of those on top of Mahout before. I did bayesian optimization (aka
spearmint, GP-EI etc.) on Mahout algebra, line search, (L)bfgs,
stats including Gaussian Process support. BFGS and line search are
fairly simple methods and i will give a reference if anybody is
interested. also, breeze also has line search with strong wolfe
conditions (if a coded reference is needed). All that is up for grabs
as a fairly well understood subject.

(5-6 months out) Once GP-EI is available, it becomes a fairly
interesting topic to resurrect implicit feedback issue. Important
insight there is that in fact feature encoding can be done by a custom
scheme (not necessarily using encoding scheme done in paper; in fact,
there are 2 of them there; or the way mllib encodes that as well).
once custom encoding schemes are adjusted, using bayesian optimization
is increasingly important, especially if there are more than just 2
parameters there.






Re: Release

2015-03-17 Thread Shannon Quinn
I think we need a better idea of what the release will contain, then we 
can start narrowing the range of possible release dates.


If we take what Pat outlined, an April release might be somewhat 
ambitious but we probably wouldn't miss by much.


On 3/17/15 11:51 AM, David Starina wrote:

Hi guys,

Do you have any specific release date in mind? Guys at Bigtop are planning
an april release, is there any chance there will be a Hadoop 2.x compatible
Mahout release by then to be included with Bigtop?



On Sunday, March 15, 2015, Pat Ferrel p...@occamsmachete.com wrote:


Lots of discussion off the record about doing a release but shouldn’t we
plan this?

What has to be in a release of Mahout 0.10?

Seems like we could release as-is but it would be nice to have some of the
already completed work that isn’t committed yet:
* mrlegacy refactored out of scala, is it possible to get this in Dmitriy?

One question is how to package, with which version of Spark. There is a
bug in Spark 1.2.1 and I think in 1.2 (this is the big distro build) that
requires any class that uses the JavaSerializer to set a specific SparkConf
key/value to point to the guava jar on all workers. This only affects
IndexedDatasets since they use Guava’s BiMap. Rumor has it that 1.3 fixes
this but I haven’t tried it yet.

So we are currently stuck on 1.1.1 but could document how to work around
to use 1.2 for a user who wants to build Mahout from scratch. A user
source build on 1.3 may not require a work around. We seem to be good on
hadoop 2.x, which in itself is a good reason to release since 0.9 was not.

What else needs to be done:
* rename module math-scala to core?
* create the distribution build. Currently this does not publish the
scaladocs and does not create artifacts for H2O or and Scala.
* is H2O really in a form to publish?

Docs
* IMO we should name the Mahout Spark-Scala DSL and shell. More unique
names are easier to find in searches. Maybe Suneel can polish off his
sanskrit and suggest something.
* we should be ready to do some work here to restructure the CMS since it
is very 0.9 centric with Scala stuff almost an afterthought.




Re: Codebase refactoring proposal

2015-01-23 Thread Shannon Quinn
Also +1

iPhone'd

 On Jan 23, 2015, at 18:38, Andrew Palumbo ap@outlook.com wrote:
 
 +1
 
 
 Sent from my Verizon Wireless 4G LTE smartphone
 
 Original message
 From: Dmitriy Lyubimov dlie...@gmail.com
 Date: 01/23/2015 6:06 PM (GMT-05:00)
 To: dev@mahout.apache.org
 Subject: Codebase refactoring proposal
 
 So right now mahout-spark depends on mr-legacy.
 I did quick refactoring and it turns out it only _irrevocably_ depends on
 the following classes there:
 
 MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
 *sigh* o.a.m.common.Pair
 
 So I just dropped those five classes into a new tiny mahout-hadoop
 module (to signify stuff that is directly relevant to serializing things to
 DFS API) and completely removed mrlegacy and its transients from spark and
 spark-shell dependencies.
 
 So non-cli applications (shell scripts and embedded api use) actually only
 need spark dependencies (which come from SPARK_HOME classpath, of course)
 and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
 optionally mahout-spark-shell (for running shell)).
 
 This of course still doesn't address driver problems that want to throw
 more stuff into front-end classpath (such as cli parser) but at least it
 renders transitive luggage of mr-legacy (and the size of worker-shipped
 jars) much more tolerable.
 
 How does that sound?


Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Shannon Quinn

Saikat,

Spark has the cartesian() method that will align all pairs of points; 
that's the nontrivial part of determining an RBF kernel. After that it's 
a simple matter of performing the equation that's given on the 
scikit-learn doc page.


However, like you said it'll also have to be implemented using the 
Mahout DSL. I can envision that users would like to compute pairwise 
metrics for a lot more than just RBF kernels (pairwise Euclidean 
distance, etc), so my guess would be a DSL implementation of cartesian() 
is what you're looking for. You can build the other methods on top of that.
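
A minimal sketch of that idea, in plain Python rather than Spark or the Mahout DSL: itertools.product stands in for cartesian(), and the kernel is exp(-gamma * ||x - y||^2) as given on the scikit-learn page. The names pairwise and rbf_kernel, the sample points, and the gamma value are assumptions made up for illustration; none of them are existing Mahout APIs.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel between two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise(points, metric):
    """Apply `metric` to every ordered pair of points.

    product(..., repeat=2) enumerates all pairs, which is the role
    cartesian() would play on a distributed collection.
    """
    n = len(points)
    result = [[0.0] * n for _ in range(n)]
    for (i, x), (j, y) in product(enumerate(points), repeat=2):
        result[i][j] = metric(x, y)
    return result


# Hypothetical sample data, only to exercise the functions.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
affinity = pairwise(points, lambda x, y: rbf_kernel(x, y, gamma=0.5))
```

Passing a different function as metric (pairwise Euclidean distance, say) reuses the same pairing step, which is the argument for exposing the pairing as its own operation.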


Correct me if I'm wrong.

Shannon

On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
I need to implement the above in the scala world and expose a DSL API to call 
the computation when computing the affinity matrix.


From: ted.dunn...@gmail.com
Date: Thu, 18 Sep 2014 10:04:34 -0700
Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
shapes
To: dev@mahout.apache.org

There are number of non-traditional linear algebra operations like this
that are important to implement.

Can you describe what you intend to do so that we can discuss the shape of
the API and computation?



On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal sxk1...@hotmail.com
wrote:


Dmitriy et al, as part of the above JIRA I need to calculate the Gaussian
kernel between 2 shapes. I looked through mahout-math-scala and didn't see
anything to do this, any objections to me adding some code under
scalabindings to do this?
Thanks in advance.






Re: Affinity matrix computation

2014-09-13 Thread Shannon Quinn
Since it's an input processing method--rather than strictly an algorithm 
in the category of SVMs, K-means, etc--and since you're early in the 
development cycle, wherever makes it easiest is probably best for now. 
We can always merge it elsewhere once you're ready to submit a PR.


Of course someone please correct me if I'm mistaken.

On 9/13/14, 1:14 PM, Saikat Kanjilal wrote:

Hi Committers, I'm beginning some work on the affinity matrix computation in 
mahout-dsl. I was wondering where in the directory structure I should put this 
effort; are we placing all our algorithms in mahout-dsl in a specific 
area? Thanks in advance.




Re: Upgrade to spark 1.0.x

2014-08-08 Thread Shannon Quinn

+1

On 8/8/14, 3:58 PM, Suneel Marthi wrote:

+1


On Fri, Aug 8, 2014 at 3:48 PM, Ted Dunning ted.dunn...@gmail.com wrote:


+1 to merge




On Fri, Aug 8, 2014 at 12:36 PM, Gokhan Capan gkhn...@gmail.com wrote:


+1 to merging spark-1.0.x to master

Sent from my iPhone


On Aug 8, 2014, at 22:06, Dmitriy Lyubimov dlie...@gmail.com wrote:

Current master is still at Spark 0.9.x . MAHOUT-1603 (PR #40) is making a
number of valuable tweaks to enable Spark 1.0.x and (Spark SQL code, by
extension. I did a quick test, SQL seems to work for my simple tests in
Mahout environment).

This squashed PR is pushed to apache/mahout branch spark-1.0.x rather than
master. Whenever (if) folks are ready, i can merge it to the master.

Alternative approach would be to maintain both 1.0.x and 0.9.x branches for
some time. I don't see it as valuable as the costs would likely overrun any
benefit here, but if anyone still clings to spark 0.9.x dependency, please
let me know in this thread.

thanks.
-d




Re: Git Migration

2014-05-22 Thread Shannon Quinn

Works for me.

Shannon

On 5/22/14, 3:45 PM, Gokhan Capan wrote:

Works for me as well

Gokhan


On Thu, May 22, 2014 at 9:23 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


Thanks; I just pushed successfully.


On Thu, May 22, 2014 at 10:55 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

did you read Jake's email earlier at dev/infra discussion? he describes and
makes references here.

It is two-fold: first we can push whatever commits to master of
https://git-wip-us.apache.org/repos/asf?p=mahout.git

However the other side of the coin is that significant commits should go
thru pull requests directly to (if i understand it correctly) apache/mahout
mirror on github. Such pull requests are managed thru commits to git-wp as
well by specific messages (again, see references in Jake's email). My
understanding is that github integration features are not yet enabled, only
commits to master of git-wp-us.a.o are at this point.

At this point I simply would like everyone to verify they can push commits
to master branch of git-wp-us.a.o per instructions in INFRA- and report
back there (I can push).

I guess someone (perhaps me) will have to write the manual for working with
github pull requests (mainly, merging them to git-wp-us.o.a and closing
them).


On Thu, May 22, 2014 at 10:47 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:


What's the workflow to commit a change?  I'm totally in the dark about
that.


On Thu, May 22, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

Hi,

(1) git migration of the project is now complete. Any volunteers to verify
per INFRA-? If you do, please report back to the issue.

(2) Anybody knows what to do with jenkins now? i still don't have proper
privileges on it. thanks.



[1] https://issues.apache.org/jira/browse/INFRA-





Re: Proposal for additional features in Mahout (minkowski Distance, mahalobnis Distance and K-nearest neighbor classifier)

2014-05-18 Thread Shannon Quinn

Hi Arunav,

Contributions are certainly welcome. If you can post a patch on JIRA ( 
https://issues.apache.org/jira/browse/MAHOUT ), we can have a look at 
it. I don't know if you've been monitoring our mailing lists or have 
otherwise heard, but Mahout is no longer accepting new MapReduce code. 
We're still in discussions regarding the next-generation Mahout 
backends, but we're moving instead towards engine-agnostic (e.g. Mahout 
DSL, see http://mahout.apache.org/users/sparkbindings/home.html ) 
implementations.


As for Minkowski distance, I'm not sure if someone else is working on 
it, but as I mentioned you're welcome to post a patch and we can discuss 
it from there. Thanks!


Shannon

On 5/18/14, 1:29 PM, Arunav Sanyal wrote:

Hi

I am new to apache mahout and would like to contribute in whatever humble
way I can.

I see that the Vector class in Apache Mahout does not have the
functionality of minkowski distance.

http://en.wikipedia.org/wiki/Minkowski_distance

is a distance metric which generalizes distance measures between any two
vectors. It can represent hamming distance, euclidean distance depending on
parameters. I already have a simple solution ready for review if this is
approved. Similarly I am working on the more generic Mahalanobis distance
measure.
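
For reference, a minimal sketch of the Minkowski distance described above, in plain Python. It is not the proposed patch and not Mahout's Vector API; the function name and the sample vectors are assumptions for illustration. With p = 2 it gives the Euclidean distance, with p = 1 the Manhattan distance, and on 0/1 vectors p = 1 counts the differing positions, which matches the Hamming distance.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for this to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Hypothetical sample vectors.
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=2)
manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=1)
hamming_like = minkowski_distance((1, 0, 1, 1), (0, 0, 1, 0), p=1)
```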

My primary motive for introducing these distance measures is to come up
with a generic implementation of the K-nearest neighbor classifier (not to
be confused with K-means clustering). I will be working on that as well shortly.

If somebody else is working towards these features, I would like to
collaborate and contribute whatever code patches they deem necessary. If
not, I humbly request that the community approve these for inclusion in
Apache Mahout.


Yours sincerely
Arunav Sanyal




Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-16 Thread Shannon Quinn
+1

iPhone'd

 On May 16, 2014, at 14:46, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 +1
 
 
 On Fri, May 16, 2014 at 11:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 Hi,
 
 I would like to initiate a procedural vote on moving to git as our primary
 commit system, and using GitHub PRs as described in Jake Farrell's email to
 @dev [1]
 
 [1]
 
 https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
 
 If voting succeeds, I will file a ticket with infra to commence the necessary
 changes and to move our project to git-wp as the primary source for commits, as
 well as to add the GitHub integration features [1]. (I assume pure git commits
 will be required after that's done, with no svn commits allowed).
 
 The motivation is to use the git and GitHub PR features as described, and to
 avoid git mirror history messes like the ones we've seen associated with
 authors.txt file fluctuations.
 
 PMC and committers have binding votes, so please vote. Lazy consensus with a
 minimum of 3 +1 votes. The vote will conclude in 96 hours to allow some extra
 time for the weekend (i.e. Tuesday afternoon PST).
 
 here is my +1
 
 -d
 


Re: consensus statement?

2014-05-06 Thread Shannon Quinn
+1

iPhone'd

 On May 6, 2014, at 12:23, Ted Dunning ted.dunn...@gmail.com wrote:
 
 I have been involved in side conversations to try to build a bit of unity
 among our community and would like to propose this as a statement of what
 we are doing:
 
 
 Apache Mahout is moving immediately to a faster execution model. The first
 of these is Spark. Outside contributions are always encouraged.
 
 
 As a bit of commentary, it is clear that what the committers are working on
 is Spark and it is clear that Spark will be the first new platform for
 Mahout.  It is also clear that there are non-committers (the 0xdata crew
 for one) who are working with the community to extend Mahout beyond just
 Spark.  As a statement of where the community is *right* now, however, I
 don't think we need to say much more than that we encourage contributions.
 
 Sound fair?  Correct?


[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-03 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988666#comment-13988666
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

If no one has any objections in the next couple of days, I can close this 
ticket.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2014-05-02 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1538:
-

 Summary: Port spectral clustering to Mahout DSL
 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
(already ported) and K-means (currently in progress, or can use Spark MLlib 
implementation as a temporary fix).
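
To make the dependency chain a little more concrete: in the usual formulation of spectral clustering, the affinity matrix A is first normalized as D^{-1/2} A D^{-1/2} (D being the diagonal matrix of row sums), a decomposition such as SSVD then extracts the leading eigenvectors, and K-means runs on the rows of that eigenvector matrix. The Python sketch below covers only the normalization step and is an illustration under those assumptions, not the Mahout implementation; the function name is hypothetical.

```python
import math


def normalized_laplacian(affinity):
    """Return D^{-1/2} A D^{-1/2} for a symmetric affinity matrix A.

    D is the diagonal degree matrix (row sums of A). Rows with zero
    degree are left as zeros to avoid division by zero.
    Hypothetical helper for illustration; not taken from Mahout.
    """
    degrees = [sum(row) for row in affinity]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(affinity)
    # Scale entry (i, j) by 1/sqrt(d_i) and 1/sqrt(d_j).
    return [
        [inv_sqrt[i] * affinity[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
```

The output of this step is what the decomposition (SSVD in the ticket above) would consume, which is why both SSVD and K-means show up as dependencies.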





[jira] [Created] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

2014-05-02 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1539:
-

 Summary: Implement affinity matrix computation in Mahout DSL
 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


This has the same goal as MAHOUT-1506 
(https://issues.apache.org/jira/browse/MAHOUT-1506), but rather than code the 
pairwise computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.





[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

2014-05-02 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1539:
--

Description: 
This has the same goal as MAHOUT-1506, but rather than code the pairwise 
computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.

  was:
This has the same goal as MAHOUT-1506 
(https://issues.apache.org/jira/browse/MAHOUT-1506), but rather than code the 
pairwise computations in MapReduce, this will be done in the Mahout DSL.

An orthogonal issue is the format of the raw input (vectors, text, images, 
SequenceFiles), and how the user specifies the distance equation and any 
associated parameters.


 Implement affinity matrix computation in Mahout DSL
 ---

 Key: MAHOUT-1539
 URL: https://issues.apache.org/jira/browse/MAHOUT-1539
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 This has the same goal as MAHOUT-1506, but rather than code the pairwise 
 computations in MapReduce, this will be done in the Mahout DSL.
 An orthogonal issue is the format of the raw input (vectors, text, images, 
 SequenceFiles), and how the user specifies the distance equation and any 
 associated parameters.





[jira] [Created] (MAHOUT-1540) Reuters example for spectral clustering

2014-05-02 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1540:
-

 Summary: Reuters example for spectral clustering
 Key: MAHOUT-1540
 URL: https://issues.apache.org/jira/browse/MAHOUT-1540
 Project: Mahout
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
spectral clustering using the Reuters dataset.





[jira] [Updated] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-02 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1441:
--

Attachment: MAHOUT-1441.diff

Update on the documentation. It specifies a brief overview of spectral 
clustering theory (with a link to further reading), a guide for how to run the 
algorithm in Mahout, and a small toy example. Also linked are the outstanding 
issues for improving the algorithm and what those changes will be.

Ready to commit.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.





[jira] [Commented] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2014-05-02 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988137#comment-13988137
 ] 

Shannon Quinn commented on MAHOUT-1538:
---

That's fine, though until k-means is fully ported in Mahout this will remain 
incomplete. I was thinking of Spark as more of a drop-in temp replacement until 
the former is complete (unless it already is and I missed it?).

 Port spectral clustering to Mahout DSL
 --

 Key: MAHOUT-1538
 URL: https://issues.apache.org/jira/browse/MAHOUT-1538
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
 (already ported) and K-means (currently in progress, or can use Spark MLlib 
 implementation as a temporary fix).





[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-02 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988167#comment-13988167
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

Published the new content. All seems well except for the inline latex; what's 
the correct syntax?

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.





[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-02 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988190#comment-13988190
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

Yep, realized that a few moments ago. Thanks, both of you. It should be good 
now.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0

 Attachments: MAHOUT-1441.diff


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.





[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-01 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986977#comment-13986977
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

It's in progress. Is there a deadline for this? I was hoping to finish it next 
week.

However, I do have a couple of questions. Obviously the Eigencuts docs will be 
stripped out entirely, but there are still other components that need to be 
added for the full pipeline to function: a DSL-based affinity matrix input, and 
a working example on the Reuters dataset. Should these items be completed 
*first*, or should I just leave notes in the documentation pointing to the JIRA 
tickets for these issues? If the latter, the documentation just needs some 
basic cleaning up and can be done pretty quickly, albeit without specifics on 
how aspects of it actually work in practice. If the former, I'll need a little 
more time to port the algorithm to Mahout DSL.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.





Re: CMS still not working

2014-04-25 Thread Shannon Quinn
Broken for me on Chrome in OS X. Noticed that MathJax was also broken on 
other pages (e.g. Spark & Scala) in the same environment.


On 4/25/14, 12:06 PM, Andrew Musselman wrote:

Broken for me in Chrome on Ubuntu.


On Fri, Apr 25, 2014 at 9:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:


Hm... mathjax doesn't render correctly for me. Is it just me, or is this
also broken now? https://mahout.apache.org/users/dim-reduction/ssvd.html


On Fri, Apr 25, 2014 at 1:40 AM, Sebastian Schelter s...@apache.org
wrote:


Fyi: filed a ticket with infra as our CMS is still not working...

https://issues.apache.org/jira/browse/INFRA-7628





Re: CMS still not working

2014-04-25 Thread Shannon Quinn

Hmm. Here's what I see on Naive Bayes:

https://dl.dropboxusercontent.com/u/1377610/nb.png

Here's what I see on SSVD (under *https*):

https://dl.dropboxusercontent.com/u/1377610/ssvd_https.png

And here's SSVD under *http*. Looks fine! (NB looks the same either way 
for me, though)


https://dl.dropboxusercontent.com/u/1377610/ssvd_http.png

Chrome on OS X.

On 4/25/14, 12:28 PM, Suneel Marthi wrote:

I remember something like that; evidently this issue is only with the Naive
Bayes page. You could compare Naive Bayes with SSVD to see what's missing.



On Fri, Apr 25, 2014 at 12:24 PM, ap.dev ap@outlook.com wrote:


@dimitri you said something once about having to double escape Mathjax
formatted lines.  I didn't do this in the markdown editor I was using for
the Naive Bayes page.  Maybe that has something to do with it?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Suneel Marthi smar...@apache.org
Date:04/25/2014  12:20 PM  (GMT-05:00)
To: mahout dev@mahout.apache.org
Subject: Re: CMS still not working

SSVD page renders fine and so do others except for Naive Bayes (on MacOS
with all browsers - Chrome, Safari, Firefox, Opera).

It couldn't be a mathjax issue, some weird tag or something on Naive Bayes
page??


On Fri, Apr 25, 2014 at 12:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

it's strange. ubuntu is all i ever used and I swear it was working just
last week. i wonder if mathjax guys did something that broke it, perhaps
in the light of recent heartbleed bugs. javascript seems to be in place.


On Fri, Apr 25, 2014 at 9:09 AM, ap.dev ap@outlook.com wrote:


Mathjax formatting looks good on Firefox from a windows machine for
scala spark bindings page.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Andrew Musselman andrew.mussel...@gmail.com
Date:04/25/2014  12:06 PM  (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: CMS still not working

Broken for me in Chrome on Ubuntu.


On Fri, Apr 25, 2014 at 9:02 AM, Dmitriy Lyubimov dlie...@gmail.com
wrote:


Hm... mathjax doesn't render for me correctly. Is it just me or this is
also now broken?

https://mahout.apache.org/users/dim-reduction/ssvd.html


On Fri, Apr 25, 2014 at 1:40 AM, Sebastian Schelter s...@apache.org
wrote:


Fyi: filed a ticket with infra as our CMS is still not working...

https://issues.apache.org/jira/browse/INFRA-7628





Re: Welcome Pat Ferrel as new committer on Mahout

2014-04-24 Thread Shannon Quinn
Congratulations Pat! Been enjoying your discussions so far. Looking 
forward to working with you.


On 4/24/14, 6:22 AM, Frank Scholten wrote:

Congratulations Pat! :-)

On Apr 24, 2014, at 12:19, Sebastian Schelter s...@apache.org wrote:


Hi,

this is to announce that the Project Management Committee (PMC) for Apache 
Mahout has asked Pat Ferrel to become committer and we are pleased to announce 
that he has accepted.

Being a committer enables easier contribution to the project since in addition 
to posting patches on JIRA it also gives write access to the code repository. 
That also means that now we have yet another person who can commit patches 
submitted by others to our repo *wink*

Pat, we look forward to working with you in the future. Welcome! It would be 
great if you could introduce yourself with a few words.

-s




[jira] [Commented] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-18 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974008#comment-13974008
 ] 

Shannon Quinn commented on MAHOUT-1506:
---

That's fine. This still needs to get done, but I'll open up another ticket 
specifying the Scala DSL instead.

 Creation of affinity matrix for spectral clustering
 ---

 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn
 Fix For: 1.0


 I wanted to get this discussion going, since I think this is a critical 
 blocker for any kind of documentation update on spectral clustering (I can't 
 update the documentation until the algorithm is useful, and it won't be 
 useful until there's a built-in method for converting raw data to an affinity 
 matrix).
 Namely, I'm wondering what kind of raw data should this algorithm be 
 expecting (anything that k-means expects, basically?), and what are the data 
 structures associated with this? I've created a proof-of-concept for how 
 pairwise affinity generation could work.
 https://github.com/magsol/Hadoop-Affinity
 It's a two-step job, but if the input data format provides 1) the total 
 number of data points and 2) each data point's index in the overall set, 
 then the first job can be scrapped entirely and affinity generation will 
 consist of a single MR task.
 (discussions on Spark / H2O pending, of course)
 Mainly this is an engineering problem at this point. Let me know your 
 thoughts and I'll get this done (I'm out of town the next 10 days for my 
 wedding/honeymoon, will get to this on my return).





[jira] [Created] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-03 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1506:
-

 Summary: Creation of affinity matrix for spectral clustering
 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn


I wanted to get this discussion going, since I think this is a critical blocker 
for any kind of documentation update on spectral clustering (I can't update the 
documentation until the algorithm is useful, and it won't be useful until 
there's a built-in method for converting raw data to an affinity matrix).

Namely, I'm wondering what kind of raw data should this algorithm be 
expecting (anything that k-means expects, basically?), and what are the data 
structures associated with this? I've created a proof-of-concept for how 
pairwise affinity generation could work.

https://github.com/magsol/Hadoop-Affinity

It's a two-step job, but if the input data format provides 1) the total 
number of data points and 2) each data point's index in the overall set, 
then the first job can be scrapped entirely and affinity generation will 
consist of a single MR task.
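
To show what the pairwise computation amounts to once the point count and per-point indices are known, here is a small single-machine Python sketch. It is not the Hadoop-Affinity code linked above: the Gaussian (RBF) kernel, the `sigma` parameter, the zeroed diagonal, and the function name are all assumptions made for illustration, since the ticket leaves the distance equation open.

```python
import math


def affinity_matrix(points, sigma=1.0):
    """Dense pairwise affinity matrix using a Gaussian (RBF) kernel.

    Assumption for this sketch:
        affinity(i, j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    with the diagonal set to 0. The kernel choice and the names are
    illustrative and not taken from Mahout.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the matrix is symmetric.
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            value = math.exp(-sq_dist / (2.0 * sigma ** 2))
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix
```

Each entry depends only on the pair (i, j), which is what makes a single pass feasible when every point already carries its index: the work can be split by index pairs without a preliminary counting job.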

(discussions on Spark / H2O pending, of course)

Mainly this is an engineering problem at this point. Let me know your thoughts 
and I'll get this done (I'm out of town the next 10 days for my 
wedding/honeymoon, will get to this on my return).





[jira] [Assigned] (MAHOUT-1473) Cleanup website on Spectral Clustering

2014-03-22 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn reassigned MAHOUT-1473:
-

Assignee: Shannon Quinn

 Cleanup website on Spectral Clustering
 --

 Key: MAHOUT-1473
 URL: https://issues.apache.org/jira/browse/MAHOUT-1473
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation
Reporter: Sebastian Schelter
Assignee: Shannon Quinn
 Fix For: 1.0


 The website on spectral clustering needs clean up. We need to go through the 
 text, remove dead links and check whether the information is still consistent 
 with the current code.
 https://mahout.apache.org/users/clustering/spectral-clustering.html





[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-03-09 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925272#comment-13925272
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

The experiment section of that paper would be fairly straightforward to 
reproduce, and I do agree that we should do that. However, the advantage of 
the Reuters dataset is that most of the other algorithms also use it as an 
example of how the algorithm works in the first place, which makes it possible 
to compare one algorithm to another on the same dataset. My impression is that 
whether or not the algorithm is well-suited to the Reuters dataset, though 
certainly important, is secondary to being able to compare multiple Mahout 
algorithms on the same dataset. The hard part with spectral clustering is 
designing the initial affinity matrix from the Reuters data.

 Add documentation for Spectral KMeans to Mahout Website
 ---

 Key: MAHOUT-1441
 URL: https://issues.apache.org/jira/browse/MAHOUT-1441
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0
Reporter: Suneel Marthi
Assignee: Shannon Quinn
 Fix For: 1.0


 Need to update the Website with Design, user guide and any relevant 
 documentation for Spectral KMeans clustering.





Re: Mahout 0.9 Release

2014-01-29 Thread Shannon Quinn

LGTM

On 1/29/14, 4:27 PM, peng wrote:

+1, can't see a bad side.

On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote:

+1 from me





On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter 
s...@apache.org wrote:


+1


On 01/29/2014 05:25 AM, Andrew Musselman wrote:

Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com 
wrote:



a), b), c), d) all passed here.

CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were
within the range [0,1].


Date: Tue, 28 Jan 2014 16:45:42 -0800
From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: u...@mahout.apache.org; dev@mahout.apache.org

Fixed the issues that were reported with Clustering code this past week,
upgraded codebase to Lucene 4.6.1 that was released today.


Here's the URL for the 0.9 release in staging:-

https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/ 



The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script.


Need a minimum of 3 '+1' votes from PMC for the release to be finalized.









Re: cluster-reuters.sh broken in trunk

2014-01-24 Thread Shannon Quinn
Does Mahout still support Hadoop 0.20.2x? I know we had some discussions on 
this but I can't find them at the moment. 

iPhone'd

 On Jan 24, 2014, at 16:43, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 I assume u r running this in MR mode??  Could u clear up your 
 /tmp/mahout-work- folder and try again.
 
 
 
 
 On Friday, January 24, 2014 1:56 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Actually, getting the same error with a fresh svn checkout:
 
 14/01/24 09:42:13 INFO driver.MahoutDriver: Program took 291353 ms
 (Minutes: 4.8558834)
 Running on hadoop, using /home/akm/hadoop-0.20.205.0/bin/hadoop and
 HADOOP_CONF_DIR=
 MAHOUT-JOB:
 /home/akm/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
 14/01/24 09:42:16 INFO common.AbstractJob: Command line arguments:
 {--clustering=null,
 --clusters=[/tmp/mahout-work-akm/reuters-kmeans-clusters],
 --convergenceDelta=[0.5],
 --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
 --endPhase=[2147483647],
 --input=[/tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/],
 --maxIter=[10], --method=[mapreduce], --numClusters=[20],
 --output=[/tmp/mahout-work-akm/reuters-kmeans], --overwrite=null,
 --startPhase=[0], --tempDir=[temp]}
 14/01/24 09:42:17 INFO common.HadoopUtil: Deleting
 /tmp/mahout-work-akm/reuters-kmeans-clusters
 14/01/24 09:42:17 WARN util.NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/01/24 09:42:17 INFO compress.CodecPool: Got brand-new compressor
 14/01/24 09:42:17 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to
 /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed
 14/01/24 09:42:17 INFO kmeans.KMeansDriver: Input:
 /tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors
 Clusters In: /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed
 Out: /tmp/mahout-work-akm/reuters-kmeans Distance:
 org.apache.mahout.common.distance.CosineDistanceMeasure
 14/01/24 09:42:17 INFO kmeans.KMeansDriver: convergence: 0.5 max
 Iterations: 10
 14/01/24 09:42:17 INFO compress.CodecPool: Got brand-new decompressor
 Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /tmp/mahout-work-akm/reuters-kmeans-clusters/part-randomSeed. Check your -c argument.
     at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:212)
     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:143)
     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:601)
     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:601)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 
 
 
 
 On Fri, Jan 24, 2014 at 10:07 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Yeah, disregard, my repo was out of whack.
 
 
 On Fri, Jan 24, 2014 at 10:00 AM, ap.dev ap@outlook.com wrote:
 
 I'm not getting any exceptions there.
 
  Original message 
 From: Andrew Musselman andrew.mussel...@gmail.com
 Date:01/24/2014  11:38 AM  (GMT-05:00)
 To: dev@mahout.apache.org
 Subject: cluster-reuters.sh broken in trunk
 
 Last night I had this issue when testing out cluster-reuters.sh with no
 flags; anyone seen this recently?
 
 14/01/23 22:03:54 INFO driver.MahoutDriver: Program took 286799 ms
 (Minutes: 4.7799833)
 Running on hadoop, using /home/akm/hadoop-0.20.205.0/bin/hadoop and
 HADOOP_CONF_DIR=
 MAHOUT-JOB:
 /home/akm/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
 14/01/23 22:03:57 INFO common.AbstractJob: Command line arguments:
 {--clustering=null,
 --clusters=[/tmp/mahout-work-akm/reuters-kmeans-clusters],
 --convergenceDelta=[0.5],
 
 --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
 --endPhase=[2147483647],
 
 --input=[/tmp/mahout-work-akm/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/],
 --maxIter=[10], --method=[mapreduce], --numClusters=[20],
 --output=[/tmp/mahout-work-akm/reuters-kmeans], --overwrite=null,
 --startPhase=[0], --tempDir=[temp]}
 14/01/23 

Re: MAHOUT 0.9 Release - New URL

2014-01-16 Thread Shannon Quinn
a), b), and c) all pass for me. Don't have the setup yet at work to go 
through d), will wait for others to verify.


On 1/16/14, 9:41 AM, Suneel Marthi wrote:

Third time's a Charm!!!


Here's the new URL for Mahout 0.9 Release:
https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/

For those volunteering to test this, some of the things to be verified:

a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through 
all the different options in each script.
  


Committers and PMC members:
---

Need 'at least 3 +1 votes' for the Release to pass.


Thanks and Regards.





Re: MAHOUT 0.9 Release - New URL

2014-01-16 Thread Shannon Quinn

OS X 10.9.1, java version 1.6.0_65.

On 1/16/14, 10:41 AM, Sergey Svinarchuk wrote:

I tested mahout 0.9 on Ubuntu 12.04 64bit, java version 1.6.0_27

a) Verify that u can unpack the release (tar or zip) - passed
b) Verify u r able to compile the distro - passed
c)  Run through the unit tests: mvn clean test -passed
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script. - will update later


On Thu, Jan 16, 2014 at 5:35 PM, Sotiris Salloumis i...@eprice.gr wrote:


Hi Suneel,

Below first round of tests,

Environment: SMP Debian 3.2.51-1 x86_64
Machine: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz stepping 05, 12GB RAM
OpenJDK: javac 1.6.0_27

a) Verify that u can unpack the release (tar or zip)  [ Passed: tar -zxvf ]
b) Verify u r able to compile the distro  [ Passed: With OpenJDK, latest
Maven on latest Debian ]
c)  Run through the unit tests: mvn clean test [ Passed: 370 milliseconds]

d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script. [Ongoing will update
later]

Regards
Sotiris

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Thursday, January 16, 2014 4:41 PM
To: u...@mahout.apache.org; mahout
Subject: MAHOUT 0.9 Release - New URL

Third time's a Charm!!!


Here's the new URL for Mahout 0.9 Release:

https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/

For those volunteering to test this, some of the things to be verified:

a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run
through all the different options in each script.


Committers and PMC members:
---

Need 'at least 3 +1 votes' for the Release to pass.


Thanks and Regards.






Re: Mahout 0.9 release

2013-11-28 Thread Shannon Quinn
I'll aim to get the documentation on spectral clustering done by 0.9, and the 
code fixes and improvements in for 1.0.

iPhone'd

 On Nov 28, 2013, at 12:15, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Yes, let's defer the arbitrary properties to next release.
 
 
 
 
 
 On Thursday, November 28, 2013 11:02 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Was going to open M-1030 this weekend; I think doing the quick fix can be 
 done in time and the more involved job of putting arbitrary properties on 
 vectors should be pushed to 1.0.
 
 Sound reasonable?
 
 
 
 On Thu, Nov 28, 2013 at 7:58 AM, Suneel Marthi suneel_mar...@yahoo.com 
 wrote:
 
 Forgot to add 
 
 
 M-1288 Solr Recommender - Pat Ferrell
 
 to my earlier email.
 
 
 
 
 On Thursday, November 28, 2013 10:38 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 Adding Mahout-1349 to the list of JIRAs .
 
 
 
 
 
 On Thursday, November 28, 2013 10:37 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
 Update on Open JIRAs for 0.9:
 
 Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - all 
 related to Wiki updates, please see Isabel's updates.
 
 M-1286 - Peng and Sebastian, we had talked about this during the last
 hangout. Can this be included in 0.9?
 
 M-1030 - Andrew Musselman, it's critical that we get this into 0.9; it's been 
 deferred for the last 2 Mahout releases.
 
 M-1319, M-1328, M-1347, M-1350 - Suneel
 
 
 M-1265 - Multi Layer Perceptron, Yexi please look at my comments on 
 Reviewboard.
 
 M-1273 - Kun Yung, Ted, defer this to next release ???
 
 
 
 M-1312, M-1256 - Stevo, could u take one of them
 
 
 On Thursday, November 28, 2013 5:01 AM, Isabel Drost-Fromm 
 isa...@apache.org wrote:
 
 On Wed, 27 Nov 2013 14:23:11 -0800 (PST)
 Suneel Marthi suneel_mar...@yahoo.com wrote:
 Below are the Open issues for 0.9:-
 
 This looks like we should be targeting Dec. 9th as code freeze to me.
 What do you all think?
 
 
 Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - All
 related to Wiki updates, missing Wiki documentation and Wiki
 migration to new CMS.  Isabel's working on M-1245 (migrating to new
 CMS). Could some of the others be consolidated with that?
 
 I believe MAHOUT-1245 essentially is ready to be published - all I want
 before notifying INFRA to switch to the new cms based site is one other
 person to take at least a brief look.
 
 For MAHOUT-1304 - Sebastian, can you please check that the cms based
 site actually does fit on 1280px? We can close this issue then.
 
 MAHOUT-1305 - I think this should be turned into a task to actually
 delete most of the pages that have been migrated to the new CMS (almost
 all of them). Once 1245 is shipped, it would be great if a few more
 people could lend a hand in getting this done.
 
 MAHOUT-1307 - Can be closed once switched to CMS
 
 MAHOUT-1326 - This really relates to the old Confluence export plugin
 we once have been using to generate static pages out of our wiki that
 is no longer active. Unless anyone on the Mahout dev list knows how to
 fully delete all exported static pages we should file an issue with
 INFRA to ask for help getting those deleted. They definitely are
 confusing to users.
 
 
 
 M-1286 - Peng and ssc, we had talked about this during the last
 hangout. Can this be included in 0.9?
 
 M-1030 - Andrew Musselman? Any updates on this, it's important that we
 fix this for 0.9
 
 M-1319, M-1328, M-1347, M-1364 - Suneel
 
 M-1273 - Kun Yung, remember talking about this in one of the earlier
 hangouts; can't recall what was decided?
 
 M-1312, M-1256 - Dan Filimon (or Stevo??)
 
 M-996 - someone could pick this up (that is, if it's still relevant to the
 present codebase)
 
 I think this can move to the next release - according to the
 contributor and Sebastian the patch is rather hacky and there for
 illustration purposes only. I'd rather see some more thought go into
 that instead of pushing to have this in 0.9.
 
 
 M-1265 Yexi had submitted a patch for this, it would be good if this
 could go in as part of 0.9 
 
 M-1288 Solr Recommender - Pat Ferrell
 
 M-1285: Any takers for this?
 
 Would be nice to have - in particular if someone on dev@ (not
 necessarily a committer) wants to get started with the code base.
 Otherwise I'd say fix for next release if time gets short.
 
 
 M-1356: Isabel's started on this, Stevo could u review this?
 
 We definitely can punt that for the next release or even thereafter. It
 would be great if someone who has some knowledge of Java security
 policies would take a look. The implication of not fixing this
 essentially is that in case someone commits test code that writes
 outside of target or to some globally shared directory we might end up
 having randomly failing tests due to the parallel setup again. But as
 these will occur shortly after the commit it should be easy enough to
 find the code change that caused the breakage.
 
 
 
 M-1329: Support for Hadoop 2
 
 Is that truly 

Re: Mahout 0.9 release

2013-11-28 Thread Shannon Quinn
Possibly. I'll know more after Monday (got a few big deadlines then). 

iPhone'd

 On Nov 28, 2013, at 13:32, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Shannon,
 
 Would it be possible to add Spectral clustering to 
 examples/bin/cluster-reuters.sh (for 0.9)?
 
 
 
 
 
 

Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-21 Thread Shannon Quinn

Excellent. My todo list, then:

1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create 
a JIRA for it)


I have a question for #3 that can be a separate thread; mainly, what are 
the primary input formats I should be concerned with processing?


On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:

On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
Suneel Marthi suneel_mar...@yahoo.com wrote:


We are missing wiki docs for both Streaming kmeans and Spectral clustering.

I can pull something together for streaming kmeans.

Speaking of which we need to add a wiki page for Ted's t-digest once we figure 
out how it plays into Mahout (maybe as a measure of Streaming kmeans 
clustering, Ted??).

Given that we are in the process of migrating substantial parts of our wiki to 
the main website soon to be hosted in Apache CMS it would be great if you could 
add your content there. See also MAHOUT-1245 and 
http://markmail.org/thread/5ixlclhlh3acgcoq for some details.

Isabel




Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-21 Thread Shannon Quinn

That also gives me at least one answer for #3 :)

On 11/21/13, 4:03 PM, Suneel Marthi wrote:

On #2, it would be good if you could add Spectral KMeans to 
examples/bin/cluster-reuters.sh to process the Reuters dataset.









Re: spectral clustering additions [was: Mahout 0.9 release]

2013-11-20 Thread Shannon Quinn
Right; I won't propose its re-integration until I'm confident it works 
as advertised. I'm referring to the vanilla spectral clustering that's 
still in Mahout.


An example sounds good, will do.

On 11/20/13, 4:29 PM, Suneel Marthi wrote:

Shannon,

Eigencuts has been deprecated and removed from the present codebase. Do we need 
to revert that?

On Spectral clustering, please do add an example to 
examples/bin/cluster-reuters.sh.





On Wednesday, November 20, 2013 4:05 PM, Shannon Quinn squ...@gatech.edu 
wrote:
  
On that note, I wanted to ask: what does everyone feel needs to be done
to make the standard spectral clustering robust enough to be considered
a core algorithm? For me the biggest item was to have a job that
computes the pairwise similarities required (I've recently started
this), and I'd love to know what sort of input formats it should support
for conversion to a similarity matrix. Is there anything else?

Eigencuts is another matter; I'm working on streamlining the data
structures to make that more efficient.


-------- Original Message --------
Subject: Re: Mahout 0.9 release
Date: Wed, 20 Nov 2013 21:39:18 +0100
From: Isabel Drost-Fromm isa...@apache.org
Reply-To: dev@mahout.apache.org
To: dev@mahout.apache.org



On Wed, 20 Nov 2013 10:32:42 -0800 (PST)
Suneel Marthi suneel_mar...@yahoo.com wrote:


We are presently targeting 0.9 for Dec 9.

Speaking of which: Any helping hand (be it on fixing issues, reviewing patches, 
adding to the documentation) is highly welcome to make this happen! If you are 
unsure what tasks exactly the project urgently needs help with do not be afraid 
to ask on the mailing list.


Isabel




Re: Eigencuts version of spectral clustering

2013-09-04 Thread Shannon Quinn
Eigencuts was removed from 0.8. The fixed version was never released due to 
the bottleneck you described.

Off the books, it's still a work in progress, but I won't be petitioning the 
PMC to put it back in until it scales properly. 

iPhone'd

On Sep 4, 2013, at 16:10, Andrew Musselman andrew.mussel...@gmail.com wrote:

 Looks like this is finished as of May of this year, but is there still the
 bottleneck performance issue with it?  I.e., is it useful in production?
 
 Thanks
 Andrew


Re: You are invited to Apache Mahout meet-up

2013-08-22 Thread Shannon Quinn

I'm only sorry I'm not in the Bay area. Sounds great!

On 8/22/13 3:38 AM, Stevo Slavić wrote:

Retweeted meetup invite. Have fun!

Kind regards,
Stevo Slavic.


On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com wrote:


Very cool.

Would love to see folks turn out for this.


On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman
b.ellen.fried...@gmail.com wrote:


The Apache Mahout user group has been re-activated. If you are in the Bay
Area in California, join us on Aug 27 (Redwood City).

Sebastian Schelter will be the main speaker, talking about new directions
with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there
to do a short introduction for the meet-up and update on the 0.8 release.

Here's the link to rsvp: http://bit.ly/16K32hg

Hope you can come, and please spread the word.

Ellen





Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn

Meant to add: my vote is also for bi-weekly

On 6/12/13 7:26 AM, Grant Ingersoll wrote:

Hi,

One of the things we kicked around at Buzzwords was having a 
weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with 
good success, I believe).  Since we are so spread out, I thought I would throw 
out a Doodle (scheduling tool for those unfamiliar) to see what times work best 
for the majority of people interested in such a thing.  Anyone is free to 
participate, but this is not a Q and A session, but is instead focused on 
writing code, fixing bugs, triaging JIRA, releasing, etc.

If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8  
(note, all times are Eastern Time Zone since I did the poll!)  I just grabbed a 
sampling of hours throughout the day.  I also picked 1 week as being 
representative of this being on a repeating schedule.  If none of the times 
work for you, but you are still interested, please respond here.  I would 
imagine we would meet for 1-2 hours.

Also, please reply with the frequency at which you would like to meet:

[]  Weekly
[]  Bi-weekly (every 2 weeks)
[]  Monthly

My vote is every two weeks.

-Grant




Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn

Angel and Suneel, you may want to re-fill out the new doodle.

FYI, this week won't be representative of my schedule; I'm in the last 
few weeks of a job at ORNL where I travel every weekend. Normally I'll 
have more flexibility than just 6pm on weeknights.


On 6/12/13 8:26 AM, Grant Ingersoll wrote:

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:


+1, awesome idea

One question: the poll, while set to GMT -5, does say it's in Central Time. Is 
this a daylight savings thing?

I turned on Time Zone support, so not sure how it will look to others, but it 
sounds like it adjusts based on your location...  I see: 8 am, 10, 1, so on.

I also realize, that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv






Re: (Bi-)Weekly/Monthly Dev Sessions

2013-06-12 Thread Shannon Quinn
We have a good spread of people filling out both versions of the Doodle 
:) Here's the one Grant said is the correct one:


http://doodle.com/ymqaiwbh7khisnyv

On 6/12/13 1:44 PM, Andrew Musselman wrote:

Bi-weekly is good for me; I'm in Seattle and just filled out the poll.

Great idea!


On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:


+1, am in Seattle as well and would love to attend and be involved.

Sent from my iPhone

On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com
wrote:


Good idea on recurring meetings. I'm very interested in participating.
Biweekly works for me. I'm in Seattle (pacific) timezone - GMT-8.

An agenda for the meetings ahead of time will help us get the most out of our
time at the meetings.

Thanks.
On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote:


On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote:


Angel and Suneel, you may want to re-fill out the new doodle.

FYI, this week won't be representative of my schedule; I'm in the last
few weeks of a job at ORNL where I travel every weekend. Normally I'll
have more flexibility than just 6pm on weeknights.

Yeah, Doodle makes you pick dates, but I just want it to be representative
of a week-long period of time and not tied to a specific set of dates.  So,
just put in what your ideal times are in general and ignore the fact that
it is set to next week.


On 6/12/13 8:26 AM, Grant Ingersoll wrote:

On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote:


+1, awesome idea

One question: the poll, while set to GMT -5, does say it's in Central
Time. Is this a daylight savings thing?

I turned on Time Zone support, so not sure how it will look to others,
but it sounds like it adjusts based on your location...  I see: 8 am,
10, 1, so on.

I also realize, that I messed it up.  I meant 9 pm, not 9 am.

Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv


Grant Ingersoll | @gsingers
http://www.lucidworks.com










Re: 0.8 progress

2013-06-09 Thread Shannon Quinn

M-1250 lgtm.

On 6/9/13 4:58 PM, Grant Ingersoll wrote:

7 issues remaining:

M-833 -- Suneel
M-975 -- Ted
M-1030 -- Suneel
M-1067 -- Dmitriy  --  This is an enhancement, should we push?
M-1147 -- Jake
M-1233 -- Yannis (Grant?)
M-1250 -- Sebastian (but all of us should chime in)

In theory, 833 and 1067 can be pushed, but I think all others are blockers.

-Grant


On Jun 9, 2013, at 8:51 AM, Grant Ingersoll gsing...@apache.org wrote:


I'm on M-1211 and 1247 (M-992 is related)  Will be on IRC for a few hours this 
morning.

-Grant

On Jun 9, 2013, at 1:48 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:


Working on M-833.

From: Suneel Marthi suneel_mar...@yahoo.com
To: dev@mahout.apache.org dev@mahout.apache.org
Sent: Saturday, June 8, 2013 6:09 PM
Subject: Re: 0.8 progress

I will be looking at M-833 and M-1030 tonight.

I can get the initial limited functionality for M-884 as part of 0.8 release by 
tomorrow. Thanks to Robin for reviewing.







From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Saturday, June 8, 2013 5:09 PM
Subject: Re: 0.8 progress


I've got 1103 and 1126 close to done.  Should be in by tomorrow.

On Jun 8, 2013, at 4:18 PM, Robin Anil robin.a...@gmail.com wrote:


Down to 15.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sat, Jun 8, 2013 at 12:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:


I am done with M-1026.





From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Saturday, June 8, 2013 10:42 AM
Subject: Re: 0.8 progress


Hmm, JIRA seems to be down...

1084 is in.  I'm pretty close to being done on 1103.

I'm on #mahout on Freenode if anyone wants to coordinate, and will be
there for the next 1 hour or so.

On Jun 8, 2013, at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote:


We are down to 18 issues!  Let's keep cranking.

I'm working on 1103 and 1084 at the moment.

On Jun 6, 2013, at 12:00 PM, Grant Ingersoll gsing...@apache.org wrote:

On Jun 6, 2013, at 12:12 PM, Sebastian Schelter ssc.o...@googlemail.com wrote:

Hi Grant,

Here's my take:

Will/Must be finished:
M-944 [include]

^ Committed.


M-958 [include]
M-975 [include]
M-1084 [include]
M-1098 [include]
M-1103 [include]
M-1126 [push if no one steps up]
M-1147 [include]
M-1211 [push if no one steps up]
M-1233 [push if no one steps up]
M-1241 [include]

Can be pushed if no one steps up:
M-627 [push if no one steps up]
M-833 [push if no one steps up]
M-1163 [push if no one steps up]
M-1164 [push if no one steps up]
M-1243 [include]
M-992 [include]

^ Working on this now.


M-996 [push if no one steps up]
M-1067 [include]

Unsure:
M-974 [push if no one steps up]
M-1026 [push if no one steps up]
M-1030 [unsure]


On 06.06.2013 11:26, Grant Ingersoll wrote:

Working from the link below, we are down to 22 issues.



https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Here's my opinion (and only my opinion, please vote, change as you
see fit) based on a cursory glance of the state of these as to what needs
to be in the release and what can be pushed:

Will/Must be finished:
M-944
M-958
M-975
M-1084
M-1098
M-1103
M-1126
M-1147
M-1211
M-1233
M-1241

Can be pushed if no one steps up:
M-627
M-833
M-1163
M-1164
M-1243
M-992
M-996
M-1067

Unsure:
M-974
M-1026
M-1030



Grant Ingersoll | @gsingers
http://www.lucidworks.com


















Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Shannon Quinn



Clustering

- Fuzzy k-Means o.a.m.clustering.fuzzykmeans
- Spectral k-Means in o.a.m.clustering.spectral

-1 on spectral being dropped as that seems to receive decent traction.

Agreed, given recent activity in particular. However I would put forth 
deprecating Eigencuts (o.a.m.clustering.eigencuts) until such time that 
it can be made scalable.




Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Shannon Quinn
Sorry, that's o.a.m.clustering.spectral.eigencuts. Then move the .kmeans 
package to simply be o.a.m.clustering.spectral .


On 6/8/13 1:37 PM, Shannon Quinn wrote:



Clustering

- Fuzzy k-Means o.a.m.clustering.fuzzykmeans
- Spectral k-Means in o.a.m.clustering.spectral

-1 on spectral being dropped as that seems to receive decent traction.
Agreed, given recent activity in particular. However I would put forth 
deprecating Eigencuts (o.a.m.clustering.eigencuts) until such time 
that it can be made scalable.






[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Shannon Quinn (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675836#comment-13675836 ]

Shannon Quinn commented on MAHOUT-1214:
---

@Yiqun: I would suggest making this as general as possible. Don't confine it to 
just spectral k-means. Submit a patch and we can look it over.

@Grant: Unless the patch came in today, I don't think we could have it ready 
for inclusion in 0.8.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
  Labels: clustering, improvement
 Fix For: Backlog


 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K > 1; we have an idea and implementation to select based on 
 cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal and 
 sometimes a bad initialization will generate a wrong clustering result. In this 
 case, the selected K eigenvectors actually provide a better way to initialize 
 cluster centroids because each selected eigenvector is a relaxed indicator of 
 the memberships of one cluster. For every selected eigenvector, we use the 
 data point whose eigen component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.
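
The two ideas in the description above can be illustrated with a minimal
sketch. This is not the reporter's patch and not Mahout code: the function
names (cosine, select_orthogonal, seed_indices), the greedy selection
strategy, and the max_cos threshold are all assumptions made for
illustration. Eigenvectors are plain lists with one component per data
point, and are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(eigenvectors, k, max_cos=0.1):
    """Greedily keep up to k eigenvectors that are nearly orthogonal
    (|cos| <= max_cos) to every eigenvector kept so far.
    The greedy order and the threshold are illustrative assumptions."""
    selected = []
    for vec in eigenvectors:
        if all(abs(cosine(vec, kept)) <= max_cos for kept in selected):
            selected.append(vec)
        if len(selected) == k:
            break
    return selected


def seed_indices(selected):
    """For each selected eigenvector, return the index of the data point
    whose component has the largest absolute value (Issue 2's seeding)."""
    return [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in selected]
```

The indices returned by seed_indices would then pick the initial k-means
centroids from the embedded data points instead of drawing them at random.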

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Shannon Quinn (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675902#comment-13675902 ]

Shannon Quinn commented on MAHOUT-1214:
---

Developing a better input format for spectral kmeans has been on my to-do list 
ever since writing the algorithm. Unfortunately, to handle any sort of raw data 
format, it requires n^2 pairwise comparisons which is not trivial in a Hadoop 
setting. [1] describes various methods of achieving an efficient MapReduce 
implementation for computing the affinity matrix. This is ultimately the route 
we should go, ideally creating it as a separate job with tunable parameters 
that spectral kmeans will invoke.

In the meantime, we can probably put a check in the job that reads the affinity 
matrix to find zeros and ignore them.

[1] http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5444877
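
To make the cost concrete, here is a minimal single-machine sketch of the
pairwise step, under stated assumptions: a Gaussian kernel as the similarity
measure, and a hypothetical function name (gaussian_affinity) and parameters
(sigma, threshold) that do not come from Mahout. It performs the n*(n-1)/2
comparisons directly and drops near-zero entries, which is the "find zeros
and ignore them" idea in its simplest form; it is not the MapReduce approach
described in [1].

```python
import math


def gaussian_affinity(points, sigma=1.0, threshold=1e-6):
    """Return a sparse symmetric affinity matrix as {(i, j): weight}, i != j.

    Weights below `threshold` are treated as zeros and omitted, so
    downstream code never has to read or store them.
    """
    affinities = {}
    n = len(points)
    for i in range(n):
        # Only the upper triangle is computed; the matrix is symmetric.
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            if weight >= threshold:
                affinities[(i, j)] = weight
                affinities[(j, i)] = weight
    return affinities
```

With two nearby points and one distant point, only the nearby pair survives
the threshold, so the sparse result holds two symmetric entries rather than
a full 3x3 matrix.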



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-23 Thread Shannon Quinn (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665190#comment-13665190 ]

Shannon Quinn commented on MAHOUT-1214:
---

This all looks great. With the work I did on Eigencuts this semester, there are 
some optimizations in the data structures I'd like to test that might further 
help spectral kmeans' performance, in addition to looking into ball kmeans, 
streaming kmeans, and SSVD.

I still have a question to Yiqun: if you've implemented an orthogonality check 
in EigenVerificationJob, how is this not something that can be applied to 
EigenVerificationJob in general, as opposed to only spectral kmeans?

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
  Labels: clustering, improvement

 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K>1. We have an idea and implementation to select eigenvectors 
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will generate a wrong clustering result. In 
 this case, the selected K eigenvectors actually provide a better way to 
 initialize cluster centroids because each selected eigenvector is a relaxed 
 indicator of the memberships of one cluster. For every selected eigenvector, 
 we use the data point whose eigen component achieves the maximum absolute 
 value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.
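
The initialization described under Issue 2 can be sketched roughly as follows. This is plain Python, not the contributed patch; the function name init_centroids is hypothetical, and using the chosen point's row of eigenvector components as the centroid is an assumption about how the selected point would be turned into a centroid:

```python
def init_centroids(eigenvectors):
    """Pick one seed point per eigenvector.

    eigenvectors: list of K lists, each of length n (one entry per data point).
    For every eigenvector, the data point whose component has the largest
    absolute value is chosen. Returns (indices, centroids), where
    centroids[j] is the K-dimensional embedded row of the j-th chosen point.
    Two eigenvectors may peak at the same point; this sketch does not
    handle that case.
    """
    n = len(eigenvectors[0])
    indices = [max(range(n), key=lambda i: abs(vec[i])) for vec in eigenvectors]
    centroids = [[vec[i] for vec in eigenvectors] for i in indices]
    return indices, centroids
```

Because the choice is deterministic, repeated runs on the same eigenvectors start from the same centroids, unlike random seeding.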



[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-23 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665203#comment-13665203
 ] 

Shannon Quinn commented on MAHOUT-1177:
---

Yu Lee and Yexi: For the time being, I'd be on board with shelving the addition 
of any new clustering algorithms, and instead focusing on improving 
documentation and unifying the APIs for the existing ones. I think that would 
help scope your work a little more effectively, while still providing an 
extremely valuable body of work. Plus, it would greatly aid the development of 
new algorithms to have a specific interface to build into. Beyond that, I think 
your ideas are good and would encourage you to start laying out your specific 
plans.

Ravi: I would suggest browsing the open JIRAs for Mahout and submitting a patch 
for one you think you can tackle. Please feel free to ping our email list if 
you have specific questions; general questions should also go to the list 
rather than to JIRA.


 GSOC 2013: Reform and simplify the clustering APIs
 --

 Key: MAHOUT-1177
 URL: https://issues.apache.org/jira/browse/MAHOUT-1177
 Project: Mahout
  Issue Type: Improvement
Reporter: Dan Filimon
  Labels: gsoc2013, mentor

 Clustering is one of the most used features in Mahout and has many 
 applications [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
 We have lots of clustering algorithms. There's:
 - basic k-means
 - canopy clustering
 - Dirichlet clustering
 - Fuzzy k-means
 - Spectral k-means
 - Streaming k-means [coming soon]
 We want to make them easier to use by updating the APIs and making sure they 
 all work in the same way and have consistent inputs, outputs, diagnostics and 
 documentation.
 This is a great way to gain an in-depth understanding of clustering 
 algorithms, familiarize yourself with Hadoop, Mahout clustering and good 
 software engineering principles.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661587#comment-13661587
 ] 

Shannon Quinn commented on MAHOUT-1214:
---

Ted,

I'm not sure I follow. You mean use SSVD exclusively in place of Lanczos?

I'd love to assess performance and accuracy with ball or streaming k-means 
instead. That's an excellent idea.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
  Labels: clustering, improvement

 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K>1. We have an idea and implementation to select eigenvectors 
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will generate a wrong clustering result. In 
 this case, the selected K eigenvectors actually provide a better way to 
 initialize cluster centroids because each selected eigenvector is a relaxed 
 indicator of the memberships of one cluster. For every selected eigenvector, 
 we use the data point whose eigen component achieves the maximum absolute 
 value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-16 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659680#comment-13659680
 ] 

Shannon Quinn commented on MAHOUT-1214:
---

1: Examining the orthogonality of eigenvectors has to do with 
EigenVerificationJob, a part of the distributed Lanczos pipeline. It's used in 
spectral KMeans, but also elsewhere in Mahout (essentially any time the 
distributed Lanczos solver is used). Unless you're referring to a check that's 
specific to the spectral KMeans domain?

2: This is an excellent point of improvement. I look forward to seeing the 
patch.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
  Labels: clustering, improvement

 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
 NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious trivial dataset. We have 
 implemented a solution to resolve these two issues and hope to contribute it 
 back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain the correct clustering results for 
 the case of K>1. We have an idea and implementation to select eigenvectors 
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will generate a wrong clustering result. In 
 this case, the selected K eigenvectors actually provide a better way to 
 initialize cluster centroids because each selected eigenvector is a relaxed 
 indicator of the memberships of one cluster. For every selected eigenvector, 
 we use the data point whose eigen component achieves the maximum absolute 
 value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version gets the optimal clustering result while the current 
 0.7 version obtains the wrong result.



Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-02 Thread Shannon Quinn
This sounds excellent. I'd be happy to assist in unifying the interfaces 
of the spectral methods in particular.


On 5/2/13 3:54 PM, Yu Lee (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647841#comment-13647841
 ]

Yu Lee commented on MAHOUT-1177:


Hello Robin Anil, Jeff Eastman, Dan Filimon, and Ted Dunning,

Yexi and I (Yu Lee) are new to this Mahout community. We want to contribute to 
the improvement of Mahout by reforming and simplifying the clustering APIs per 
the following link:
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120

We have gone through the code of Mahout clustering. Now we have some ideas 
about improving it:

=
Addressing the problems in the current interface:

Test cases are missing. For example, in spectral kmeans clustering, the run 
methods of SpectralKmeansDriver and EigencutsDriver are not tested.

Documentation is missing for some methods. For example, in the run method of 
DirichletDriver, the description of the parameter 'numModels' is missing; in 
the run method of SpectralKmeansDriver, the descriptions of some arguments are 
missing.

Some testing methods lack a specific description of their arguments. For 
example, in the run method of FuzzyKmeansDriver, the description of the 
argument m (fuzzification factor) is missing. Although a wiki link regarding 
cluster analysis is given, it is not clear enough.

-

Implementing some new clustering algorithms

Agglomerative hierarchical clustering, which clusters the data points into 
a dendrogram, so that users can choose whatever number of clusters they 
want. (http://en.wikipedia.org/wiki/Hierarchical_clustering)

DBSCAN, which is a density-based clustering method that can identify 
clusters with arbitrary shapes, and is useful in spatial clustering. 
(http://en.wikipedia.org/wiki/DBSCAN)

-

Providing a new unified interface

Currently, each clustering algorithm has its own implemented class with 
a different interface (i.e., the run methods in different Drivers have 
different argument lists). However, it would be better to have a unified 
interface to execute all available clustering methods. An example interface is 
as follows:

Clustering-run(input, output, methodClass, clusteringConfig)

Here, the methodClass indicates a specific clustering method, while 
clusteringConfig indicates the configuration for this specific clustering method.
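
To make the proposed shape concrete, a minimal sketch of such a unified entry point follows. It is plain Python and not Mahout code: ClusteringMethod, NearestSeed and run are hypothetical names, and in-memory lists stand in for the input and output paths a real driver would take:

```python
from abc import ABC, abstractmethod


class ClusteringMethod(ABC):
    """Common base class: every method is built from a config dict."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def cluster(self, points):
        """Return one cluster label per input point."""


class NearestSeed(ClusteringMethod):
    """Toy method: label each point with the index of its closest seed."""

    def cluster(self, points):
        seeds = self.config["seeds"]

        def dist2(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        return [min(range(len(seeds)), key=lambda j: dist2(p, seeds[j]))
                for p in points]


def run(input_points, output, method_class, clustering_config):
    """Single entry point: instantiate the chosen method and run it."""
    method = method_class(clustering_config)
    labels = method.cluster(input_points)
    output.extend(labels)  # stand-in for writing results to an output path
    return labels
```

With this shape, switching algorithms means changing only the method_class and clustering_config arguments, for example run(points, out, NearestSeed, {"seeds": [[0.0], [4.0]]}).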

=

Could you please let us know what you think about our ideas?


 

GSOC 2013: Reform and simplify the clustering APIs
--

 Key: MAHOUT-1177
 URL: https://issues.apache.org/jira/browse/MAHOUT-1177
 Project: Mahout
  Issue Type: Improvement
Reporter: Dan Filimon
  Labels: gsoc2013, mentor

Clustering is one of the most used features in Mahout and has many applications 
[http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
We have lots of clustering algorithms. There's:
- basic k-means
- canopy clustering
- Dirichlet clustering
- Fuzzy k-means
- Spectral k-means
- Streaming k-means [coming soon]
We want to make them easier to use by updating the APIs and making sure they 
all work in the same way and have consistent inputs, outputs, diagnostics and 
documentation.
This is a great way to gain an in-depth understanding of clustering algorithms, 
familiarize yourself with Hadoop, Mahout clustering and good software 
engineering principles.





Re: Gsoc 2013 question

2013-04-09 Thread Shannon Quinn

Hi there.

If you don't have a fully-formed project idea or are otherwise looking 
for suggestions, feel free to post your question here.


Shannon

On 4/9/13 1:38 PM, George Zografos wrote:

Hello mahout dev community.
I have a question regarding a project idea for GSOC 2013.
Should I post it here or to JIRA as a comment?





Re: Welcome Suneel Marthi and Dan Filimon

2013-04-04 Thread Shannon Quinn

Congratulations! :)

On 4/4/13 6:30 AM, Grant Ingersoll wrote:

In recognition of the contributions of Suneel Marthi and Dan Filimon to the 
Mahout project, the PMC is pleased to announce both have accepted our 
invitations to join the Mahout project as committers.

As is customary, I will leave it to Suneel and Dan to provide a little bit of 
background on who they are.

Congratulations!

-Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com










Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-04-04 Thread Shannon Quinn
According to the GSoC calendar, accepted organizations aren't posted 
until April 8 (Monday), at which point (assuming Apache is accepted...I 
can't imagine it wouldn't be) slots will be doled out internally. This 
will probably take at least a day or two, so probably by middle of next 
week we'll know how many slots Mahout has.


Speaking of which: how do the various subprojects negotiate for slots? 
Is there a central spreadsheet, or an IRC meeting to attend? Or did I 
miss the email detailing this?


On 4/4/13 2:43 PM, Dan Filimon wrote:

Any news on this front? Did we get approved/assigned a slot/anything?


On Fri, Mar 29, 2013 at 7:44 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:


Ok, updated!


On Fri, Mar 29, 2013 at 7:36 PM, Andy Twigg andy.tw...@gmail.com wrote:


Dan,

I think what you've written is fine (I wanted to edit to remove the
'?' around random forests but couldn't).

ok?



On 29 March 2013 11:14, Dan Filimon dangeorge.fili...@gmail.com wrote:

I added Andy's first suggestion and Ted's suggestion as ideas.

Andy, could you flesh out your second suggestion into a project and make an
issue please?


On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning ted.dunn...@gmail.com wrote:

It should be possible to view a Lucene index as a matrix.  This would
require that we standardize on a way to convert documents to rows.  There
are many choices, the discussion of which should be deferred to the actual
work on the project, but there are a few obvious constraints:

a) it should be possible to get the same result as dumping the term vectors
for each document each to a line and converting that result using standard
Mahout methods.

b) numeric fields ought to work somehow.

c) if there are multiple text fields that ought to work sensibly as well.
Two options include dumping multiple matrices or to convert the fields
into a single row of a single matrix.

d) it should be possible to refer back from a row of the matrix to find the
correct document.  This might be because we remember the Lucene doc number
or because a field is named as holding a unique id.

e) named vectors and matrices should be used if plausible.
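
As a rough illustration of constraints (a) and (d), the sketch below turns per-document term weights into sparse matrix rows while remembering which document each row came from. It is plain Python and does not use the Lucene or Mahout APIs; docs_to_matrix and the field names "id" and "terms" are hypothetical:

```python
def docs_to_matrix(docs, id_field="id", text_field="terms"):
    """Convert documents to sparse rows plus a row-to-document mapping.

    docs: list of dicts, each with a unique id and a {term: weight} mapping.
    Returns (rows, row_ids, term_index):
      rows[i]       sparse row as {column: weight}
      row_ids[i]    id of the document that produced rows[i]
      term_index    {term: column}, assigned in first-seen order
    """
    term_index = {}
    row_ids = []
    rows = []
    for doc in docs:
        row = {}
        for term, weight in doc[text_field].items():
            col = term_index.setdefault(term, len(term_index))
            row[col] = row.get(col, 0.0) + weight
        row_ids.append(doc[id_field])
        rows.append(row)
    return rows, row_ids, term_index
```

Keeping row_ids alongside the rows is one simple way to refer back from a matrix row to its document; a named vector would carry the same information on the row itself.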

On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:
...
Ted, could you explain a bit more what you mean by simplify the connection
to Lucene for clustering and classification? It's too vague for an idea
proposal.




--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.tw...@cs.ox.ac.uk | +447799647538







Re: Call to action – Mahout needs your help

2013-03-26 Thread Shannon Quinn
I would love to help in any way I can. I'm fairly busy with my PhD 
studies until early May when I shift to an internship for the summer, so 
if I could have some help setting up tickets on JIRA for things we'd 
like to see done, I could take over the legwork once the summer hits. 
I'd be happy to work with Dan and mentor at least one student.


Shannon

On 3/26/13 10:06 AM, Isabel Drost wrote:

On Tue, Mar 26, 2013 at 12:12 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:


If you guys decide to participate in GSOC this year, I'd be happy to
spread the word and maybe even have a presentation about Mahout at
school. Also, since I'm squarely on the student side (doing my senior
project with Ted on Mahout) I think I have a good grasp of what the
problems are, especially for a beginner student.

And, if you do pick someone, I could help them part-time (especially
if they're from my school, you know, timezone and language help)
. Of course, I wouldn't really want to be the main mentor since I'm
still really new and not a committer yet. :)


That sounds like an awesome proposal to me. What do others think?


Isabel





Re: Call to action – Mahout needs your help

2013-03-25 Thread Shannon Quinn





I think that you mentioned a very good point with stating that it is not
clear whether Mahout is a library or a standalone program to interact with
via the command line. IMO, it's first and foremost a library (similar to
Lucene), and this should also be reflected in the codebase.

That is my view as well and I think we have been moderately successful at it.


+1


As for the complexity issue, I don't know that we ever solve it, we just need 
to identify contributors in those areas quickly, mentor them, and make them 
committers as soon as they are ready.


On that note: GSoC is coming up, and I think it's a great opportunity to 
build some momentum in this direction. I know that when students see 
scalable machine learning their first thought isn't improving testing 
and documentation, but if we pushed hard in those areas specifically, in 
addition to making a broad effort on JIRA to elucidate exactly what 
needs work, we could likely pick up several quality students that could 
make lasting contributions.





I think that Mahout is and should always be more than recommenders, but
that we should be more courageous in throwing out things that are not
used very much or not maintained very much or don't meet the quality
standards which we would like to see.


+1. On my end of things, while I do think some sort of canonical 
spectral clustering algorithm would be very useful to have, e.g. 
spectral k-means, the Eigencuts algorithm is one example of something 
that is so specialized that it could probably be jettisoned.


[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-12 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdpoweriter.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-12 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600292#comment-13600292
 ] 

Shannon Quinn commented on MAHOUT-1159:
---

Agreed on all points, not sure why I missed that. I've attached the patch and 
will commit it unless you have any problems with it.

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-12 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdpoweriter.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-12 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159-ssvdpoweriter.patch)

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-12 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600346#comment-13600346
 ] 

Shannon Quinn commented on MAHOUT-1159:
---

Committed. Thanks for your input.

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch, 
 MAHOUT-1159-ssvdpoweriter.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Resolved] (MAHOUT-794) Eigencuts produces unexpected results, part 2

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn resolved MAHOUT-794.
--

Resolution: Invalid

The exact problems with Eigencuts are numerous; they should each be their own 
tickets. Those will be forthcoming soon. For that reason, I am closing this one 
as it is too broad.

 Eigencuts produces unexpected results, part 2
 -

 Key: MAHOUT-794
 URL: https://issues.apache.org/jira/browse/MAHOUT-794
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.5
Reporter: Sean Owen
Assignee: Shannon Quinn
 Fix For: 0.8


 See MAHOUT-516, which was closed. Looks like Shannon believes there is a 
 follow-on issue. I'm just opening a new issue to track this for 0.6.
 This is an issue in the workflow of the Eigencuts algorithm; some part of it 
 is not implemented correctly. More details to follow.



Spectral fixes

2013-03-11 Thread Shannon Quinn
I have a load of fixes in the pipeline for the spectral clustering 
algorithms. The work on Eigencuts is extensive and still ongoing, so 
while I will post those tickets, the fixes will likely not make it for 0.8.


SpectralKmeans, however, has numerous fixes that are ready to go. Before 
I post and commit them, I would like some input on the following items:


1: We added the option to use SSVD in place of the Lanczos solver. Would 
it be acceptable to have a command-line flag to specify the solver to use?
2: Lots of temporary files are generated by the numerous MR jobs chained 
together. Is there a rule of thumb for whether or not to delete these 
intermediate files after running the whole job? Right now I have a 
command-line flag to indicate whether they should be removed or not.
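
For discussion, the two flags could look something like the sketch below. It uses Python's argparse rather than Mahout's own option handling, and the flag names --solver and --keep-temp are assumptions made for illustration, not the names in the pending patch:

```python
import argparse


def build_parser():
    """Command-line options for choosing the eigensolver and temp-file cleanup."""
    parser = argparse.ArgumentParser(prog="spectral-kmeans-sketch")
    parser.add_argument(
        "--solver",
        choices=["lanczos", "ssvd"],
        default="lanczos",
        help="eigensolver to use (hypothetical flag; Lanczos stays the default)",
    )
    parser.add_argument(
        "--keep-temp",
        action="store_true",
        help="keep intermediate files instead of deleting them after the run",
    )
    return parser
```

Defaulting to deletion and making retention opt-in keeps a normal run tidy while still allowing the intermediate output to be inspected when debugging.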


Thanks!

Shannon


[jira] [Created] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1159:
-

 Summary: Add SSVD option to SpectralKMeans
 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8


This adds SSVD as an option for eigensolver, in addition to the [default] 
Lanczos solver. Testing indicated it yielded similar clustering accuracy with a 
possible performance boost.

This patch includes other small fixes, such as using the default tempDir for 
intermediate calculations.

The initialization of the SSVD solver is a bit awkward, with specifying the 
number of reducers. I hard-coded this at 10; is there a better solution? 
Perhaps making it an optional parameter to the SSVD constructor?
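
One way the optional-parameter idea could look, sketched in plain Python rather than against the actual SSVD solver class. SsvdOptions and its fields are hypothetical, and the default of 10 simply mirrors the hard-coded value mentioned above:

```python
class SsvdOptions:
    """Settings holder with an optional reducer count."""

    DEFAULT_REDUCERS = 10  # mirrors the value currently hard-coded

    def __init__(self, rank, reducers=None):
        if rank < 1:
            raise ValueError("rank must be at least 1")
        if reducers is not None and reducers < 1:
            raise ValueError("reducers must be at least 1")
        self.rank = rank
        # Fall back to the default only when the caller gives no value.
        self.reducers = self.DEFAULT_REDUCERS if reducers is None else reducers
```

Callers that do not care keep today's behavior with SsvdOptions(5), while a cluster-aware caller can pass reducers explicitly.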

[Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch


 This adds SSVD as an option for eigensolver, in addition to the [default] 
 Lanczos solver. Testing indicated it yielded similar clustering accuracy with 
 a possible performance boost.
 This patch includes other small fixes, such as using the default tempDir 
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward, with specifying the 
 number of reducers. I hard-coded this at 10; is there a better solution? 
 Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159.patch)

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Resolved] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn resolved MAHOUT-1159.
---

Resolution: Fixed

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdopts.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Commented] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599237#comment-13599237
 ] 

Shannon Quinn commented on MAHOUT-1159:
---

Excellent points, thanks. Here's a new patch with the suggested fixes; let me
know if that works.

The only reason I noticed the discrepancy in the documentation is that SSVD in
spectral k-means was originally tested in standalone mode, and obviously the
DistributedCache is not available in that case.

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: (was: MAHOUT-1159-ssvdopts.patch)

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Updated] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1159:
--

Attachment: MAHOUT-1159-ssvdopts.patch

 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]



[jira] [Comment Edited] (MAHOUT-1159) Add SSVD option to SpectralKMeans

2013-03-11 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599237#comment-13599237
 ] 

Shannon Quinn edited comment on MAHOUT-1159 at 3/11/13 8:36 PM:


Excellent points, thanks. Here's a new patch with the suggested fixes; let me
know if that works.

The only reason I noticed the discrepancy in the documentation is that SSVD in
spectral k-means was originally tested in standalone mode, and obviously the
DistributedCache is not available in that case.

Updated the patch: it now actually uses the new parameters.

  was (Author: magsol):
Excellent points, thanks. Here's a new patch with the suggested fixes; let
me know if that works.

The only reason I noticed the discrepancy in the documentation is that SSVD in
spectral k-means was originally tested in standalone mode, and obviously the
DistributedCache is not available in that case.
  
 Add SSVD option to SpectralKMeans
 -

 Key: MAHOUT-1159
 URL: https://issues.apache.org/jira/browse/MAHOUT-1159
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Shannon Quinn
Assignee: Shannon Quinn
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1159.patch, MAHOUT-1159-ssvdopts.patch


 This adds SSVD as an option for the eigensolver, in addition to the [default]
 Lanczos solver. Testing indicated that it yields similar clustering accuracy,
 with a possible performance boost.
 This patch also includes other small fixes, such as using the default tempDir
 for intermediate calculations.
 The initialization of the SSVD solver is a bit awkward because it requires
 specifying the number of reducers. I hard-coded this at 10; is there a better
 solution? Perhaps making it an optional parameter to the SSVD constructor?
 [Thanks to University of Pittsburgh CS undergraduates Andrew King, Pawan 
 Solanki, and Philip Schinis for working on this.]


