Re: VOTE: take 2: mahout-collections-1.0
+1 On Mon, Apr 12, 2010 at 4:50 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 (on trust, really) On Sun, Apr 11, 2010 at 6:49 PM, Benson Margulies bimargul...@gmail.com wrote: https://repository.apache.org/content/repositories/orgapachemahout-015/ contains (this time for sure) all the artifacts for release 1.0 of the mahout-collections component. This is the first release of collections independent of the rest of Mahout; it differs from the version released with Mahout 0.3 only in removing a dependency on slf4j. This vote will remain open for 72 hours.
Re: VOTE: release mahout-collections-codegen 1.0
+1 On Thu, Apr 8, 2010 at 2:57 AM, Drew Farris drew.far...@gmail.com wrote: +1 On Tue, Apr 6, 2010 at 9:08 PM, Benson Margulies bimargul...@gmail.com wrote: In order to decouple the mahout-collections library from the rest of Mahout, to allow more frequent releases and other good things, we propose to release the code generator for the collections library as a separate Maven artifact. (Followed, in short order, by the collections library proper.) This is proposed release 1.0 of mahout-collections-codegen-plugin. This is intended as a Maven-only release; we'll put the artifacts in the Mahout download area as well, but we don't ever expect anyone to use this except from Maven, inasmuch as it is a Maven plugin. The release artifacts are in the Nexus stage, as follows: https://repository.apache.org/content/repositories/orgapachemahout-006/ This vote will remain open for 72 hours.
Re: [DISCUSS] Mahout TLP Board Resolution
should be 'Abdelhakim Deneche' ... because my first name is 'Abdelhakim'. On Thu, Mar 18, 2010 at 1:07 PM, Grant Ingersoll gsing...@apache.org wrote: So here's the update: X. Establish the Apache Mahout Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a machine learning platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Mahout Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Mahout Project be and hereby is responsible for the creation and maintenance of software related to a machine learning platform; and be it further RESOLVED, that the office of Vice President, Apache Mahout be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Mahout Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Mahout Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Mahout Project: • Deneche Abdelhakim (adene...@...) • Isabel Drost (isa...@...) • Ted Dunning (tdunn...@...) • Jeff Eastman (jeast...@...) • Drew Farris (d...@...) • Grant Ingersoll (gsing...@...) • Benson Margulies (bimargul...@...) • Sean Owen (sro...@...) • Robin Anil (robina...@...) • Jake Mannix (jman...@...) RESOLVED, that the Apache Mahout Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Mahout sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Mahout sub-project encumbered upon the Apache Mahout Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Sean Owen be appointed to the office of Vice President, Apache Mahout, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed.
Re: [DISCUSS] Mahout TLP Board Resolution
close, actually: عبد الحكيم =D On Thu, Mar 18, 2010 at 6:41 PM, Benson Margulies bimargul...@gmail.com wrote: Or perhaps: عبدل حكيم ? On Thu, Mar 18, 2010 at 1:34 PM, deneche abdelhakim adene...@gmail.com wrote: should be 'Abdelhakim Deneche' ... because my first name is 'Abdelhakim'.
Re: [DISCUSS] Mahout TLP Board Resolution
just to get it right: not being in the PMC doesn't mean I'm no longer a committer, right? On Mon, Mar 15, 2010 at 6:08 PM, Jake Mannix jake.man...@gmail.com wrote: +1 and I'm in (my email @apache is just jmannix btw; for some reason it's not listed on those resolutions) On Mar 15, 2010 9:07 AM, Robin Anil robin.a...@gmail.com wrote: I'm in :) :thumbs up: On Mon, Mar 15, 2010 at 8:01 PM, Grant Ingersoll gsing...@apache.org wrote: Now that 0.3 is almost out and also given discussions over on gene...@lucene.a.o, I think we ca...
Re: [jira] Commented: (MAHOUT-323) Classify new data using Decision Forest
oops, I will attach it as soon as possible. I really wonder why 'Submit Patch' and 'Attach Patch' are two different operations in JIRA? On Sat, Mar 6, 2010 at 10:08 PM, Robin Anil (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842313#action_12842313 ] Robin Anil commented on MAHOUT-323: --- No patch? Forgot to attach? Don't bother too much about the code freeze. Since this is a feature that could help people use RF as a classifier even more than it can now, I guess you can keep it for 0.3, with some documentation of course. But before that, attach :) Classify new data using Decision Forest --- Key: MAHOUT-323 URL: https://issues.apache.org/jira/browse/MAHOUT-323 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.4 Reporter: Deneche A. Hakim Assignee: Deneche A. Hakim When building a Decision Forest we should be able to store it somewhere and use it later to classify new datasets -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAHOUT-323) Classify new data using Decision Forest
yes, I'm planning to make DF look more like a Mahout classifier. I will take a look at bayes. On Sun, Mar 7, 2010 at 7:09 PM, Robin Anil (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842459#action_12842459 ] Robin Anil commented on MAHOUT-323: --- Hey Deneche, can we have a single entry point to the classifier? You are free to modify Train and Test Classifier of bayes. Or keep the same naming convention in df and in bayes? Classify new data using Decision Forest --- Key: MAHOUT-323 URL: https://issues.apache.org/jira/browse/MAHOUT-323 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.4 Reporter: Deneche A. Hakim Assignee: Deneche A. Hakim Attachments: mahout-323.patch When building a Decision Forest we should be able to store it somewhere and use it later to classify new datasets -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Welcome Drew Farris
Welcome Drew =D On Fri, Feb 19, 2010 at 5:02 AM, Grant Ingersoll gsing...@apache.org wrote: On Feb 18, 2010, at 8:32 PM, Drew Farris wrote: There's lots more stuff I'd like to get in there; now I only need to figure out how to squeeze 48 hours of consciousness into a day. I believe there is a compression algorithm for that.
Re: Mahout 0.3 Plan and other changes
One important question in my mind here is how does this affect 0.20-based jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and Deneche is also maintaining two versions, it seems. I will check the AbstractJob and see. Although I maintain two versions of Decision Forests, one with the old API and one with the new, the differences between the two APIs are significant enough that I can't just keep working on both versions. Thus all the new stuff is being committed using the new API, and as far as I can tell it seems to work great. On Thu, Feb 4, 2010 at 4:48 PM, Robin Anil robin.a...@gmail.com wrote: On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen sro...@gmail.com wrote: On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil robin.a...@gmail.com wrote: 3rd thing: I am planning to convert the launcher code to implement ToolRunner. Anyone volunteer to help me with that? I had wished to begin standardizing how we write these jobs, yes. If you look at AbstractJob, you'll see how I've unified my three jobs and how I'm trying to structure them. It implements Tool and is run via ToolRunner, so all that is already taken care of. I think some standardization is really useful, to solve problems like this and others, and I'll offer this as a 'draft' for further work. No real point in continuing to solve these things individually. One important question in my mind here is how does this affect 0.20-based jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and Deneche is also maintaining two versions, it seems. I will check the AbstractJob and see. 5th, the release: fix a date for the 0.3 release? We should look to improve quality in this release, i.e., in terms of running the parts of the code that each of us hasn't tested (for example, I have run bayes and fp growth many times, so I will focus on running the clustering algorithms and trying out various options to see if there is any issue) and provide feedback so that the one who wrote it can help tweak it? Maybe, maybe not. There are always 100 things that could be worked on, and that will never change -- it'll never be 'done'. The question of a release, at this point, is more like: has enough time elapsed / has enough progress been made to warrant a new point release? I think we are at that point now. The question is not what big things can we do -- 'big' is for 0.4 or beyond now -- but what small wins can we get in, or what small changes are necessary to tie up loose ends to make a roughly coherent release. In that sense, no, I'm not sure I'd say things like what you describe should be in for 0.3. I mean we could, but then it's months away, and isn't that just what we call 0.4? Everyone's had a week or two to move towards 0.3, so I believe it's time to begin pushing on these issues, closing them / resolving them / moving to 0.4 by end of week. Then set the wheel in motion first thing next week, since it'll still be some time before everyone's on board.
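For readers following the ToolRunner discussion above, here is a minimal sketch of the Tool/ToolRunner pattern in the 0.20 API. The class name and job wiring are illustrative assumptions, not the actual AbstractJob code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative skeleton; AbstractJob layers its own option parsing on top of this pattern.
public class ExampleDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // generic -D options have already been applied here
    // ... configure and submit the MapReduce job using conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic Hadoop options (-D, -fs, -jt, ...) before calling run()
    int exitCode = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
    System.exit(exitCode);
  }
}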
Re: dependency question: mahout-examples -> watchmaker-swing -> jfreechart -> jcommons?
The only example that actually uses watchmaker-swing is Travelling Salesman, mainly because it was a direct port of an existing watchmaker example. And if I remember correctly, it does not actually use JFreeChart... so I think it's safe to exclude it. On Sat, Jan 30, 2010 at 5:19 AM, Drew Farris drew.far...@gmail.com wrote: I spent some time looking at the licenses for the dependencies included in the binary release built as a part of MAHOUT-215, and I'm wondering if anyone knows whether code in mahout-examples uses, directly or indirectly, any of the jfreechart code that is included as a transitive dependency of the watchmaker-swing library. The issue at hand is that jfreechart pulls in something called jcommons, which appears to be licensed under the GPL. It is my understanding that Mahout shouldn't include GPL-licensed dependencies in a binary release. So, if Mahout doesn't use jfreechart in any way via watchmaker-swing, I can set an exclusion for it in the dependency declaration and thus prevent the inclusion of jcommons. Mahout builds and tests complete fine with this exclusion set, but that's not the whole story of course. Drew
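For reference, the exclusion Drew describes would look roughly like this in the examples POM. The watchmaker coordinates and version shown are assumptions for illustration, not copied from the actual pom.xml:

<dependency>
  <groupId>org.uncommons.watchmaker</groupId>
  <artifactId>watchmaker-swing</artifactId>
  <version>0.6.0</version>
  <exclusions>
    <!-- keeps jfreechart (and its GPL-licensed jcommons) out of the binary release -->
    <exclusion>
      <groupId>jfree</groupId>
      <artifactId>jfreechart</artifactId>
    </exclusion>
  </exclusions>
</dependency>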
Re: Unit test failure
Yeah, it's probably due to the way I used to generate random data... the problem is that I never get this error =P so it's very difficult to fix... I'll try my best as soon as I have some time. In the meantime, rerunning 'mvn clean install' generally does the trick. On Sat, Jan 16, 2010 at 6:58 PM, Grant Ingersoll gsing...@apache.org wrote: try rerunning... I think that one has intermittent failures. Perhaps Deneche can dig in. You will likely need to look in the Hadoop logs too. On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote: https://issues.apache.org/jira/browse/MAHOUT-258 The error message: testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) Time elapsed: 6.731 sec ERROR! java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) does not give me much to go on. I don't see how adding new Set classes to my tree could cause this ...
Re: Unit test lag?
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04. I suspect that the problem is not -only- caused by RandomUtils because: 1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but the test time used to be reported accurately by Maven. Now Maven reports that a test took less than a second when it actually took a lot more! 2. Most of my tests actually call RandomUtils.useTestSeed() in setUp() (InMemInputSplitTest included), but the tests still take a lot of time, and again it's not reported accurately by Maven. 3. I generally launch a 'mvn clean install' every Thursday. I never got these slowdowns until last Thursday (did we change anything that could have caused them?) On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies bimargul...@gmail.com wrote: Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily, only if the test seed is not in use. The problem, as I see it, is that the uncommons-math package starts initializing a random seed as soon as you touch it, whether you need it or not. RandomUtils can only avoid this by avoiding uncommons-math in unit test mode. -- Ted Dunning, CTO DeepDyve
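For context, the fixed-seed pattern under discussion looks roughly like this in a test. RandomUtils.useTestSeed() and RandomUtils.getRandom() are the real Mahout utilities; the test class itself is an illustrative sketch:

import java.util.Random;
import junit.framework.TestCase;
import org.apache.mahout.common.RandomUtils;

public class ExampleRandomDataTest extends TestCase {

  @Override
  protected void setUp() throws Exception {
    super.setUp();
    RandomUtils.useTestSeed(); // forces a fixed seed so generated data is reproducible
  }

  public void testWithRandomData() {
    Random rng = RandomUtils.getRandom(); // deterministic under the test seed
    int value = rng.nextInt(100);
    // ... generate data from rng and assert on it; failures now reproduce reliably ...
    assertTrue(value >= 0 && value < 100);
  }
}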
Re: Unit test lag?
removing the Maven repository does not solve the problem, nor does a fresh checkout of the trunk. But older revisions don't show any slowdown!!! I tried the following revisions. These old revisions seem OK: r896946 | srowen | 2010-01-07 19:02:41 +0100 (Thu, 07 Jan 2010) | 1 line MAHOUT-238 r897134 | robinanil | 2010-01-08 09:23:22 +0100 (Fri, 08 Jan 2010) | 1 line MAHOUT-221 Missed out two files while checking in FP-Bonsai r897405 | adeneche | 2010-01-09 11:02:49 +0100 (Sat, 09 Jan 2010) | 1 line MAHOUT-216 The slowdowns start at this revision!!! r897440 | srowen | 2010-01-09 13:53:25 +0100 (Sat, 09 Jan 2010) | 1 line Code style adjustments; enabled/fixed TestSamplingIterator On Sun, Jan 17, 2010 at 5:47 AM, deneche abdelhakim adene...@gmail.com wrote: I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04. I suspect that the problem is not -only- caused by RandomUtils because: 1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but the test time used to be reported accurately by Maven. Now Maven reports that a test took less than a second when it actually took a lot more! 2. Most of my tests actually call RandomUtils.useTestSeed() in setUp() (InMemInputSplitTest included), but the tests still take a lot of time, and again it's not reported accurately by Maven. 3. I generally launch a 'mvn clean install' every Thursday. I never got these slowdowns until last Thursday (did we change anything that could have caused them?) On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies bimargul...@gmail.com wrote: Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily, only if the test seed is not in use. The problem, as I see it, is that the uncommons-math package starts initializing a random seed as soon as you touch it, whether you need it or not. RandomUtils can only avoid this by avoiding uncommons-math in unit test mode. -- Ted Dunning, CTO DeepDyve
Re: Welcome Benson Margulies as Mahout Committer
Welcome =D On Wed, Jan 13, 2010 at 10:36 PM, Drew Farris drew.far...@gmail.com wrote: Congratulations Benson. It is wonderful to see your great work in mahout-math (and the future mahout-collections?) come together quickly. On Wed, Jan 13, 2010 at 3:28 PM, Grant Ingersoll gsing...@apache.org wrote: The Lucene PMC is pleased to welcome the addition of Benson Margulies as a committer on Mahout. I hope you'll join me in offering Benson a warm welcome. Benson, Lucene tradition is that new committers provide a little bit of background about who they are, so feel free to step up and do so. Cheers, Grant
Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp
the build is successful, thanks =D On Fri, Jan 8, 2010 at 9:23 AM, Robin Anil robin.a...@gmail.com wrote: Try Now
Re: [jira] Resolved: (MAHOUT-71) Dataset to Matrix Reader
yep :p On Sun, Jan 3, 2010 at 4:41 PM, Sean Owen (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-71. - Resolution: Later Fix Version/s: (was: 0.3) Looks like this is inactive now? Dataset to Matrix Reader Key: MAHOUT-71 URL: https://issues.apache.org/jira/browse/MAHOUT-71 Project: Mahout Issue Type: New Feature Reporter: Deneche A. Hakim Assignee: Deneche A. Hakim Priority: Minor This component should allow the input datasets to be read as Matrix rows. A map-reduce algorithm should handle any dataset in a matrix format, where the columns are the attributes (one of them being the label) and the rows are the data. Working with Hadoop, we'll need to pass the dataset in the mapper's input, so it must be a file (or many files). We'll then need a custom InputFormat to feed the mappers with the data, and here comes the lovely-named row-wise splitting matrix input format. Now we want to be able to work with any given dataset file format (including ARFF and my custom format), and thus the InputFormat needs a decoder that converts the dataset lines into matrix rows. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [math] watch out for Windows
last time I tried, running Hadoop 0.20 on Windows was impossible for me... should we still try to support Windows? I found that installing Ubuntu on Windows using VirtualBox is the easiest way to use Hadoop inside Windows. On Mon, Dec 28, 2009 at 8:47 PM, Benson Margulies bimargul...@gmail.com wrote: Robin, I just established that the new code generator isn't working on Windows at all. I'm in the process of repairing it.
Re: Publish code quality reports on web-site?
I'm not planning to make new changes to 'mapred'; my new code should go to 'mapreduce'. On Thu, Dec 3, 2009 at 3:34 PM, Isabel Drost isa...@apache.org wrote: On Thu Sean Owen sro...@gmail.com wrote: I suggest our current stance be that we use 0.20.x, with the old APIs. When 0.21 comes out and stabilizes, we move. So I suggest keeping these and deleting 'mapred' at that point. Sounds good to me. Isabel
Re: Publish code quality reports on web-site?
df/mapred works with the old Hadoop API; df/mapreduce works with the Hadoop 0.20 API. On Saturday, November 28, 2009, Sean Owen sro...@gmail.com wrote: I'm all for generating and publishing this. The CPD results highlight a question I had: what's up with the amount of duplication between org/apache/mahout/df/mapred and org/apache/mahout/df/mapreduce -- what is the difference supposed to be? PMD is complaining a lot about the foo == false vs !foo style. I prefer the latter too, but we had agreed to use the former, so we could disable this check if possible. Checkstyle: can we set it to allow a 120-character line, and adjust it to consider an indent to be 2 spaces? It's flagging like every line of code right now! On that note, if possible, I would suggest disabling the following FindBugs checks, as they are flagging a lot of stuff that isn't 'wrong', to me. SE_NO_SERIALVERSIONID: I completely disagree with it. serialVersionUID itself is bad practice, in my book. EI_EXPOSE_REP2: it's a fair point but only relevant to security, and we have no such issue. The items it flags are done on purpose for performance, it looks like. SQL_PREPARED_STATEMENT_GENERATED_FROM_NONCONSTANT_STRING / SQL_NONCONSTANT_STRING_PASSED_TO_EXECUTE: it's a good point in general, but I'm the only one writing JDBC code, and there is actually no security issue here. It's a false positive and we could disable this. SE_BAD_FIELD: this one is a little aggressive. It assumes that types not known to be Serializable must not be Serializable, which isn't true. RV_RETURN_VALUE_IGNORED: it's a decent idea but flags a lot of legitimate code. For example it's complaining about ignoring Queue.poll(), which, like a lot of Collection API methods, has a return value that is often legitimately ignored. UWF_FIELD_NOT_INITIALIZED_IN_CONSTRUCTOR: I don't necessarily agree with this one; explicitly setting fields to null and primitives to zero? Tidy, but I'm not used to it. I didn't see anything big flagged, good, but we should all have a look at the results and tweak accordingly. In some cases it had a good small point, or I was indifferent about the approach it was suggesting versus what was in the code, so I changed to comply with the check. On Fri, Nov 27, 2009 at 8:26 PM, Isabel Drost isa...@apache.org wrote: Hello, I just ran several code analysis reports over the Mahout source code. Results are published at http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html It includes several reports on code quality, test coverage, javadocs and the like. When generated regularly, say on Hudson, I think it could be beneficial both for us (for getting a quick impression of where cleanup is most necessary) as well as for potential users. I would like to see a third tab added to our homepage that points to a page containing reports for each of our modules. I would try to clean up the generated site a little before -- we certainly do not need the 'Project information' stuff in there, as most of this is already generated through forrest. In addition I can take care of setting up a Hudson job to recreate the site on a regular schedule. Cheers, Isabel
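If those detectors do get disabled, a FindBugs exclude filter is the usual mechanism. A sketch of one covering the checks listed above (the file name and how it gets wired into the build are assumptions):

<!-- findbugs-exclude.xml: detectors the project has agreed to ignore -->
<FindBugsFilter>
  <Match><Bug pattern="SE_NO_SERIALVERSIONID"/></Match>
  <Match><Bug pattern="EI_EXPOSE_REP2"/></Match>
  <Match><Bug pattern="SQL_PREPARED_STATEMENT_GENERATED_FROM_NONCONSTANT_STRING"/></Match>
  <Match><Bug pattern="SQL_NONCONSTANT_STRING_PASSED_TO_EXECUTE"/></Match>
  <Match><Bug pattern="SE_BAD_FIELD"/></Match>
  <Match><Bug pattern="RV_RETURN_VALUE_IGNORED"/></Match>
  <Match><Bug pattern="UWF_FIELD_NOT_INITIALIZED_IN_CONSTRUCTOR"/></Match>
</FindBugsFilter>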
Re: 0.2 status
please use Decision Forests instead of Random Forests On Thu, Nov 12, 2009 at 9:01 AM, Robin Anil robin.a...@gmail.com wrote: Please edit/add stuff. Robin == Apache Mahout 0.2 has been released and is now available for public download. Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. link Mahout is a machine learning library meant to scale to the size of data we manage today. Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you run popular machine learning methods like clustering, collaborative filtering and classification over terabytes of data on thousands of computers. The complete changelist can be found here: http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278 New Mahout 0.2 features include - Major performance enhancements in Collaborative Filtering, Classification and Clustering - New: Latent Dirichlet Allocation (LDA) implementation for topic modelling - New: Frequent Itemset Mining for mining top-k patterns from a list of transactions - New: Random Forests implementation for Decision Tree classification (In Memory Partial Data) - New: HBase storage support for Naive Bayes model building and classification - New: Generation of vectors from text documents for use with Mahout algorithms - Performance improvements in various Vector implementations - Tons of bug fixes and code cleanup On Thu, Nov 12, 2009 at 9:06 AM, Grant Ingersoll gsing...@apache.org wrote: Anyone care to write up a release announcement? Here's Solr's: http://lucene.grantingersoll.com/2009/11/10/apache-solr-1-4-0-offically-released/ I've cleaned up the build quite a bit and am now testing preparing the artifacts w/ the much simpler build (no more installing third-party libs; they are all up under o.a.mahout in the Maven repo). I'd like to have everything ready to go once the artifacts are put up for a vote. Thanks, Grant
Re: [jira] Commented: (MAHOUT-184) Code tweaks for .df.* code
Sure. On Fri, Oct 2, 2009 at 8:59 AM, Isabel Drost (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761501#action_12761501 ] Isabel Drost commented on MAHOUT-184: - Looks good to me. Deneche, could you please also have a look at the patch to spot any issues early on? I would prefer using CLI for the job implementation (core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java), but that can be done in a later patch. Code tweaks for .df.* code -- Key: MAHOUT-184 URL: https://issues.apache.org/jira/browse/MAHOUT-184 Project: Mahout Issue Type: Improvement Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 0.2 Attachments: Tweaks_to__df__.patch This follows on my last email to the mailing list, and code inspection. It's big enough I made a patch. No surprises I hope given the consensus on code style and practice. Might be some good takeaways in here, or points for further discussion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
commit rights?
I'm trying to commit [MAHOUT-122 | https://issues.apache.org/jira/browse/MAHOUT-122], but I'm getting the following error: svn: Commit failed (details follow): svn: Server sent unexpected return value (403 Forbidden) in response to MKACTIVITY request for '/repos/asf/!svn/act/de296129-b366-459b-b184-c95f10139e7e' I'm using the following command: svn commit --username adeneche --password *** --message 'MAHOUT-122 Decision Forests Reference Implementation'
Re: commit rights?
Yes! That was it... thanks for the answer; I would have spent 99 years, 3 months and 6 days before finding the problem myself =P On Sun, Sep 27, 2009 at 12:39 PM, Grant Ingersoll gsing...@apache.org wrote: Yeah, you're in the committers list, so I'd check that you are using https. -Grant On Sep 27, 2009, at 6:47 AM, Simon Willnauer wrote: Are you committing to an http or https path? You must check out via https in order to commit; this has been an issue for many new committers. Simon On Sun, Sep 27, 2009 at 8:49 AM, deneche abdelhakim adene...@apache.org wrote: I'm trying to commit [MAHOUT-122 | https://issues.apache.org/jira/browse/MAHOUT-122], but I'm getting the following error: svn: Commit failed (details follow): svn: Server sent unexpected return value (403 Forbidden) in response to MKACTIVITY request for '/repos/asf/!svn/act/de296129-b366-459b-b184-c95f10139e7e' I'm using the following command: svn commit --username adeneche --password *** --message 'MAHOUT-122 Decision Forests Reference Implementation' -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
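For anyone else hitting this: an existing http working copy can be switched to https in place with the standard svn relocate command, run from the working copy root (the path below assumes the Mahout tree of the day; adjust to your checkout URL):

svn switch --relocate http://svn.apache.org/repos/asf/lucene/mahout https://svn.apache.org/repos/asf/lucene/mahout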
Re: svn commit: r816569 - in /lucene/mahout/trunk/examples/src: main/java/org/apache/mahout/classifier/bayes/ main/java/org/apache/mahout/clustering/meanshift/ main/java/org/apache/mahout/clustering
yes, it's meant to be run twice: one time selecting the training samples and the next time the testing samples. It assumes that the RNG will return exactly the same numbers both times. On Mon, Sep 21, 2009 at 1:54 PM, Sean Owen sro...@gmail.com wrote: I rolled it back. So the reader depends on the seed and the exact behavior of the RNG? I have no doubt it is needed if intended, just checking that it's intended. (I also fixed build-reuters.sh) On Sun, Sep 20, 2009 at 1:55 PM, Sean Owen sro...@gmail.com wrote: Sorry, I will investigate when back at my workstation. I remember something like this but thought I preserved the seed. Guess I missed something. My bad, I try not to ever change semantics.
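A small sketch of the pattern Deneche describes: replaying the same seed twice yields the same selection sequence, so the training pass and the testing pass partition the data into complementary subsets. This is illustrative code only, not the actual DatasetSplit implementation:

import java.util.Arrays;
import java.util.Random;

public class SeedReplaySketch {

  // Same seed => same boolean sequence on every call.
  static boolean[] selectTraining(long seed, int n, double trainingFraction) {
    Random rng = new Random(seed);
    boolean[] inTraining = new boolean[n];
    for (int i = 0; i < n; i++) {
      inTraining[i] = rng.nextDouble() < trainingFraction;
    }
    return inTraining;
  }

  public static void main(String[] args) {
    boolean[] firstPass = selectTraining(42L, 10, 0.7);  // training run
    boolean[] secondPass = selectTraining(42L, 10, 0.7); // testing run keeps the complement
    System.out.println(Arrays.equals(firstPass, secondPass)); // true: identical selection
  }
}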
Re: svn commit: r816569 - in /lucene/mahout/trunk/examples/src: main/java/org/apache/mahout/classifier/bayes/ main/java/org/apache/mahout/clustering/meanshift/ main/java/org/apache/mahout/clustering
The change in examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/hadoop/DatasetSplit.java could lead to a bug. The problem is in the following modification:
- rng = new MersenneTwisterRNG(split.getSeed());
+ rng = RandomUtils.getRandom();
rng is supposed to use the seed given by split. I tried to correct this line myself, but I'm having problems committing the change. I'm getting the following message from svn: svn: Commit failed (details follow): svn: Server sent unexpected return value (403 Forbidden) in response to MKACTIVITY request for '/repos/asf/!svn/act/627fc1d8-98ad-4046-ae77-41962e731928' although I successfully committed my changes to the site. On Fri, Sep 18, 2009 at 11:01 AM, sro...@apache.org wrote:
Author: srowen
Date: Fri Sep 18 10:01:12 2009
New Revision: 816569
URL: http://svn.apache.org/viewvc?rev=816569&view=rev
Log: Bit of cleanup and, I think, a fix to the WikipediaDatasetCreatorMapper?
Modified:
lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/meanshift/DisplayMeanShift.java
lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputMapper.java
lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/CDRule.java
lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/hadoop/DatasetSplit.java
lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/CDCrossoverTest.java
lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/hadoop/CDMapperTest.java
lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosToolTest.java
Modified: lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java?rev=816569&r1=816568&r2=816569&view=diff
==============================================================================
--- lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java (original)
+++ lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java Fri Sep 18 10:01:12 2009
@@ -42,12 +42,15 @@
 public class WikipediaDatasetCreatorMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
-  private static final Logger log = LoggerFactory.getLogger(WikipediaDatasetCreatorMapper.class);
-  private static Set<String> inputCategories = null;
-  private static boolean exactMatchOnly = false;
-  private static Analyzer analyzer;
+  private static final Logger log = LoggerFactory.getLogger(WikipediaDatasetCreatorMapper.class);
   private static final Pattern SPACE_NON_ALPHA_PATTERN = Pattern.compile("[\\s\\W]");
+  private static final Pattern OPEN_TEXT_TAG_PATTERN = Pattern.compile("<text xml:space=\"preserve\">");
+  private static final Pattern CLOSE_TEXT_TAG_PATTERN = Pattern.compile("</text>");
+
+  private Set<String> inputCategories = null;
+  private boolean exactMatchOnly = false;
+  private Analyzer analyzer;

   @Override
   public void map(LongWritable key, Text value,
@@ -59,7 +62,7 @@
     String catMatch = findMatchingCategory(document);
     if(!catMatch.equals("Unknown")){
-      document = StringEscapeUtils.unescapeHtml(document.replaceFirst("<text xml:space=\"preserve\">", "").replaceAll("</text>", ""));
+      document = StringEscapeUtils.unescapeHtml(CLOSE_TEXT_TAG_PATTERN.matcher(OPEN_TEXT_TAG_PATTERN.matcher(document).replaceFirst("")).replaceAll(""));
       TokenStream stream = analyzer.tokenStream(catMatch, new StringReader(document));
       Token token = new Token();
       while((token = stream.next(token)) != null){
@@ -69,18 +72,19 @@
     }
   }
-  public static String findMatchingCategory(String document){
+  private String findMatchingCategory(String document){
     int startIndex = 0;
     int categoryIndex;
-    String match = null; // TODO this is never updated?
     while((categoryIndex = document.indexOf("[[Category:", startIndex))!=-1) {
       categoryIndex+=11;
       int endIndex = document.indexOf("]]", categoryIndex);
-      if(endIndex>=document.length() || endIndex<0) break;
+      if (endIndex >= document.length() || endIndex < 0) {
+        break;
+      }
       String category = document.substring(categoryIndex, endIndex).toLowerCase().trim();
       //categories.add(category.toLowerCase());
-      if (exactMatchOnly == true && inputCategories.contains(category)){
+      if (exactMatchOnly && inputCategories.contains(category)){
         return category;
       } else if (exactMatchOnly == false){
         for (String
Re: Updating the Web site
forrest is installed in my home directory :( --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 2:14 PM Hmm, make sure you have proper permissions to write on the Forrest install. I believe Forrest downloads stuff to its directories. I recall seeing similar things. Very annoying. On Sep 15, 2009, at 7:12 AM, deneche abdelhakim wrote: I'm already using Java 1.5! --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 12:54 PM Forrest has a bug w/ JDK 1.6; just switch to 1.5 for it and it should work. On Sep 15, 2009, at 6:24 AM, deneche abdelhakim wrote: I followed the instructions available here: http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html in order to add my name to the committer list =P. When running 'forrest run', I'm getting broken links: X [0] skin/images/current.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/page.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/chapter.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) It also says that 'Your site would still be generated, but some pages would be broken.' svn status shows me that I only modified src/documentation/content/xdocs/whoweare.xml. Can I proceed anyway and copy the site to the publish directory?
Re: Updating the Web site
and js files to site ... Copying 12 files to /home/hakim/mahout/site/build/site/skin Copying 5 files to /home/hakim/mahout/site/build/site/skin Finished copying the non-generated resources. Now Cocoon will generate the rest. Static site will be generated at: /home/hakim/mahout/site/build/site Cocoon will report the status of each document: - in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ). cocoon 2.2.0-dev Copyright (c) 1999-2005 Apache Software Foundation. All rights reserved. Build: December 8 2005 (TargetVM=1.4, SourceVM=1.4, Debug=on, Optimize=on) * [1/20][20/20] 5.635s 8.3Kb linkmap.html * [2/20][1/19] 1.282s 6.9Kb releases.html * [3/21][2/22] 1.022s 16.3Kb index.html * [4/21][1/19] 0.509s 7.2Kb developer-resources.html * [5/20][0/0] 2.717s 2.3Kb linkmap.pdf * [7/18][0/0] 0.154s 4.2Kb skin/profile.css * [8/17][0/0] 2.909s 348b skin/images/rc-b-l-15-1body-2menu-3menu.png * [11/16] [2/20] 0.461s 30.3Kb taste.html * [13/14] [0/0] 0.856s 32.9Kb taste.pdf * [14/13] [0/0] 22.791s 33.9Kb index.pdf * [18/9][0/0] 0.077s 5.1Kb developer-resources.pdf * [19/8][0/0] 0.09s 4.4Kb releases.pdf * [20/8][1/19] 0.327s 9.7Kb mailinglists.html * [21/7][0/0] 0.259s 5.5Kb mailinglists.pdf * [22/6][0/0] 0.511s 2.9Kb skin/basic.css * [23/6][1/19] 0.326s 7.2Kb whoweare.html * [24/5][0/0] 0.103s 4.1Kb whoweare.pdf * [26/4][1/19] 0.322s 6.7Kb systemrequirements.html * [27/3][0/0] 0.079s 3.3Kb systemrequirements.pdf * [28/15] [13/13] 0.143s 12.4Kb skin/screen.css * [29/14] [0/0] 0.035s 390b skin/images/rc-t-r-15-1body-2menu-3menu.png X [0] skin/images/current.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) * [31/12] [0/0] 0.052s 214b skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png * [32/11] [0/0] 0.018s 200b skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png X [0] skin/images/page.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) * [34/9][0/0] 0.019s 209b skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png * [35/8][0/0] 0.022s 214b skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png X [0] skin/images/chapter.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) * [37/6][0/0] 0.029s 199b skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png * [38/5][0/0] 0.055s 215b skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png * [40/3][0/0] 0.049s 319b skin/images/rc-b-r-15-1body-2menu-3menu.png * [41/2][0/0] 0.018s 199b skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png * [42/1][0/0] 0.025s 1.2Kb skin/print.css Total time: 0 minutes 44 seconds, Site size: 213,417 Site pages: 30 Java Result: 1 Copying broken links file to site root. Copying 1 file to /home/hakim/mahout/site/build/site BUILD FAILED /home/hakim/apache-forrest-0.8/main/targets/site.xml:180: Error building site. There appears to be a problem with your site build. Read the output above: * Cocoon will report the status of each document: - in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ). * Even if only one link is broken, you will still get failed. * Your site would still be generated, but some pages would be broken. - See /home/hakim/mahout/site/build/site/broken-links.xml Total time: 1 minute 5 seconds *** --- On Wed 9/16/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Wednesday, September 16, 2009, 3:35 PM What's the full log say?
On Sep 16, 2009, at 7:15 AM, deneche abdelhakim wrote: forrest is installed in my home directory :( --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 2:14 PM Hmm, make sure you have proper permissions to write on the Forrest install. I believe Forrest downloads stuff to its directories. I recall seeing similar things. Very annoying. On Sep 15, 2009, at 7:12 AM, deneche abdelhakim wrote: I'm already using Java 1.5! --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Tuesday
Re: Updating the Web site
it's working =D --- On Wed 9/16/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Wednesday, September 16, 2009, 4:08 PM svn up and try again On Sep 16, 2009, at 10:00 AM, Grant Ingersoll wrote: Now when I did a forrest clean I get the same error. On Sep 16, 2009, at 9:44 AM, deneche abdelhakim wrote: 'forrest site' gives me: ** Apache Forrest. Run 'forrest -projecthelp' to list options Buildfile: /home/hakim/apache-forrest-0.8/main/forrest.build.xml check-java-version: This is apache-forrest-0.8 Using Java 1.5 from /usr/lib/jvm/java-1.5.0-sun-1.5.0.19/jre init-props: echo-settings: check-skin: init-proxy: fetch-skins-descriptors: fetch-skin: unpack-skins: init-skins: fetch-plugins-descriptors: Fetching plugins descriptor: http://forrest.apache.org/plugins/plugins.xml Getting: http://forrest.apache.org/plugins/plugins.xml To: /home/hakim/mahout/site/build/tmp/plugins-1.xml local file date : Wed Dec 03 01:37:14 CET 2008 Not modified - so not downloaded Fetching plugins descriptor: http://forrest.apache.org/plugins/whiteboard-plugins.xml Getting: http://forrest.apache.org/plugins/whiteboard-plugins.xml To: /home/hakim/mahout/site/build/tmp/plugins-2.xml local file date : Thu Jan 15 04:07:07 CET 2009 Not modified - so not downloaded Plugin list loaded from http://forrest.apache.org/plugins/plugins.xml. Plugin list loaded from http://forrest.apache.org/plugins/whiteboard-plugins.xml. init-plugins: Copying 1 file to /home/hakim/mahout/site/build/tmp Copying 1 file to /home/hakim/mahout/site/build/tmp Copying 1 file to /home/hakim/mahout/site/build/tmp Copying 1 file to /home/hakim/mahout/site/build/tmp Copying 1 file to /home/hakim/mahout/site/build/tmp -- Installing plugin: org.apache.forrest.plugin.output.pdf -- check-plugin: org.apache.forrest.plugin.output.pdf is available in the build dir. Trying to update it... init-props: echo-settings: init-proxy: fetch-plugins-descriptors: fetch-plugin: Trying to find the description of org.apache.forrest.plugin.output.pdf in the different descriptor files Using the descriptor file /home/hakim/mahout/site/build/tmp/plugins-1.xml... Processing /home/hakim/mahout/site/build/tmp/plugins-1.xml to /home/hakim/mahout/site/build/tmp/pluginlist2fetchbuild.xml Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginlist2fetch.xsl fetch-local-unversioned-plugin: get-local: Trying to locally get org.apache.forrest.plugin.output.pdf Looking in local /home/hakim/apache-forrest-0.8/plugins Found ! init-build-compiler: echo-init: init: compile: jar: local-deploy: Locally deploying org.apache.forrest.plugin.output.pdf build: Plugin org.apache.forrest.plugin.output.pdf deployed ! Ready to configure fetch-remote-unversioned-plugin-version-forrest: fetch-remote-unversioned-plugin-unversion-forrest: has-been-downloaded: downloaded-message: uptodate-message: not-found-message: Fetch-plugin Ok, installing !
unpack-plugin: install-plugin: configure-plugin: configure-output-plugin: Mounting output plugin: org.apache.forrest.plugin.output.pdf Processing /home/hakim/mahout/site/build/tmp/output.xmap to /home/hakim/mahout/site/build/tmp/output.xmap.new Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginMountSnippet.xsl Moving 1 file to /home/hakim/mahout/site/build/tmp configure-plugin-locationmap: Mounting plugin locationmap for org.apache.forrest.plugin.output.pdf Processing /home/hakim/mahout/site/build/tmp/locationmap.xml to /home/hakim/mahout/site/build/tmp/locationmap.xml.new Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginLmMountSnippet.xsl Moving 1 file to /home/hakim/mahout/site/build/tmp init: -prepare-classpath: check-contentdir: examine-proj: validation-props: validate-xdocs: 8 file(s) have been successfully validated. ...validated xdocs validate-skinconf: 1 file(s) have been successfully validated. ...validated skinconf validate-sitemap: ...validated project sitemap validate-skins-stylesheets: ...validated skin stylesheets validate-skins: validate-skinchoice: ...validated existence of skin 'lucene' validate-stylesheets: validate: site: Copying the various non-generated resources to site. Warnings will be issued if the optional project resources are not found. This is often the case, because
Re: Welcome the newest Mahouts!
Got my Apache account yesterday 8D Being a coder, I always find it difficult to write things other than code =P, so my biography will probably be weird: I am an Algerian PhD student, expecting to use machine learning algorithms (probably evolutionary computing) and distributed computing (Mahout? maybe). During my master's I worked on Artificial Immune Systems applied to pattern recognition. I like coding, mainly in Java, but also in C# (although being at pro-noob level in C#). Over the past two years I learned a lot with Mahout's community, and I'm looking forward to learning much more. --- On Wed 8/26/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Welcome the newest Mahouts! To: mahout-u...@lucene.apache.org, Mahout Dev List mahout-dev@lucene.apache.org Date: Wednesday, August 26, 2009, 4:57 PM I am pleased to announce that the Lucene PMC has voted to add Deneche Abdelhakim, Robin Anil and David Hall as Mahout committers. Deneche, Robin and David have all made significant contributions to Mahout in regards to classification, clustering, evolutionary programming and general usage and utilities. Furthermore, all three are or have been pursuing studies in machine learning at University, so we look for more great things as well! I hope you will join me in extending them a warm welcome. I know I look forward to working with them and continuing to build on Mahout's capabilities on our way to a 1.0 release. Also, it is customary that each new committer take the time to introduce themselves on the mailing list with a brief bio/background so we can all better get to know you. Finally, if you're interested in knowing more about what's involved in becoming a committer or would simply like to contribute to Mahout, see http://cwiki.apache.org/MAHOUT/howtocontribute.html and http://cwiki.apache.org/MAHOUT/howtobecomeacommitter.html. Congrats to Deneche, Robin and David! -Grant
Updating the Web site
I followed the instructions available here: http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html in order to add my name to the committer list =P. When running 'forrest run', I'm getting broken links: X [0] skin/images/current.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/page.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/chapter.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) It also says that 'Your site would still be generated, but some pages would be broken.' svn status shows me that I only modified src/documentation/content/xdocs/whoweare.xml. Can I proceed anyway and copy the site to the publish directory?
Re: Re: Welcome the newest Mahouts!
Can you tell more about what you will be working on, which problems you are trying to solve? I'm expecting to work on Discrete Tomography, probably reconstruction algorithms. But the final decision isn't mine, so I may end up working on something else =P --- On Tue 9/15/09, Isabel Drost isa...@apache.org wrote: From: Isabel Drost isa...@apache.org Subject: Re: Re: Welcome the newest Mahouts! To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 12:29 PM On Tue, 15 Sep 2009 10:11:56 +0000 (GMT) deneche abdelhakim a_dene...@yahoo.fr wrote: Got my Apache account yesterday 8D Congratulations! And a warm welcome from me of course. I am an Algerian PhD student, expecting to use machine learning algorithms (probably evolutionary computing) and distributed computing (Mahout? maybe). Can you tell more about what you will be working on, which problems you are trying to solve? Over the past two years I learned a lot with Mahout's community, and I'm looking forward to learning much more. Hope you'll enjoy your time here. Isabel
Re: Updating the Web site
I'm already using Java 1.5! --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: Updating the Web site To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 12:54 PM Forrest has a bug w/ JDK 1.6; just switch to 1.5 for it and it should work. On Sep 15, 2009, at 6:24 AM, deneche abdelhakim wrote: I followed the instructions available here: http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html in order to add my name to the committer list =P. When running 'forrest run', I'm getting broken links: X [0] skin/images/current.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/page.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) X [0] skin/images/chapter.gif BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory) It also says that 'Your site would still be generated, but some pages would be broken.' svn status shows me that I only modified src/documentation/content/xdocs/whoweare.xml. Can I proceed anyway and copy the site to the publish directory?
Re: JIRA permission?
Thanks! --- On Tue 9/15/09, Isabel Drost isa...@apache.org wrote: From: Isabel Drost isa...@apache.org Subject: Re: JIRA permission? To: mahout-dev@lucene.apache.org Date: Tuesday, September 15, 2009, 5:23 PM On Tue, 15 Sep 2009 14:52:28 +0000 (GMT) deneche abdelhakim a_dene...@yahoo.fr wrote: now that I'm a committer ( 8D ) I suppose I can assign JIRA issues to myself. Do I need special permission to do that? Because I'm not able to find a way to do it =P I added you as a committer to JIRA. You should be able to assign JIRA issues to yourself now. Isabel
Re: Comprehensive study on Java Memory Optimization
Thanks Robin =D --- On Mon 9/14/09, Robin Anil robin.a...@gmail.com wrote: From: Robin Anil robin.a...@gmail.com Subject: Comprehensive study on Java Memory Optimization To: mahout-dev mahout-dev@lucene.apache.org Date: Monday, September 14, 2009, 9:08 AM Hope it will be useful. Link: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf Robin
Re: [GSOC] Code Submissions
done. --- On Tue 9/8/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: [GSOC] Code Submissions To: Mahout Dev List mahout-dev@lucene.apache.org Date: Tuesday, September 8, 2009, 1:09 PM Hi Robin, David and Deneche, You will need to submit code samples. Please see http://groups.google.com/group/google-summer-of-code-announce/web/how-to-provide-google-with-sample-code -Grant
Re: build failure
just got the same error; nuking .m2 AND installing Maven 2.2.1 solved the problem --- On Tue 8/25/09, Ted Dunning ted.dunn...@gmail.com wrote: From: Ted Dunning ted.dunn...@gmail.com Subject: Re: build failure To: mahout-dev@lucene.apache.org, isa...@apache.org Date: Tuesday, August 25, 2009, 12:58 AM Tried the -U solution. No joy. I will try nuking .m2 next. On Mon, Aug 24, 2009 at 3:33 PM, Isabel Drost isa...@apache.org wrote: On Sunday 23 August 2009 16:24:13 Grant Ingersoll wrote: Try deleting your ~/.m2/repository. It should be sufficient to delete the resources-plugin in the repo only, or maybe running Maven with -U enabled already helps? Isabel -- Ted Dunning, CTO DeepDyve
class not found bug?
I recently moved some of the Decision Forest examples from the core project to the examples project. While in core they worked perfectly in Hadoop 0.19.1 (pseudo-distributed), but now they don't!!! For example, running my org.apache.mahout.df.BuildForest gives the following exception: 09/08/17 12:02:36 INFO mapred.JobClient: Running job: job_200908171136_0020 09/08/17 12:02:37 INFO mapred.JobClient: map 0% reduce 0% 09/08/17 12:02:43 INFO mapred.JobClient: Task Id : attempt_200908171136_0020_m_00_0, Status : FAILED java.lang.NoClassDefFoundError: com/thoughtworks/xstream/XStream at org.apache.mahout.utils.StringUtils.<clinit>(StringUtils.java:28) at org.apache.mahout.df.mapred.Builder.getTreeBuilder(Builder.java:117) at org.apache.mahout.df.mapred.MapredMapper.configure(MapredMapper.java:74) at org.apache.mahout.df.mapred.partial.Step1Mapper.configure(Step1Mapper.java:75) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) ... So I tried instead to run one of Mahout's own examples: following the wiki, kmeans gives me the following error: ... 09/08/17 11:59:27 INFO kmeans.KMeansDriver: Iteration 4 ... 09/08/17 11:59:43 INFO kmeans.KMeansDriver: Clustering 09/08/17 11:59:43 INFO kmeans.KMeansDriver: Running Clustering 09/08/17 11:59:43 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/clusters-4 Out: output/points Distance: org.apache.mahout.utils.EuclideanDistanceMeasure 09/08/17 11:59:43 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: org.apache.mahout.matrix.SparseVector 09/08/17 11:59:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 09/08/17 11:59:43 INFO mapred.FileInputFormat: Total input paths to process : 2 09/08/17 11:59:43 INFO mapred.JobClient: Running job: job_200908171136_0019 09/08/17 11:59:44 INFO mapred.JobClient: map 0% reduce 0% 09/08/17 11:59:54 INFO mapred.JobClient: Task Id : attempt_200908171136_0019_m_00_0, Status : FAILED java.lang.NoClassDefFoundError: com/google/gson/reflect/TypeToken at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:637) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:277) at java.net.URLClassLoader.access$000(URLClassLoader.java:73) at java.net.URLClassLoader$1.run(URLClassLoader.java:212) at java.security.AccessController.doPrivileged(Native Method) ... The problem seems related to the fact that mahout-core.jar is being packed inside examples.jar. So I modified maven/build.xml to pack the core classes instead (because they are available):
Index: maven/build.xml
===================================================================
--- maven/build.xml (revision 804891)
+++ maven/build.xml (working copy)
@@ -45,9 +45,9 @@
          includes="**/*.jar"/>
     <zipfileset dir="${core-lib}" prefix="lib"
          includes="**/*.jar" excludes="hadoop-*.jar"/>
-    <zipfileset dir="../core/target/" prefix="lib" includes="apache-mahout-core-${version}.jar"/>
+    <zipfileset dir="../core/target/classes"/>
     <zipfileset dir="${dest}/dependency" prefix="lib"
-         includes="**/*.jar"/>
+         includes="**/*.jar" excludes="apache-mahout-core-${version}.jar"/>
     <zipfileset dir="../core/target/dependency" prefix="lib"
          includes="**/*.jar"/>
   </jar>
This seems to solve the problem, but I didn't try it on all examples.
Re: Error building Mahout
I'm getting it too when building from the base directory - Original Message From: Robin Anil robin.a...@gmail.com To: mahout-dev mahout-dev@lucene.apache.org Sent: Wednesday, July 22, 2009, 7:15:38 PM Subject: Error building Mahout I am getting this error on building Mahout: mvn clean install -e Take a look at the debug output. Since I am not very clear about how Maven plugins work, I would appreciate some insight into the same. I believe copy-resources is the stage where the jar files get copied to the target folder. Robin Console dump below [INFO] Building jar: /home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar [INFO] [install:install] [INFO] Installing /home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar to /home/robin/.m2/repository/org/apache/mahout/mahout-buildtools/0.2-SNAPSHOT/mahout-buildtools-0.2-SNAPSHOT.jar [INFO] [INFO] Building Mahout Common Maven Parent [INFO] task-segment: [clean, install] [INFO] [INFO] [clean:clean] [INFO] [site:attach-descriptor] [INFO] [install:install] [INFO] Installing /home/robin/lucene/trunk/maven/pom.xml to /home/robin/.m2/repository/org/apache/mahout/mahout-parent/0.2-SNAPSHOT/mahout-parent-0.2-SNAPSHOT.pom [INFO] [INFO] Building Mahout core [INFO] task-segment: [clean, install] [INFO] [INFO] [ERROR] BUILD ERROR [INFO] [INFO] 'copy-resources' was specified in an execution, but not found in the plugin [INFO] [INFO] Trace org.apache.maven.lifecycle.LifecycleExecutionException: 'copy-resources' was specified in an execution, but not found in the plugin at org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindExecutionToLifecycle(DefaultLifecycleExecutor.java:1359) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindPluginToLifecycle(DefaultLifecycleExecutor.java:1260) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.constructLifecycleMappings(DefaultLifecycleExecutor.java:1004) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:477) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:330) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:291) at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:142) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:336) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:129) at org.apache.maven.cli.MavenCli.main(MavenCli.java:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315) at org.codehaus.classworlds.Launcher.launch(Launcher.java:255) at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430) at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Re: Error building Mahout
Maven 2.1.0. Deleting the local repository solves the problem; I just hope I won't have to do it often.

----- Original Message -----
From: Grant Ingersoll gsing...@apache.org
To: mahout-dev@lucene.apache.org
Sent: Wednesday, 22 July 2009, 19:42:04
Subject: Re: Error building Mahout

What version of Mvn? Whenever I'm in doubt about Mvn, I delete the local repository (/home/robin/.m2/repository).

On Jul 22, 2009, at 2:15 PM, Robin Anil wrote: I am getting this error on building Mahout with mvn clean install -e ...

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
Actually, I'm not using any reducer at all; the output of the mappers is collected and handled by the main program after the end of the job. Running the job with 10 map tasks on a 10-instance (c1.medium) cluster takes 0h 11m 39s 209; speculative execution is on, so 12 map tasks were launched. Running the same job with 5x10 map tasks takes 0h 11m 54s 962 (59 map tasks were launched). And running the same job again with 5x10 map tasks and the job parameter mapred.job.reuse.jvm.num.tasks=-1 (no limit on how many tasks to run per JVM) takes 0h 11m 57s 115.

--- On Sat, 18.7.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
To: mahout-dev@lucene.apache.org
Date: Saturday, 18 July 2009, 20:36

This is interesting. Is the reduce trivial here? (If so, then shuffling isn't the problem, and you may have demonstrated this with your no-output version.) What happens if you increase the number of maps to 5x the number of nodes?

On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA) j...@apache.org wrote: It looks like building a single tree in a sequential manner is 2x faster than building the same tree with the cluster! I don't have a lot of experience with clusters; is this normal? Maybe 10 instances is just too small to get a good speedup, or maybe there is a bug hiding somewhere (I can hear it walking in the code when the moon...)

--
Ted Dunning, CTO
DeepDyve
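For anyone wanting to reproduce the JVM-reuse experiment, a sketch of how those settings could be applied with the old mapred API (the class name and paths are made up for illustration; the mapper setup is elided):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ForestJobLauncher {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ForestJobLauncher.class);
    conf.setJobName("in-memory-forest");
    // -1 means no limit on how many tasks a task JVM may run in sequence.
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
    // Hint only: ask for 5x10 = 50 map tasks (the framework may adjust).
    conf.setNumMapTasks(50);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}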
Re: problems downloading lucene-analyzers
Thanks Robin for the hint about squid.

"I'd be happy to lock down a specific snapshot (say last night's), but I don't know the Maven syntax to do that. If you can find out how, let me know and I'll happily commit it."

I've searched on Google and it doesn't seem to be possible to lock a specific version of a snapshot: http://stackoverflow.com/questions/986040/maven-attempts-to-use-wrong-snapshot-version

Finally, I've been able to download the snapshots; from now on I'll just use the -o parameter to stay offline.

--- On Tue, 30.6.09, Grant Ingersoll gsing...@apache.org wrote:
From: Grant Ingersoll gsing...@apache.org
Subject: Re: problems downloading lucene-analyzers
To: mahout-dev@lucene.apache.org
Date: Tuesday, 30 June 2009, 15:20

FWIW, it works for me.

On Jun 30, 2009, at 6:54 AM, deneche abdelhakim wrote: I'm having problems with the lucene-analyzers (2.9-SNAPSHOT) dependency ...

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
problems downloading lucene-analyzers
I'm having problems with the lucene-analyzers (2.9-SNAPSHOT) dependency: because it's a snapshot, mvn install downloads a new version every day, and most of the time I get checksum failures! Is anybody else having the same problem? mvn -version: Maven version: 2.0.9 Java version: 1.6.0_0 OS name: linux version: 2.6.28-13-generic arch: i386 Family: unix
[GSOC] Accepted Students
Hi, =D I've been accepted. And I'll be working on Random Forests =P Given it's my second participation, I have one piece of advice: don't be shy to ask about anything related to your project on this list (starting from now); it's the fastest way to learn about Mahout. Who else has been accepted? - abdelhakim
Re: [GSOC] Accepted Students
Hi David, Welcome to Mahout =) The How To Contribute wiki page is a must-read; it gives you a quick overview of everything you'll need when contributing to Mahout. In my own experience you'll also need to:
* know how to build the latest version of Mahout: http://cwiki.apache.org/MAHOUT/buildingmahout.html (although, depending on your project, you may skip the Taste Web part if you're not working with Taste)
* know how to run an example on Hadoop, at least in pseudo-distributed mode: http://hadoop.apache.org/core/docs/current/quickstart.html

--- On Tue, 21.4.09, David Hall d...@cs.stanford.edu wrote:
From: David Hall d...@cs.stanford.edu
Subject: Re: [GSOC] Accepted Students
To: mahout-dev@lucene.apache.org
Date: Tuesday, 21 April 2009, 8:30

On Mon, Apr 20, 2009 at 11:18 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Hi, =D I've been accepted. ...

I'm here. I'll be working on Latent Dirichlet Allocation. As for questions, what am I supposed to be reading during this community building period? I see:
* http://cwiki.apache.org/MAHOUT/howtocontribute.html
* http://www.apache.org/foundation/how-it-works.html
plus skimming javadocs. Other suggestions? Either general, or more specific to my project?

-- David

- abdelhakim
Re: [gsoc] random forests
Here is a draft of my proposal.

**

Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
Student: AbdelHakim Deneche
Student e-mail: ...
Student Major: PhD in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011
Organization: The Apache Software Foundation
Assigned Mentor:

Abstract: My goal is to add the power of random/regression forests to Mahout. At the end of this summer, one should be able to build random/regression forests for large, possibly distributed, datasets, store the forest, and reuse it to classify new data. In addition, a demo on EC2 is planned.

Detailed Description: This project is all about random/regression forests. The core component is the tree building algorithm, which works from a random bootstrap of the whole dataset. I already wrote a detailed description on the Mahout wiki [RandomForests]. Given the size of the dataset, two distributed implementations are possible:

1. The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Basically, each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is, relatively, easy because each mapper runs the basic building algorithm as it is. It is also of great interest if the user wants to try different parameters when building the forest. An out-of-core implementation is also possible, to deal with datasets that cannot fit into the node's memory.

2. The second implementation, which is the most difficult, is concerned with very large datasets that cannot fit on every machine of the cluster. In this case the mappers work differently: each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest. The core building algorithm must thus be rewritten in map-reduce form. This implementation can deal with datasets of any size, as long as they are on the cluster.

Although the first implementation is easier to implement, the CPU and IO overhead of the out-of-core implementation is still unknown. A reference, non-parallel implementation should thus be built to better understand the effects of the out-of-core implementation, especially for large datasets. This reference implementation is also useful to assess the correctness of the distributed implementation.

Working Plan and list of deliverables

Must-Have:

1. Reference implementation of the Random/Regression Forests building algorithm:
. Build a forest of trees; the basic algorithm (described in the wiki) takes a subset of the dataset as a training set and builds a decision tree. This algorithm is repeated for each tree of the forest.
. The forest is stored in a file; this way it can be re-used, at any time, to classify new cases.
. At this step, the necessary changes to Mahout's Classifier interface are made to extend its use beyond Text datasets.

2. Study the effects of large datasets on the reference implementation:
. This step should guide our choice of the proper parallel implementation.

3. Parallel implementation, choose one of the following:

3a. Parallel implementation A (see the mapper sketch following this message):
. When the dataset can be replicated to all computing nodes.
. Each mapper has access to the whole dataset; if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper is also responsible for computing the out-of-bag error estimation.
. The reducer stores the trees in the RF file, and merges the oob error estimations.

3b. Parallel implementation B:
. When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster.
. Each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest.
. In this case, the basic algorithm must be rewritten to fit the map-reduce paradigm.

Should-Have:

4. Run the Random Forest with a real dataset on EC2:
. This step is important, because running the RF on a local dual-core machine is different from running it on a real cluster with a real dataset.
. This can make a good demo for Mahout.
. Amazon has put up some interesting datasets to play with [PublicDatasets]. The US Census dataset comes in various sizes ranging from 2GB to 200GB, and should make a very good example.
. At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix Reader).

Wanna-Have:

5. If there is still time, implement one or two other important features of RFs, such as variable importance and proximity estimation.

Additional Information: I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. For the purpose of my Master, I worked on
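As referenced in 3a above, a minimal sketch of what implementation A's mapper could look like on the old Hadoop mapred API. Dataset, Tree, and TreeBuilder are hypothetical stand-ins for the core building algorithm, not actual Mahout classes:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of implementation A: the dataset is replicated, so every mapper
// can load all of it and build its own share of the N trees.
public class TreeBuildingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private int treesPerMapper;
  private Dataset data;                  // hypothetical: the full dataset
  private final Random rng = new Random();

  @Override
  public void configure(JobConf conf) {
    treesPerMapper = conf.getInt("rf.trees.per.mapper", 1);
    data = Dataset.load(conf.get("rf.dataset.path")); // hypothetical loader
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    // One input record triggers this mapper's whole share of the forest.
    for (int t = 0; t < treesPerMapper; t++) {
      // Bootstrap sample + tree induction = the core building algorithm.
      Tree tree = TreeBuilder.build(data.bootstrap(rng)); // hypothetical
      out.collect(new IntWritable(t), new Text(tree.toString()));
      reporter.progress();
    }
  }
}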
Re: [gsoc] random forests
Thank you for your answer; it just made me aware of many hidden, possible future problems with my implementation.

"The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets."

By out-of-core, you mean the builder can fetch the data directly from a file instead of working from memory only?

"One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering."

I was planning to distribute the dataset files to all workers using Hadoop's DistributedCache. I think that a streaming implementation is feasible: the basic tree building algorithm (described here: http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream through the data (either in memory or from a file) for each node of the tree. During this pass, it computes the information gain (IG) for the selected variables. This algorithm could be improved to compute the IGs for a list of nodes, thus reducing the total number of passes through the data. When building the forest, the list of nodes comes from all the trees built by the mapper.

"Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets."

I'll have to run some tests before answering this question, but I think the memory usage of the improved algorithm (described above) will mainly be needed to store the IG computations (variable probabilities...). One way to limit the memory usage is to limit the number of tree nodes computed at each data pass. Increasing this limit should reduce the number of data passes but increase the memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on a node's hard drive, and thus must be distributed across the cluster.

abdelHakim

--- On Mon, 30.3.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Monday, 30 March 2009, 0:59

I have two answers for you. The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets. The second answer is that the odds that SOME Mahout application will be too large for a single node are quite high. These aren't contradictory. They just describe the long-tail nature of problem sizes.

One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially.

Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: My question is: when Mahout.RF is used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
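To make the streaming idea concrete, a rough sketch of a single pass that accumulates the statistics for a whole batch of open tree nodes at once; Instance and OpenNode are hypothetical types, and the pass-count/memory trade-off lives in the batch size:

import java.util.List;

// Sketch: one streaming pass computes split statistics for a whole
// batch of open nodes at once, trading memory for fewer passes.
public final class StreamingSplitPass {

  private StreamingSplitPass() {}

  // Hypothetical types: Instance is one data row; OpenNode holds the
  // per-split class counts for one unfinished tree node.
  public static void accumulate(Iterable<Instance> stream, List<OpenNode> batch) {
    for (Instance inst : stream) {          // sequential scan: file or memory
      for (OpenNode node : batch) {
        if (node.reaches(inst)) {           // does inst fall into this node?
          node.countInstance(inst);         // update counts for its m variables
        }
      }
    }
    for (OpenNode node : batch) {
      node.chooseBestSplit();               // compute the IG from the counts
    }
  }
}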
Re: [gsoc] random forests
In 2a, you should read: "This implementation is, relatively, easy given..."

--- On Sat, 28.3.09, deneche abdelhakim a_dene...@yahoo.fr wrote:
From: deneche abdelhakim a_dene...@yahoo.fr
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Saturday, 28 March 2009, 16:14

I'm actually writing my working plan, and it looks like this:

*

1. Reference implementation of the Random/Regression Forests building algorithm:
. Build a forest of trees; the basic algorithm (described in the wiki) takes a subset of the dataset as a training set and builds a decision tree. This basic algorithm is repeated for each tree of the forest.
. The forest is stored in a file; this way it can be used later to classify new cases.

2a. Distributed implementation A:
. When the dataset can be replicated to all computing nodes.
. Each mapper has access to the whole dataset; if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times.
. This implementation is, relatively, given that the reference implementation is available, because each mapper runs the basic building algorithm as it is.

2b. Distributed implementation B:
. When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster.
. Each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest.
. In this case, the basic algorithm must be rewritten to fit the map-reduce paradigm.

3. Run the Random Forest with a real dataset on EC2:
. This step is important, because running the RF on a local dual-core machine is way different from running it on a real cluster with a real dataset.
. This can make for a good demo for Mahout.

4. If there is still time, implement one or two other important features of RFs, such as variable importance and proximity estimation.

*

It is clear from the plan that I won't be able to do all those steps, and in some way I must choose only one implementation (2a or 2b) to do. The first implementation should take less time to implement than 2b, and I'm quite sure I could go up to the 4th step, adding other features to the RF. BUT the second implementation is the only one capable of dealing with very large distributed datasets. My question is: when Mahout.RF is used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose.

--- On Sun, 22.3.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Sunday, 22 March 2009, 0:36

Great expression! You may be right about the nose-bleed tendency between the two methods.

On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: I can't find a no-nose-bleeding algorithm

--
Ted Dunning, CTO
DeepDyve
Re: GSoC 2009-Discussion
Talking about Random Forests, I think there are two possible ways to actually implement them. The first implementation is useful when the dataset is not that big (<= 2GB perhaps) and thus can be distributed via Hadoop's DistributedCache. In this case each mapper has access to the whole dataset and builds a subset of the forest. The second one is intended for large datasets, and by large I mean datasets that cannot fit on every computing node. In this case each mapper processes a subset of the dataset for all the trees. I'm more interested in the second implementation, so maybe Samuel would be interested in the first... but of course the community may need them both :)

--- On Tue, 24.3.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: GSoC 2009-Discussion
To: mahout-dev@lucene.apache.org
Date: Tuesday, 24 March 2009, 0:07

There are other algorithms of serious interest. Bayesian Additive Regression Trees (BART) would make a very interesting complement to Random Forests. I don't know how important it is to get a normal decision tree algorithm going, because the cost to build these is often not that high. Boosted decision trees might be of interest, but probably not as much as BART. It might also be interesting to work with this student to implement some of the diagnostics associated with random forests. There is plenty to do.

----- Original Message -----
From: Samuel Louvan samuel.lou...@gmail.com

My questions:
- I just noticed in the mailing archive that another student is also pretty serious about implementing the random forest algorithm. Should I select decision trees instead? (for my future GSoC proposal)
- Actually, I think it would be interesting if I could combine Apache Nutch and Mahout: the idea is to implement web page segmentation plus a classifier inside a web crawler. By doing this, a crawler, for instance, can use the output of the classification to only follow certain links that lie on informative content parts. Does this sound interesting to you guys?

--
Ted Dunning, CTO
DeepDyve
Re: [gsoc] random forests
Yeah, Breiman states that at each node, m variables are selected at random out of the M. I modified the wiki page: in LearnUnprunedTree(X,Y), which builds the tree iteratively, one node at a time, I added this line before the search for the best split:

select m variables at random out of the M variables

For j = 1 .. m ...

--- On Mon, 16.3.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Monday, 16 March 2009, 7:26

Nice writeup. One thing that I was confused about for a long time is whether the choice of variables to use for splits is made once per tree or again at each split. I think that the latter interpretation is actually the correct one. You should check my thought.

On Sun, Mar 15, 2009 at 1:53 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: I added a page to the wiki that describes how to build a random forest and how to use it to classify new cases.

--
Ted Dunning, CTO
DeepDyve
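The selection step that wiki line describes is small enough to show in full; a plain-Java sketch (no Mahout types) of drawing m distinct attribute indices out of M, which a tree builder would repeat at each node:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class VariableSelection {

  private VariableSelection() {}

  // Returns m distinct attribute indices chosen uniformly at random
  // out of M. Per Breiman, this is called again at *each* node.
  public static List<Integer> select(int m, int M, Random rng) {
    List<Integer> indices = new ArrayList<Integer>(M);
    for (int i = 0; i < M; i++) {
      indices.add(i);
    }
    Collections.shuffle(indices, rng);
    return indices.subList(0, m);
  }
}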
[gsoc] random forests
I added a page to the wiki that describes how to build a random forest and how to use it to classify new cases. http://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests
Re: Mahout for 1.5 JVM
The following classes use the Deque interface, which is not available in Java 1.5:
. org.apache.mahout.classifier.bayes.BayesClassifier
. org.apache.mahout.classifier.cbayes.CBayesClassifier

--- On Mon, 9.3.09, Sean Owen sro...@gmail.com wrote:
From: Sean Owen sro...@gmail.com
Subject: Re: Mahout for 1.5 JVM
To: mahout-dev@lucene.apache.org
Date: Monday, 9 March 2009, 22:17

Yeah, I don't know of anything in my bits that actually uses a Java 6-only class, but I could be proved wrong there. You can dig out my old build.xml file in a pinch to build just this bit; I can write up a quick Ant build for you too for the same purpose. You do need to make sure you compile with Java 6, since I surely use stuff like @Override on methods implementing interface methods, which isn't allowed in Java 5, but which javac in Java 6 can take care of if source is 6 and target is 5.

On Mon, Mar 9, 2009 at 9:13 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hm, yeah, 1.6 because of Hadoop, I forgot about that. I need only the Taste part of Mahout, though, and that one doesn't really need to run on Hadoop. Any way to build just that (for 1.5)?
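If staying on 1.5 matters, one workable substitution (a sketch of the general idea, not what the Mahout code actually did) is java.util.LinkedList, which has supported both ends of the queue since long before Java 6:

import java.util.LinkedList;

// Java 5-compatible stand-in for java.util.Deque (which is Java 6+):
// LinkedList already supports adding and removing at both ends.
public class DequeFree {
  public static void main(String[] args) {
    LinkedList<String> deque = new LinkedList<String>();
    deque.addFirst("head");
    deque.addLast("tail");
    System.out.println(deque.removeFirst()); // prints "head"
    System.out.println(deque.removeLast());  // prints "tail"
  }
}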
Re: Google SoC 2009
I'm seriously considering Random Forests (RF) as my GSoC project; they seem interesting and, judging by how often they have been suggested, they are very useful to Mahout. I found the following discussion: http://markmail.org/message/dancn3n76ken6thb that gives a lot of useful information about RF, and Breiman's web site contains a very clear description of the algorithm and its possible uses. A question though: the most basic use of RF is as a classifier. Does that mean it must implement the org.apache.mahout.common.Classifier interface? I'm not quite sure, but that interface seems dedicated to classifying text documents, whereas RF could be useful for any kind of dataset.

--- On Fri, 27.2.09, Grant Ingersoll gsing...@apache.org wrote:
From: Grant Ingersoll gsing...@apache.org
Subject: Re: Google SoC 2009
To: mahout-dev@lucene.apache.org
Date: Friday, 27 February 2009, 18:34

Priority is in the eye of the beholder in Apache land, so scratch the itch you are most interested in. Ultimately, we're interested in having a suite of ML libraries, but you certainly could do worse than to pick something that has proven to be useful, stable and well-used by lots of people over time. I think several of them have been suggested on another related thread, but things like neural nets, linear regression, random forests, and self-organizing maps are all of interest. -Grant

On Feb 24, 2009, at 12:04 PM, Siddharth Prakash Singh wrote: Hi, No, I don't have any specific interest. I would rather like to work on implementing the algorithm which is of highest priority. Awaiting a response. Siddharth

On Sat, Feb 21, 2009 at 2:43 AM, Isabel Drost isa...@apache.org wrote: On Friday 20 February 2009, Siddharth Prakash Singh wrote: I wish to contribute to Mahout as a Google SoC participant this year. I am interested in implementing a Map/Reduce-enabled machine learning algorithm. Any suggestions please? Welcome Siddharth. Is there anything machine learning specific that interests you in particular? You can also have a look in the Mahout wiki as well as the jira to find out more about which algorithms are already available and which are still missing. Isabel

-- Siddharth Prakash Singh http://www.spsneo.com

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
GSoC 2009 proposition
Hi, I'm planning to participate again in GSoC, and I want to do it, again, with Mahout. This year, let's make Mahout run on Amazon EC2. This means building the proper AMIs, running some Mahout projects (the GA examples) on EC2, giving feedback, and writing simple, clear how-tos about running a Mahout project on EC2. The Mahout.GA examples (TSP and CDGA) should be good real-world scenarios for how one may need to use Mahout.GA on EC2. The TSP example should be modified to run on a console and to load TSPLIB benchmarks, so we can tackle more challenging TSP problems with the help of EC2. The CDGA example should run unmodified given, of course, that Hadoop is configured correctly on EC2 and that the benchmark is on HDFS. These two examples will give us three use cases for Mahout on EC2:
1. TSP can be run on a single, High-CPU EC2 instance. In this case, Watchmaker's ConcurrentEvolutionEngine should take care of the multi-threading part (or at least I hope!) and there will be no need for Hadoop;
2. TSP can also be run over multiple EC2 instances with the help of Hadoop;
3. CDGA not only needs Hadoop to run, but its data should be on HDFS.
So what do you think, is the elephant ready for a walk on EC2?
Re: GSoC 2009 proposition
Thanks for your fast answers :) I'll rethink this and post as soon as I get something.

--- On Thu, 26.2.09, Grant Ingersoll gsing...@apache.org wrote:
From: Grant Ingersoll gsing...@apache.org
Subject: Re: GSoC 2009 proposition
To: mahout-dev@lucene.apache.org
Date: Thursday, 26 February 2009, 16:20

You might have a look at http://www.lucidimagination.com/search/document/5ab9ddafa19ee04b/thought_offering_ec2_s3_based_services#2d096f39b02ec289 for some background thoughts. I think it's a nice idea, and I've been meaning to use my Amazon credits for just such a thing for a while now, but I'm not sure how high priority it is. You might consider extending/altering this thought to have more of a focus on developing demos (including code) of Mahout with real data sets on larger-scale systems. Part of this might involve showing people how to do this on EC2, but the bigger focus to me should be on demoing/documenting Mahout's capabilities, versus showing how to run Mahout on any particular platform.

On Feb 26, 2009, at 9:58 AM, deneche abdelhakim wrote: Hi, I'm planning to participate again in GSoC, and I want to do it, again, with Mahout. ...

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Towards 0.1
About MAHOUT-102 (https://issues.apache.org/jira/browse/MAHOUT-102): the patch is already available, if someone could just commit it. Also, I'm not able to make my patches delete files (or directories) when applied; is it because I'm not a committer or because I'm using TortoiseSVN?

--- On Thu, 29.1.09, Grant Ingersoll gsing...@apache.org wrote:
From: Grant Ingersoll gsing...@apache.org
Subject: Re: Towards 0.1
To: mahout-dev@lucene.apache.org
Date: Thursday, 29 January 2009, 2:41

Feel free to move. I put in a Maven one, but it sounds like it is already fixed. Even building the candidate tonight spawns at least 3 days for voting to take place. I definitely agree on getting 0.1 out.

On Jan 28, 2009, at 1:16 PM, Sean Owen wrote: How about moving them all to 0.2 right now? There was essentially a consensus for this last week. I worry that we have been stuck in the current state for a while and don't want to continue indefinitely. 0.1, as the value implies, can be far from perfect. There is a negative consequence right now to having some publicity but no downloadable release at all. I'd prefer to get that release built tonight (from a kind person with gpg), pass it around for a last look that everything is in there properly, and post it.

On Wed, Jan 28, 2009 at 2:39 PM, Grant Ingersoll gsing...@apache.org wrote: Yep, I'm looking at trying out the stuff. I think we need to go through the unresolved issues for 0.1 and either move them to 0.2 or close them.
Re: Re: @Override annotations
When you say 1.5, do you mean the 1.5 JDK (or JRE, in the case of Eclipse)? I just tried to compile the Mahout trunk in Eclipse using JRE 1.5.0_11 (and the 5.0 compliance level, of course), and got 628 errors in core/src (I didn't check core/test yet). The first 100 errors are as follows:
. 98 errors related to @Override, for example: "The method accept(File) of type BayesFileFormatter.FileProcessor must override a superclass method" (mahout-core/src/org/apache/mahout/classifier BayesFileFormatter.java, line 160)
. 2 errors of the form "Deque cannot be resolved" in org.apache.mahout.classifier.bayes.BayesClassifier and org.apache.mahout.classifier.cbayes.CBayesClassifier

Maybe I'm wrong, but Deque is only available in 1.6, no?

--- On Thu, 22.1.09, Ted Dunning ted.dunn...@gmail.com wrote:
From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: Re: @Override annotations
To: mahout-dev@lucene.apache.org
Date: Thursday, 22 January 2009, 10:05

I think Mahout should compile with both 1.5 and 1.6.

On Wed, Jan 21, 2009 at 11:23 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Last time I tried to compile the Mahout trunk, I got a similar problem. ...

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
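A tiny illustration of the compliance issue being discussed: under -source 1.5, javac rejects @Override on a method that merely implements an interface method; under -source 1.6 it is accepted, and -target 1.5 can still emit 1.5-compatible classes:

// Compiles with javac -source 1.6; fails with -source 1.5, because in
// Java 5 @Override may only mark methods overriding a *class* method.
public class Task implements Runnable {
  @Override
  public void run() {
    System.out.println("running");
  }
}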
Re: @Override annotations
Last time I tried to compile the Mahout trunk, I got a similar problem. In my case, I'm using Eclipse, and the errors were caused by the JDK Compliance Level (in the project properties). In short, I was using a JVM 1.6 JRE but with the 5.0 compliance level (I forgot to change it!). I found the answer at the following link: http://dev.eclipse.org/newslists/news.eclipse.newcomer/msg19329.html

--- On Thu, 22.1.09, Jeff Eastman j...@windwardsolutions.com wrote:
From: Jeff Eastman j...@windwardsolutions.com
Subject: @Override annotations
To: mahout-dev@lucene.apache.org
Date: Thursday, 22 January 2009, 6:07

I'm trying to compile the latest Mahout trunk on my MacBook using the JVM 1.6.0 JRE, and the @Override annotations are causing a lot of errors. There must be a simple solution to this problem, but I cannot recall it. Can somebody help? Jeff
Re: More proposed changes across code
"5. BruteForceTravellingSalesman says copyright Daniel Dwyer -- can this be replaced by the standard copyright header?"

Oops, I thought I changed them all! Yes, you can replace it.
Re: More proposed changes across code
--- On Sun, 19.10.08, Grant Ingersoll [EMAIL PROTECTED] wrote:
From: Grant Ingersoll [EMAIL PROTECTED]
Subject: Re: More proposed changes across code
To: mahout-dev@lucene.apache.org
Date: Sunday, 19 October 2008, 18:30

On Oct 19, 2008, at 11:16 AM, Sean Owen wrote:

On Sun, Oct 19, 2008 at 4:07 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: Doesn't the javadoc tool use @inherit to fill in the inherited docs when viewing?

Yes... I suppose I find that redundant. The subclass method gets documented exactly as the superclass does. It looks like the subclass had been explicitly documented, when it hadn't been. I think its intent is to copy in documentation and add to it; I am thinking only of cases where the javadoc only has a single element, [EMAIL PROTECTED]

"3. UpdatableFloat/Long -- just use Float[1] / Long[1]? These classes don't seem to be used." "Hmmm, they were used, but sure, that works too." I can't find any usages of these classes; where are they?

Right, they aren't used any longer. Feel free to remove.

"5. BruteForceTravellingSalesman says copyright Daniel Dwyer -- can this be replaced by the standard copyright header?" "No, this is in fact his code, licensed under the ASL. I believe the current way we are handling it is correct. The original code is his, and the mods are ours." Roger that, will leave it. But two notes then...
- what about all the other code that came from Watchmaker? All the classes in the package say they came from Watchmaker.
- I was told that for my stuff, yeah, I still own the code/copyright but am licensing a copy to this project, and so it all just gets licensed within Mahout according to the boilerplate which says "Licensed to the ASF...". I'm not a lawyer and don't want to pick nits, but I do want to take extra care to get licensing right.

Right. I believe the difference is you donated your code to the ASF; Daniel has merely published his code under the ASL, but has not donated it to the ASF. It's a subtle distinction, I suppose. Any of the classes that came from Watchmaker should say that, although I know many were developed by Deneche for the Watchmaker API. We can go review them again.

In the case of the travellingSalesman example, I modified the original code to use Mahout when needed. My own modifications are a couple of lines in two or three classes; I included a readme.txt that describes the modified code and links to the original one. I replaced all the copyright headers with the standard one (I forgot BruteForceTravellingSalesman.java), and added a link to the original code in the class comments. I've been reading the Apache License 2.0, and (I'm not a lawyer) if I'm not mistaken, the travellingSalesman code included with Mahout is a Derivative Work of the original code, so we need to:
. point out in the modified files that they have been changed; these files are StrategyPanel.java, TravellingSalesman.java and EvolutionaryTravellingSalesman.java;
. because the Watchmaker library contains a NOTICE.TXT file, Mahout must include a readable copy of the attribution notices contained within Watchmaker's NOTICE file.
Re: Mahout on EC2
OK, in this case it's the main program that has a Swing GUI; the Map-Reduce jobs have no GUIs at all. But yeah, it's always good to separate the GUI code from the logic.

--- On Sun, 21.9.08, Ted Dunning [EMAIL PROTECTED] wrote:
From: Ted Dunning [EMAIL PROTECTED]
Subject: Re: Mahout on EC2
To: mahout-dev@lucene.apache.org
Date: Sunday, 21 September 2008, 23:08

For the master machine that launches the map-reduce computation, you can tunnel an X display from somewhere else to display Swing applications. You will also need to do the separation for the reason that Sean says... you will be running on many machines.

On Sat, Sep 20, 2008 at 2:34 AM, Sean Owen [EMAIL PROTECTED] wrote: I think you can run a program that uses Swing - unless I am wrong, this no longer results in an error when running on a 'headless' machine, for example a box without X11. But no, I don't think there is any way to interact with it, especially considering you might be running on many machines at once. But the same is true of the console - you won't be able to interact with the program that way either. It does sound good, in any event, to separate out the Swing client code from the core logic.

On 9/20/08, deneche abdelhakim [EMAIL PROTECTED] wrote: Sounds cool :) I'll do the TSP part, but it may take some time because I'm a bit busy (PhD administrative stuff). ...

-- ted
Re: Hardcoded paths in examples
"From that perspective, I guess I think it's suboptimal to depend on Hadoop Path objects here in the unit test, since the tests are not actually using Hadoop. That ought to be separated."

But even if the test code is not using Hadoop, it's still calling Hadoop code: Mappers, Reducers and all the happy family :)

"Then you have test code depending on external scripts -- in two places. Which would lead me to the conclusion that it's best, overall, if these tests are self-contained and cause their dependent data to be generated. I am not familiar with this code. Is that easy? infeasible?"

It's feasible... but not easy :(

--- On Mon, 22.9.08, Sean Owen [EMAIL PROTECTED] wrote:
From: Sean Owen [EMAIL PROTECTED]
Subject: Re: Hardcoded paths in examples
To: mahout-dev@lucene.apache.org
Date: Monday, 22 September 2008, 11:47

I don't think that is necessary. I think it is fair to assume that one is running the tests from within the distribution directory and not have to resort to that abstraction. From that perspective, I guess I think it's suboptimal to depend on Hadoop Path objects here in the unit test, since the tests are not actually using Hadoop. That ought to be separated. But that aside, that still leaves the issue of whether one can depend on some build products existing in a test. I don't think it's a bad thing, as long as the Ant script ensures those build products exist. Then the question is, can you express the same dependency in Maven? I think you can? Then you have test code depending on external scripts -- in two places. Which would lead me to the conclusion that it's best, overall, if these tests are self-contained and cause their dependent data to be generated. I am not familiar with this code. Is that easy? infeasible? Sean

On Mon, Sep 22, 2008 at 9:59 AM, Karl Wettin [EMAIL PROTECTED] wrote: Hmm, if this is test/resources, shouldn't they be accessed using getResourceAsStream instead? I'll see what I can do.

On 22 Sep 2008, at 10:15, Sean Owen wrote: Oh OK. Well +1 to using the same path, yes. If it is easier to adapt to Maven's location, OK.

On 9/22/08, deneche abdelhakim [EMAIL PROTECTED] wrote: "Dumb question: why does example code depend on test code? Can this be solved by severing that dependency?" It's not the example code but the examples' test code. In this case the test tries to access a directory (wdbc) put into test/resources. The content of test/resources is automatically copied by Ant into build/test-classes/. That means that the Maven test builds will fail unless the Ant test was first executed. I suppose that's OK, but I'd prefer if we could come up with some fix. I suppose the simplest one would be to use the Maven file paths (target/test-classes). But in that case, the Ant test builds will probably fail unless the Maven test was first executed! Why not use the same file path for both Ant and Maven, or at least copy the content of the resources into a common directory...

--- On Sun, 21.9.08, Sean Owen [EMAIL PROTECTED] wrote: Dumb question: why does example code depend on test code? Can this be solved by severing that dependency?

On 9/21/08, Karl Wettin [EMAIL PROTECTED] wrote: There are a bunch of hardcoded paths in the tests of the examples module. Stuff like this: Path inpath = new Path("build/test-classes/wdbc"); That means that the Maven test builds will fail unless the Ant test was first executed. I suppose that's OK, but I'd prefer if we could come up with some fix. I suppose the simplest one would be to use the Maven file paths (target/test-classes). karl
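Following Karl's getResourceAsStream suggestion, a sketch of how a test could load the wdbc data from the classpath rather than a hardcoded build path (the resource file name here is assumed, not taken from the actual tests):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ResourceLoadingExample {
  public static void main(String[] args) throws IOException {
    // Resolved from the classpath, so it works whether Ant copied the
    // resources to build/test-classes or Maven to target/test-classes.
    InputStream in = ResourceLoadingExample.class
        .getResourceAsStream("/wdbc/wdbc.data"); // hypothetical file name
    if (in == null) {
      throw new IOException("wdbc resource not on classpath");
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine()); // first record of the dataset
    reader.close();
  }
}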
Re: Mahout on EC2
Sounds cool :) I'll do the TSP part, but it may take some time because I'm a bit busy (PhD administrative stuff). There are many large TSP benchmarks available, and it seems that there is a common file format for them, TSPLIB (http://www.informatik.uni-heidelberg.de/groups/comopt/software/TSPLIB95/DOC.PS), so the TSP example should be modified to load those benchmark files. I have a question about EC2: can you run Java Swing programs and see the GUI (because the TSP example has a Swing GUI), or should we make a console version of the example?

--- On Fri, 19.9.08, Grant Ingersoll [EMAIL PROTECTED] wrote:
From: Grant Ingersoll [EMAIL PROTECTED]
Subject: Mahout on EC2
To: mahout-dev@lucene.apache.org
Date: Friday, 19 September 2008, 17:18

Amazon has generously donated some credits, so I plan on putting Mahout up and doing some testing. I was wondering if people had suggestions on things they would like to see from Mahout. For starters, I'm going to put up a public image containing 0.1 when it's ready, but I'd also like to wiki up some examples. I.e., go here, get this data, put it in this format, and then do X. We have some simple examples, but I think it would be cool to show how to do something a bit more complex, like maybe classify web pages according to DMOZ, or cluster on stuff, or maybe put in a large traveling salesman problem using the GA stuff Deneche did. Thoughts? Anyone else interested in setting up some use cases? -Grant
Re: FYI Cloud Computing Resources
I came across the following competition: http://www.netflixprize.com/index It's about recommender systems, so I think it's Taste stuff. The training dataset consists of more than 100M ratings.

----- Original Message -----
From: Josh Myer [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Wednesday, 30 July 2008, 18:19:25
Subject: Re: FYI Cloud Computing Resources

On Wed, Jul 30, 2008 at 11:26:29AM -0400, Grant Ingersoll wrote: http://research.yahoo.com/node/2328 It _MAY_ (stressed, emphasized, etc.) be possible for Mahouters (or are we just Mahouts?) to get some access to these resources. One big question is where we can get some fairly large data sets (large, but not super large, I think, but am not sure). If you have ideas, etc., please let us know.

It's worth plugging theinfo, http://theinfo.org/. It's a project to collect references to datasets, and may help here. Unfortunately, it seems to be laggy at the moment. I'll poke Aaron about that =) HtH,
--
Josh Myer [EMAIL PROTECTED]
Re: Going to move us to Hadoop 0.18.0, Java 6
Go on, I will do my part. I just hope GA likes Java 6 :P

----- Original Message -----
From: Sean Owen [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Saturday, 30 August 2008, 21:26:45
Subject: Re: Going to move us to Hadoop 0.18.0, Java 6

So I should hold off on committing changes that use Java 6? Let me know when you're ready, or if it's going to be difficult to move to 6 for you. I also wasn't totally clear whether the folks doing the 0.1 release want to stay on Java 5 for that or not.

On Tue, Aug 26, 2008 at 3:56 AM, Xiance SI(司宪策) [EMAIL PROTECTED] wrote: I have to get Leopard first; now using Tiger, the newest possible Java is 5.0.
Re: the Job jar file doesn't contain the core jar in it.
You should run the job task in the examples directory (ant job); it will generate a file (in examples/build) called apache-mahout-examples-0.1-dev.job. This is the jar (even if it ends with .job) that contains both the examples and the core.

----- Original Message -----
From: Robin Anil [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Sunday, 17 August 2008, 19:55:58
Subject: the Job jar file doesn't contain the core jar in it.

Any idea how the examples should be run? Robin
Mahout.GA, what comes next?
Now that the Class Discovery (CD) example is up and running, it's time to think about what to do next. I already have some ideas, but I want to check with the community first. I see two possible ways ahead of me:

A. Enhance the CD example:
a1. handle categorical attributes
a2. generate dataset infos (attribute types and ranges), possibly using a small map-reduce program
a3. multi-class classification, instead of binary classification

B. Investigate other distributed models, for example the insular (island) model.

Any other suggestion is appreciated.
Re: Problems running the examples
The tests run fine; it's the examples that didn't run correctly. But I found a way to run them, by playing with the HADOOP_HEAPSIZE option in conf/hadoop-env.sh: it defaults to 1000 MB, I just set it to 128 and now it's OK... By the way, the Taste examples are missing a dependency (ejb.jar); is there a good reason not to include it (license issues perhaps)?

--- On Tue, 1.7.08, Jeff Eastman [EMAIL PROTECTED] wrote:
From: Jeff Eastman [EMAIL PROTECTED]
Subject: Re: Problems running the examples
To: mahout-dev@lucene.apache.org
Date: Tuesday, 1 July 2008, 17:32

I had to use -Xmx 256m to get the tests to run without heap problems. Jeff

deneche abdelhakim wrote: I've been using Eclipse for all my testing and everything just works fine. But now I want to build and test the examples using Ant. I managed to modify the build.xml to generate the examples job. But when I run one of the examples (for example ...clustering.syntheticcontrol.canopy.Job) I get the following errors:

$ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

Any hints on how to solve this?
Re: getting started with mahout, failing tests
I just did a fresh checkout and all the tests are successful!

--- On Sat, 21.6.08, Allen Day [EMAIL PROTECTED] wrote:
From: Allen Day [EMAIL PROTECTED]
Subject: getting started with mahout, failing tests
To: mahout-dev@lucene.apache.org
Date: Saturday, 21 June 2008, 8:00

Hi, I finally had a chance to get Mahout checked out and built today. I want to get up to speed so I can start using/contributing. I can get the compile target to build successfully, but I'm getting errors from the test target:

[junit] Test org.apache.mahout.clustering.canopy.TestCanopyCreation FAILED
[junit] Test org.apache.mahout.matrix.TestSparseMatrix FAILED
[junit] Test org.apache.mahout.matrix.TestSparseVector FAILED

Is this normal for now? -Allen
Re: GSOC Mahout.GA, next steps?
I found a cool introduction to evolutionary algorithms; I added it to the wiki, if someone is interested...

--- On Wed, 28.5.08, Grant Ingersoll [EMAIL PROTECTED] wrote:
From: Grant Ingersoll [EMAIL PROTECTED]
Subject: Re: GSOC Mahout.GA, next steps?
To: mahout-dev@lucene.apache.org
Date: Wednesday, 28 May 2008, 13:11

This sounds good. I don't know a lot about GAs, so if others have insight, that would be great. It would also be handy if you could put up a section on the wiki about GAs and maybe post some links to basic papers there, so people who aren't familiar can do some background reading. I will try to get to MAHOUT-56 this week, but others can jump in and review as well. -Grant

On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:

In a GA there are many things that can be distributed, and one should always start with the most compute-demanding task. This is very problem dependent, but in most cases the fitness evaluation function (FEF) is the part to distribute. The FEF evaluates each single individual in the population, and it may need some data (D) to do so. For example, in the Traveling Salesman Problem, the problem is defined by a set of cities and the distances between them; the FEF needs those distances to evaluate the individuals. I see two ways to distribute the FEF:

A. If the data D is not big and can fit on each single cluster node, then the easiest solution is to use each mapper to evaluate one individual and to pass the data D to all the mappers (using some job parameter or the DistributedCache). The input of the job is the population of individuals. For someone used to working with Watchmaker, solution A is straightforward; they need to change one line of code.

B. If the data D is really big and spans multiple nodes, then the FEF should be written in the form of mappers and reducers, the population of individuals is passed to all the mappers (again, using the DistributedCache or a job parameter), and the data D is now the input of the job.

[MAHOUT-56] contains a possible implementation of solution A. Now I should start thinking about solution B, and all I need is a problem that uses very big datasets. I already proposed one in my GSoC proposal: it consists of using a genetic algorithm to find a good binary classification rule for a given dataset. But I am open to any other suggestion.
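A rough sketch of what solution A's mapper could look like; Watchmaker's FitnessEvaluator interface is real, while the toy fitness function and the way the evaluator is wired in are stand-ins for whatever the MAHOUT-56 patch actually does:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.uncommons.watchmaker.framework.FitnessEvaluator;

// Sketch of solution A: the input of the job is the population, one
// individual per record; each map call computes one fitness value.
public class FitnessMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

  // Hypothetical: a real implementation would obtain the evaluator and
  // the candidate decoding from the job configuration.
  private final FitnessEvaluator<String> evaluator = new FitnessEvaluator<String>() {
    public double getFitness(String candidate, List<? extends String> population) {
      return candidate.length(); // toy fitness, not a real problem
    }
    public boolean isNatural() {
      return true; // higher fitness is better
    }
  };

  public void map(LongWritable key, Text individual,
                  OutputCollector<LongWritable, DoubleWritable> out,
                  Reporter reporter) throws IOException {
    double fitness = evaluator.getFitness(individual.toString(),
        Collections.<String>emptyList());
    out.collect(key, new DoubleWritable(fitness));
  }
}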
OutOfMemory Exception!
I checked out the latest version of Mahout (rev. 662372) and got the following exception in many tests (the list of these tests is at the end of this post):

java.io.IOException: Job failed!

The following message is printed to System.err:

java.lang.OutOfMemoryError: Java heap space

I think it's somehow caused by using Hadoop 0.17.0, as my own tests run perfectly with Hadoop 0.16.4. Here are the tests that don't pass:

org.apache.mahout.clustering.canopy.TestCanopyCreation: testCanopyGenManhattanMR testCanopyGenEuclideanMR testClusteringManhattanMR testClusteringEuclideanMR testClusteringManhattanMRWithPayload testClusteringEuclideanMRWithPayload testUserDefinedDistanceMeasure

org.apache.mahout.clustering.meanshift.TestMeanShift: testCanopyEuclideanMRJob
Re: Gene Expression Programming in Mahout
I am working on using Hadoop to distribute the fitness evaluation of (hopefully) any problem written using the Watchmaker framework [https://watchmaker.dev.java.net/]. I already provided a patch with some code [http://issues.apache.org/jira/browse/MAHOUT-56] that lets you distribute the evaluation of the population over the cluster (each node will evaluate a subset of the population). Thank you for the links; I will take a look at some papers, but in the meantime could you please tell me: which part of the GEP algorithm needs to be distributed (I'm guessing it's the fitness evaluation part)?
--- On Mon 2.6.08, juber patel [EMAIL PROTECTED] wrote: From: juber patel [EMAIL PROTECTED] Subject: Re: Gene Expression Programming in Mahout To: mahout-dev@lucene.apache.org, [EMAIL PROTECTED] Date: Monday 2 June 2008, 19:34
Yes, GEP is related to GA, and I feel it provides a more generic way of defining populations, fitness functions, etc., with the possibility of a wide range of grammars for the encoding of the individual. This flexibility can be hugely effective when we can use the computing power of clusters. Here is some bibliography: http://www.gene-expression-programming.com/GEPBiblio.asp Deneche, could you just give me an idea about your work so far? juber
On Mon, Jun 2, 2008 at 11:48 AM, Isabel Drost [EMAIL PROTECTED] wrote: On Sunday 01 June 2008, juber patel wrote: I have been lurking on this list for some time now. I would really like to contribute to Mahout. As I had discussed earlier, I would like to include my code, Amiba (http://amiba.sourceforge.net/), in Mahout. I feel this is the right place for that code. Sounds great! It implements Gene Expression Programming but it is sequential. I would like to adapt it for Hadoop, and for that I am reading up on Hadoop. If you have any questions, feel free to ask us or post your questions to the Hadoop mailing lists. Could you tell me again if this fits well with Mahout? And if you don't mind including it in Mahout. Sure. You might want to coordinate with Deneche Abdelhakim, who is working on GA for GSoC - as I understand, Gene Expression Programming is related to GA? Isabel
-- Juber Patel http://juberpatel.googlepages.com
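[Editor's sketch] For readers unfamiliar with Watchmaker, here is a rough sketch of the shape such an adapter can take, using Watchmaker's FitnessEvaluator interface. The runFitnessJob helper is hypothetical, and the actual MAHOUT-56 patch may be organized quite differently:

    import java.util.List;
    import java.util.Map;

    import org.uncommons.watchmaker.framework.FitnessEvaluator;

    /**
     * Sketch: evaluate the whole population once per generation via a
     * Hadoop job, then answer per-candidate queries from cached results.
     */
    public class HadoopFitnessEvaluator implements FitnessEvaluator<String> {

      private List<? extends String> lastPopulation;
      private Map<String, Double> lastFitnesses;

      @Override
      public double getFitness(String candidate,
          List<? extends String> population) {
        // Reference equality is a simplification: assume Watchmaker passes
        // the same List instance for every candidate of one generation.
        if (population != lastPopulation) {
          // runFitnessJob is hypothetical: it writes the population to
          // HDFS, runs the fitness job, and reads back one fitness per
          // individual.
          lastFitnesses = runFitnessJob(population);
          lastPopulation = population;
        }
        return lastFitnesses.get(candidate);
      }

      @Override
      public boolean isNatural() {
        return true; // higher fitness is better in this sketch
      }

      private Map<String, Double> runFitnessJob(
          List<? extends String> population) {
        throw new UnsupportedOperationException("sketch only");
      }
    }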
Re: GSOC Mahout.GA, next steps?
Ted Dunning [EMAIL PROTECTED] wrote: Conceptually, at least, it would be good to have the option for fitness functions to be expressed as map-reduce programs. Unfortunately, having mappers spawn MR programs runs a real risk of deadlock, especially on less than grandiose clusters. To me, that indicates that if the fitness function is nasty enough to require map-reduce to compute, then either: (a) the executive that manages the population and generates mutations should be written in sequential form, or (b) the evolutionary algorithm has to be written in such a way as to be able to manipulate a map-reduce program, so that evolution and evaluation can be merged into a single (composite) map-reduce program. I vote for (a), because if fitness computations are so complex as to need MR, then the cost of sorting the population will be negligible.
(a) has another advantage too: one can start by writing the program in sequential form, test it with a small dataset, then rewrite only the fitness function in M-R form. This raises the question of how the population should be communicated to the parallel evaluator. I don't know if there are many ways to do it in Hadoop, but how about writing the population into a file and passing it with the DistributedCache? -- abdelhakim
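[Editor's sketch] A minimal sketch of that last idea, assuming the population has already been serialized to a file on HDFS; the path and class names are made up for illustration:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class PopulationShipper {

      /** Driver side: make the population file visible on every node. */
      public static JobConf shipPopulation(String populationHdfsPath)
          throws Exception {
        JobConf conf = new JobConf(PopulationShipper.class);
        // The file is assumed to already contain the serialized population.
        DistributedCache.addCacheFile(new URI(populationHdfsPath), conf);
        return conf;
      }

      // Mapper side, inside configure(JobConf job):
      //   Path[] cached = DistributedCache.getLocalCacheFiles(job);
      //   ... deserialize the population from cached[0] ...
    }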
Re: Thoughts on timeline for first release?
UCI: http://archive.ics.uci.edu/ml/
--- On Wed 21.5.08, Jeff Eastman [EMAIL PROTECTED] wrote: From: Jeff Eastman [EMAIL PROTECTED] Subject: Re: Thoughts on timeline for first release? To: mahout-dev@lucene.apache.org Date: Wednesday 21 May 2008, 17:10
Does anybody have some links to datasets we can use for clustering examples? I'm thinking we could publish an EC2 AMI that includes Hadoop and Mahout, along with a script to deploy it on a cluster, upload the examples and run clustering on it. Is that too ambitious? I'm kinda hoping that we can use 0.17, which advertises simpler EC2 deployment than 0.16. If that won't meet our schedule then maybe I should work through the 0.16 deployment. Jeff
Grant Ingersoll wrote: I was thinking we should get the Taste stuff in (it seems to be pretty close to done) and I would like to get MAHOUT-9 (Naive Bayes) in. This would give us a pretty nice release, I think: namely, a couple of clustering implementations, a classifier, and, of course, Taste. I think I can finish up my part in the next week or so. Then we will need to start to figure out all the fun of releases (signatures, notices.txt, etc.). I'd also like to see us have an easy-to-use demo of the clustering stuff, but it is all right if we don't. -Grant
On May 21, 2008, at 1:23 AM, Sean Owen wrote: Just curious, what are people thinking about the timeline for a first, very early release, like an 0.1 release? Any open tasks that I could pick up to help? Without rushing anything, I'm keen to retire my current project site and forward everybody that's interested to Mahout. As long as there's a .jar distro someone can pick up and use, that's cool. Sean
About Contributing
As part of my GSoC project I started adapting one of the Watchmaker examples (TSP) for use with Mahout. I believe the next step is to open a Jira issue and post an svn patch, isn't it? I also did a fresh checkout of Mahout and ran ant test in the core directory, and got a wonderful Tests failed even before I added my own code :( The test that fails is the following: org.apache.mahout.cf.taste.impl.LoadTest.testItemLoad(). It seems to take more than 120 sec (the allowedTimeSec specified in the test) to load. I wonder, before I start hitting the keyboard with my head, whether it is just normal that this test doesn't pass!
RE: Google Summer of Code
Hi Robin, I am very happy that I've been accepted, thanks to the Mahout community that kindly commented on my draft. So we are four students, that's cool. I wish us good work and great fun this summer. Hakim
Robin Anil [EMAIL PROTECTED] wrote: Hi everyone, this is one of those days where I wake up and see that I have got accepted to GSoC with Mahout (:32-all-out:). I am really excited to kick-start the work. I know I have a lot to understand in terms of coding practices and the whole workflow/process. And I would like to congratulate and say hi to my fellow GSoC'ers Farid, Yun and Abdel, hi to my mentor Ian Holsman, and to the rest of the community. I am usually online on Google Talk; if you use it, do add me: [EMAIL PROTECTED] Cheers and good day, Robin
RE: About the Mahout.GA Comment
The number of running algorithms doesn't depend on the number of processors; in fact, this kind of algorithm is used even when there is only a single processor, because of its good search properties. You can imagine it as a single big GA with a distributed population, where each individual can have its own set of operators. Abdel Hakim
Ted Dunning wrote: I think it is a very bad idea to tie the algorithm to the number of processors being used in this way. A program should produce identical results on any machine, subject only to PRNG seeding issues. On 4/11/08 8:52 PM, deneche abdelhakim wrote: And there are other reasons to distribute a GA: for example, you may want to run a different version of the algorithm (a different population and perhaps a different set of operators) on each computing node, and from time to time some individuals will migrate from one node to another... This kind of distribution has proven to be more effective because it searches a larger space.
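[Editor's sketch] For concreteness, a toy sequential sketch of the island model being described; note that the number of islands is a parameter independent of the number of processors. The Island and Individual types, the ring topology, and the migration interval are illustrative assumptions:

    import java.util.List;
    import java.util.Random;

    /**
     * Toy sketch of an island-model GA: several sub-populations evolve
     * independently and occasionally exchange individuals.
     */
    public class IslandModelSketch {

      public static void run(List<Island> islands, int generations,
          int migrationInterval, Random rng) {
        for (int gen = 1; gen <= generations; gen++) {
          for (Island island : islands) {
            island.evolveOneGeneration(); // each island may use its own operators
          }
          if (gen % migrationInterval == 0) {
            // ring migration: each island sends one individual to its neighbor
            for (int i = 0; i < islands.size(); i++) {
              Island from = islands.get(i);
              Island to = islands.get((i + 1) % islands.size());
              to.receive(from.emigrate(rng));
            }
          }
        }
      }

      interface Island {
        void evolveOneGeneration();
        Individual emigrate(Random rng);
        void receive(Individual immigrant);
      }

      interface Individual {}
    }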
RE: About the Mahout.GA Comment
I don't know the exact term, but maybe I should have said computing process, so each processor (or computing node) can run many computing processes... Ted Dunning [EMAIL PROTECTED] wrote: How is a computing node not a processor? On 4/12/08 9:26 PM, deneche abdelhakim wrote: The number of running algorithms doesn't depend on the number of processors; in fact, this kind of algorithm is used even when there is only a single processor, because of its good search properties...
About the Mahout.GA Comment
Hi Grant, you wrote the following comment on my GSoC proposal: Could someone w/ a little more GA knowledge comment on the use of WatchMaker? What I wonder is if it is possible to distribute some of the WatchMaker functionality? Do you want to know if there are other ways to distribute a GA? May not be needed for this proposal, but I am curious as to how much work is done in Watchmaker vs. the actual fitness function. I don't understand... Abdel Hakim Deneche
GSoC Evolutionary Algorithm Proposal
I've written my proposal, and because I can no longer change it after submitting it to GSoC, I am posting it here first; if you have some suggestions, you are welcome. I will wait until Saturday morning to post it to GSoC.
** Application for Summer of Code 2008, Mahout Project. Deneche Abdel Hakim. Codename: Mahout.GA
I. Synopsis
I will add a genetic algorithm (GA) for binary classification on large datasets to the Mahout project. To save time I will use an existing genetic algorithm framework, WatchMaker [WatchMaker], which is under an Apache Software License. I will also add a parallelized measure that indicates the quality of classification rules on a given dataset; this measure will be available independently of the GA. And if I have enough time, I will make the GA more generic and apply it to a different problem (multiclass classification).
II. Project
A GA works by evolving a population of individuals toward a desired goal. To get a satisfying solution, the GA needs to run thousands of iterations with hundreds of individuals. For each iteration and individual the fitness is calculated; it indicates the closeness of that individual to the desired solution. The main advantage of GAs is their ability to find solutions to problems given only a fitness measure (and, of course, sufficient CPU power); this is particularly helpful when the problem is complex and no mathematical solution is available.
My primary goal is to implement the GA described in [GA]. It uses a fitness function that is easy to implement and can benefit from the Map-Reduce strategy to exploit distributed computing (when the training dataset is very large). It will be available as a ready-to-use tool (Mahout.GA) that discovers binary classification rules for any given dataset. Concretely, the main program will launch the GA using WatchMaker; each time the GA needs to evaluate the fitness of the population, it calls a specific class provided by us, and this class will configure and launch a Hadoop job on a distributed cluster.
My secondary goal is to make Mahout.GA problem-independent, thus allowing us to use it for different problems such as multiclass classification, optimization, and clustering. This will be done by implementing a ready-to-use generic fitness function for WatchMaker that internally calls Hadoop. As a proof of concept I will use it for multiclass classification (if I don't run out of time, of course!).
III. Benefits for Mahout
1. The GA will be integrated with Mahout as a ready-to-use rule-discovery tool for binary classification;
2. Explore the integration of existing frameworks with Mahout, for example how to design the program in such a way that the framework libraries are not needed on the slave nodes (technically it's feasible, but I still need to learn how to do it);
3. The parallelized fitness function can be used independently of Mahout.GA. It's a good measure of the quality of binary classification rules;
4. Simplify the process of using Mahout.GA for other problems. The user will still need to design the solution representation and to implement a fitness function, but all the Hadoop stuff should be hidden or at least made simpler;
5. Apply the generalized Mahout.GA to multiclass classification and write a corresponding tutorial that explains how to use Mahout.GA to solve new problems.
IV. Success Criteria
Main goals:
1. Implement the parallelized fitness function described in [GA] and validate its results on a small dataset;
2. Implement Mahout.GA for binary classification rule discovery.
A simpler (not parallelized) version of this algorithm should also be implemented to validate the results of Mahout.GA.
Secondary goals:
1. Allow the parallelized fitness function to be used independently of Mahout.GA;
2. Use Mahout.GA on a different problem (multiclass classification) and write a corresponding tutorial.
V. Roadmap
[April 14: accepted students known]
1. Familiarize myself with Hadoop: modify one of the Hadoop examples to simulate an iterative process. For each iteration, a new job is executed with different parameters, and its results are imported back by the program.
2. Implement the GA without parallelism:
a. Start by implementing the tutorial example that comes with WatchMaker;
b. Implement my own Individual and fitness function classes;
c. Validate the algorithm using a small dataset, and find the parameters that give acceptable results.
3. Prepare whatever I may need for the development period.
[May 26: coding starts]
4. Implement the parallelized fitness function:
a. Use Hadoop Map-Reduce to implement it [2 weeks];
b. Validate it on a small dataset [1 week].
5. Implement Mahout.GA:
a. Write an intermediary component between WatchMaker and the parallelized fitness function. This component takes a population, configures and launches a job, waits for its completion, and collects the evaluated fitnesses (a rough sketch of such a component follows below).
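[Editor's sketch] A rough sketch of what the step 5.a component could look like, reusing the map-only evaluation job from the earlier mapper sketch; the paths and the readFitnesses helper are hypothetical:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    /**
     * Sketch of the intermediary component: write the current population
     * to HDFS, run the fitness job, and collect the results.
     */
    public class FitnessJobDriver {

      public static double[] evaluate(List<String> population)
          throws IOException {
        JobConf conf = new JobConf(FitnessJobDriver.class);
        Path in = new Path("ga/input");   // illustrative paths
        Path out = new Path("ga/output");

        writePopulation(population, in, conf);     // one individual per line
        FileInputFormat.setInputPaths(conf, in);
        FileOutputFormat.setOutputPath(conf, out);
        conf.setMapperClass(FitnessMapper.class);  // earlier mapper sketch
        conf.setNumReduceTasks(0);                 // map-only evaluation
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DoubleWritable.class);

        JobClient.runJob(conf);                    // blocks until completion
        return readFitnesses(out, conf);
      }

      private static void writePopulation(List<String> population, Path in,
          JobConf conf) throws IOException {
        FSDataOutputStream stream = FileSystem.get(conf).create(in);
        for (String individual : population) {
          stream.writeBytes(individual + "\n");
        }
        stream.close();
      }

      private static double[] readFitnesses(Path out, JobConf conf) {
        // hypothetical: parse "individual <tab> fitness" lines from the
        // job's output files
        throw new UnsupportedOperationException("sketch only");
      }
    }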
GSoC Evolutionary Algorithm Idea
Hi, I'm a PhD student in AI and adaptive systems; I have been working on evolutionary algorithms for the last 4 years. I implemented my own Artificial Immune System in Matlab and as a Java extension to Yale, and I also worked with a C++ framework for multi-objective optimization. My project is to build a classification genetic algorithm in Mahout. I've already done some research and found the following paper, Discovering Comprehensible Classification Rules with a Genetic Algorithm; it's a genetic algorithm for binary classification. The fitness function (which iterates over the whole training dataset) can benefit from the Map-Reduce model of Hadoop. I plan to use an existing open-source framework for the genetic algorithm; the framework should take care of all the GA stuff, and I will be left with:
. the representation of individuals, as described in the article
. the fitness function that uses Hadoop
This algorithm can also be adapted to work with more than two classes... but that's another story. What do you think about it? Abdel Hakim Deneche, Mentouri University of Constantine, Algeria