Re: VOTE: take 2: mahout-collections-1.0

2010-04-11 Thread deneche abdelhakim
+1

On Mon, Apr 12, 2010 at 4:50 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 +1 (on trust, really)

 On Sun, Apr 11, 2010 at 6:49 PM, Benson Margulies bimargul...@gmail.com wrote:

 https://repository.apache.org/content/repositories/orgapachemahout-015/

 contains (this time for sure) all the artifacts for release 1.0 of the
 mahout-collections component. This is the first independent release of
 collections from the rest of mahout; it differs from the version
 released with mahout 0.3 only in removing a dependency on slf4j.

 This vote will remain open for 72 hours.




Re: VOTE: release mahout-collections-codegen 1.0

2010-04-07 Thread deneche abdelhakim
+1

On Thu, Apr 8, 2010 at 2:57 AM, Drew Farris drew.far...@gmail.com wrote:
 +1

 On Tue, Apr 6, 2010 at 9:08 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 In order to decouple the mahout-collections library from the rest of
 Mahout, to allow more frequent releases and other good things, we
 propose to release the code generator for the collections library as a
 separate Maven artifact. (Followed, in short order, by the collections
 library proper.) This is proposed release 1.0 of
 mahout-collections-codegen-plugin. This is intended as a maven-only
 release; we'll put the artifacts in the Mahout download area as well,
 but we don't ever expect anyone to use this except from Maven,
 inasmuch as it is a maven plugin.

 The release artifacts are in the Nexus stage, as follows.

 https://repository.apache.org/content/repositories/orgapachemahout-006/

 This vote will remain open for 72 hours.




Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-18 Thread deneche abdelhakim
should be 'Abdelhakim Deneche' ... because my first name is 'Abdelhakim'

On Thu, Mar 18, 2010 at 1:07 PM, Grant Ingersoll gsing...@apache.org wrote:
 So here's the update:

 X. Establish the Apache Mahout Project

   WHEREAS, the Board of Directors deems it to be in the best
   interests of the Foundation and consistent with the
   Foundation's purpose to establish a Project Management
   Committee charged with the creation and maintenance of
   open-source software related to a machine learning platform
   for distribution at no charge to the public.

   NOW, THEREFORE, BE IT RESOLVED, that a Project Management
   Committee (PMC), to be known as the Apache Mahout Project,
   be and hereby is established pursuant to Bylaws of the
   Foundation; and be it further

   RESOLVED, that the Apache Mahout Project be and hereby is
   responsible for the creation and maintenance of software
   related to a machine learning platform; and be it further

   RESOLVED, that the office of Vice President, Apache Mahout be
   and hereby is created, the person holding such office to
   serve at the direction of the Board of Directors as the chair
   of the Apache Mahout Project, and to have primary responsibility
   for management of the projects within the scope of
   responsibility of the Apache Mahout Project; and be it further

   RESOLVED, that the persons listed immediately below be and
   hereby are appointed to serve as the initial members of the
   Apache Mahout Project:

        • Deneche Abdelhakim adene...@...
        • Isabel Drost (isa...@...)
        • Ted Dunning (tdunn...@...)
        • Jeff Eastman (jeast...@...)
        • Drew Farris (d...@...)
        • Grant Ingersoll (gsing...@...)
        • Benson Margulies (bimargul...@...)
        • Sean Owen (sro...@...)
        • Robin Anil (robina...@...)
        • Jake Mannix  (jman...@...)

   RESOLVED, that the Apache Mahout Project be and hereby
   is tasked with the migration and rationalization of the Apache
   Lucene Mahout sub-project; and be it further

   RESOLVED, that all responsibilities pertaining to the Apache
   Lucene Mahout sub-project encumbered upon the
   Apache Mahout Project are hereafter discharged.

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Sean Owen
   be appointed to the office of Vice President, Apache Mahout, to
   serve in accordance with and subject to the direction of the
   Board of Directors and the Bylaws of the Foundation until
   death, resignation, retirement, removal or disqualification,
   or until a successor is appointed.


Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-18 Thread deneche abdelhakim
close, actually:

عبد الحكيم

=D


On Thu, Mar 18, 2010 at 6:41 PM, Benson Margulies bimargul...@gmail.com wrote:
 Or perhaps:

 عبدل حكيم

 ?


 On Thu, Mar 18, 2010 at 1:34 PM, deneche abdelhakim adene...@gmail.com 
 wrote:
 should be 'Abdelhakim Deneche' ... because my first name is 'Abdelhakim'

 On Thu, Mar 18, 2010 at 1:07 PM, Grant Ingersoll gsing...@apache.org wrote:
 So here's the update:

  [...]




Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-15 Thread deneche abdelhakim
just to get it right: not being on the PMC doesn't mean I'm no longer a
committer, right?

On Mon, Mar 15, 2010 at 6:08 PM, Jake Mannix jake.man...@gmail.com wrote:
 +1 and I'm in (my email @apache is just jmannix btw, for some reason it's not
 listed on those resolutions)

 On Mar 15, 2010 9:07 AM, Robin Anil robin.a...@gmail.com wrote:

 I'm in :) :thumbs up:


 On Mon, Mar 15, 2010 at 8:01 PM, Grant Ingersoll gsing...@apache.org wrote:

 Now that 0.3 is almost out and also given discussions over on 
 gene...@lucene.a.o, I think we ca...



Re: [jira] Commented: (MAHOUT-323) Classify new data using Decision Forest

2010-03-07 Thread deneche abdelhakim
oops, will attach it as soon as possible. I really wonder why 'submit a
patch' and 'attach a patch' are two different operations in JIRA?

On Sat, Mar 6, 2010 at 10:08 PM, Robin Anil (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/MAHOUT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842313#action_12842313
  ]

 Robin Anil commented on MAHOUT-323:
 ---

 No patch? Forgot to attach? Don't bother too much about the code freeze. 
 Since this is a feature that could help people use RF as a classifier even 
 more than it can now, I guess you can keep it for 0.3 with some documentation 
 of course. But before that, attach :)

 Classify new data using Decision Forest
 ---

                 Key: MAHOUT-323
                 URL: https://issues.apache.org/jira/browse/MAHOUT-323
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.4
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim

 When building a Decision Forest we should be able to store it somewhere and 
 use it later to classify new datasets

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




Re: [jira] Commented: (MAHOUT-323) Classify new data using Decision Forest

2010-03-07 Thread deneche abdelhakim
yes, I'm planning to make DF look more like a Mahout classifier. I
will take a look at bayes.

On Sun, Mar 7, 2010 at 7:09 PM, Robin Anil (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/MAHOUT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842459#action_12842459
  ]

 Robin Anil commented on MAHOUT-323:
 ---

 Hey Deneche, can we have a single entry point to the classifier? You are free 
 to modify Train and Test Classifier of bayes, or keep the same naming 
 convention in df and in bayes?



 Classify new data using Decision Forest
 ---

                 Key: MAHOUT-323
                 URL: https://issues.apache.org/jira/browse/MAHOUT-323
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.4
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim
         Attachments: mahout-323.patch


 When building a Decision Forest we should be able to store it somewhere and 
 use it later to classify new datasets

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




Re: Welcome Drew Farris

2010-02-18 Thread deneche abdelhakim
Welcome Drew

=D

On Fri, Feb 19, 2010 at 5:02 AM, Grant Ingersoll gsing...@apache.org wrote:

 On Feb 18, 2010, at 8:32 PM, Drew Farris wrote:

  There's lots more stuff I'd like to get in there,
 now I only need to figure out how to squeeze 48 hours of consciousness
 into a day.

 I believe there is a compression algorithm for that.



Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread deneche abdelhakim
 One important question in my mind here is how does this affect 0.20-based
 jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and
 deneche is also maintaining two versions it seems. I will check the
 AbstractJob and see

although I maintain two versions of Decision Forests, one with the old
API and one with the new, the differences between the two APIs are
significant enough that I can't just keep working on both. Thus all
the new stuff is being committed using the new API, and as far as I can
tell it seems to work great.

On Thu, Feb 4, 2010 at 4:48 PM, Robin Anil robin.a...@gmail.com wrote:
 On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen sro...@gmail.com wrote:

 On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil robin.a...@gmail.com wrote:
  3rd thing:
  I am planning to convert the launcher code to implement ToolRunner.
  Anyone volunteer to help me with that?

 I had wished to begin standardizing how we write these jobs, yes.

 If you see AbstractJob, you'll see how I've unified my three jobs and
 how I'm trying to structure them. It implements ToolRunner so all that
 is already taken care of.

 I think some standardization is really useful, to solve problems like
 this and others, and I'll offer this as a 'draft' for further work. No
 real point in continuing to solve these things individually.
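
 [For reference, a minimal sketch of the Tool/ToolRunner pattern being
 discussed; the class name and comments below are illustrative, not
 Mahout's actual AbstractJob:]

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.conf.Configured;
   import org.apache.hadoop.util.Tool;
   import org.apache.hadoop.util.ToolRunner;

   public class ExampleJob extends Configured implements Tool {
     @Override
     public int run(String[] args) throws Exception {
       Configuration conf = getConf(); // generic -D options already applied
       // ... parse job-specific args, configure and submit the job ...
       return 0;
     }

     public static void main(String[] args) throws Exception {
       // ToolRunner strips the generic Hadoop options before calling run()
       System.exit(ToolRunner.run(new Configuration(), new ExampleJob(), args));
     }
   }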

 One important question in my mind here is how does this affect 0.20-based
 jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and
 deneche is also maintaining two versions it seems. I will check the
 AbstractJob and see


  5th The release:
  Fix a date for the 0.3 release? We should look to improve quality in this
  release, i.e. in terms of running the parts of the code each of us haven't
  tested (like I have run bayes and fp growth many a time, so I will focus on
  running clustering algorithms and try out various options to see if there is
  any issue) and provide feedback so that the one who wrote it can help tweak it?

 Maybe, maybe not. There are always 100 things that could be worked on,
 and that will never change -- it'll never be 'done'. The question of a
 release, at this point, is more like, has enough time elapsed / has
 enough progress been made to warrant a new point release? I think we
 are at that point now.

 The question is not what big things can we do -- 'big' is for 0.4 or
 beyond now -- but what small wins can we get in, or what small changes
 are necessary to tie up loose ends to make a roughly coherent release.
 In that sense, no, I'm not sure I'd say things like what you describe
 should be in for 0.3. I mean we could, but then it's months away, and
 isn't that just what we call 0.4?

 Everyone's had a week or two to move towards 0.3 so I believe it's
 time to begin pushing on these issues, closing them / resolving them /
 moving to 0.4 by end of week. Then set the wheel in motion first thing
 next week, since it'll still be some time before everyone's on board.




Re: dependency question: mahout-examples -> watchmaker-swing -> jfreechart -> jcommons?

2010-01-29 Thread deneche abdelhakim
The only example that actually uses watchmaker-swing is Travelling
Salesman, mainly because it was a direct port of an existing
Watchmaker example. And if I remember correctly, it does not actually use
JFreeChart... so I think it's safe to exclude it.
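
[A minimal sketch of the exclusion being proposed; the watchmaker-swing and
jfreechart Maven coordinates and the version below are assumptions, check the
actual POM before copying:]

  <dependency>
    <groupId>org.uncommons.watchmaker</groupId>
    <artifactId>watchmaker-swing</artifactId>
    <version>0.4.3</version>
    <exclusions>
      <!-- jfreechart drags in jcommons, whose license is the concern -->
      <exclusion>
        <groupId>jfree</groupId>
        <artifactId>jfreechart</artifactId>
      </exclusion>
    </exclusions>
  </dependency>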

On Sat, Jan 30, 2010 at 5:19 AM, Drew Farris drew.far...@gmail.com wrote:
 I spent some time looking at the licenses for the dependencies
 included in the binary release built as a part of MAHOUT-215, and I'm
 wondering if anyone knows whether code in mahout-examples uses,
 directly or indirectly, any of the jfreechart code that is included as a
 transitive dependency of the watchmaker-swing library.

 The issue at hand is that jfreechart pulls in something called
 jcommons, which appears to be licensed under GPL. It is my
 understanding that mahout shouldn't include GPL licensed dependencies
 in a binary release.

 So, if mahout doesn't use jfreechart in any way via watchmaker-swing,
 I can set an exclusion for it in the dependency declaration and thus
 prevent the inclusion of jcommons. Mahout builds and tests complete
 fine with this exclusion set, but that's not the whole story, of
 course.

 Drew



Re: Unit test failure

2010-01-16 Thread deneche abdelhakim
Yeah, it's probably due to the way I used to generate random data... the
problem is that I never get this error =P so it's very difficult to
fix... I'll try my best as soon as I have some time. In the meantime,
rerunning 'mvn clean install' generally does the trick.
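
[A minimal sketch of the reproducibility fix being hinted at, assuming
Mahout's RandomUtils lives in org.apache.mahout.common (useTestSeed() is
mentioned elsewhere in this archive) and the JUnit 3 style of the era:]

  import junit.framework.TestCase;
  import org.apache.mahout.common.RandomUtils;

  public class ReproducibleRandomDataTest extends TestCase {
    @Override
    protected void setUp() throws Exception {
      super.setUp();
      RandomUtils.useTestSeed(); // pin the seed: same "random" data every run
    }

    public void testGatherInfos() {
      // with a fixed seed an intermittent failure becomes repeatable, so it
      // can be debugged instead of rerunning 'mvn clean install' until green
    }
  }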

On Sat, Jan 16, 2010 at 6:58 PM, Grant Ingersoll gsing...@apache.org wrote:
 try rerunning... I think that one has intermittent failures.  Perhaps Deneche 
 can dig in.  You will likely need to look in the Hadoop logs too.
 On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote:

 https://issues.apache.org/jira/browse/MAHOUT-258

 The error message:

 testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)
 Time elapsed: 6.731 sec   ERROR!
 java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)

 does not give me much to go on.

 I don't see how adding new Set classes to my tree could cause this ...




Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04.

I suspect that the problem is not -only- caused by RandomUtils because:

1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but
the test time used to be reported accurately by Maven. Now Maven
reports that a test took less than a second when it actually took a lot
more!

2. Most of my tests actually call RandomUtils.useTestSeed() in setup()
(InMemInputSplitTest included), but the tests still take a lot of time,
and again it's not reported accurately by Maven.

3. I generally launch a 'mvn clean install' every Thursday. I never
got these slowdowns until last Thursday (did we change anything that
could have caused them?)

On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies
bimargul...@gmail.com wrote:

 Unit tests should generally be using a fixed seed and not need to load a
 secure seed from /dev/random.  I would say that RandomUtils is probably the
 problem here.  The secure seed should be loaded lazily, only if the test seed
 is not in use.

 The problem, as I see it, is that the uncommons-math package starts
 initializing a random seed as soon as you touch it, whether you need
 it or not. RandomUtils can only avoid this by avoiding uncommons-math
 in unit test mode.




 --
 Ted Dunning, CTO
 DeepDyve
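
[A minimal sketch of the lazy-seeding idea described above; the class below
is hypothetical, not Mahout's actual RandomUtils:]

  import java.util.Random;
  import org.uncommons.maths.random.MersenneTwisterRNG;

  public final class LazilySeededRandom {
    private static volatile boolean testSeed = false;
    private LazilySeededRandom() { }

    public static void useTestSeed() { testSeed = true; }

    public static Random getRandom() {
      if (testSeed) {
        // plain java.util.Random: no uncommons-math class is touched, so no
        // secure seed is ever gathered from /dev/random under test
        return new Random(42L);
      }
      // secure seeding happens only when a real random source is wanted
      return new MersenneTwisterRNG();
    }
  }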




Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
removing the Maven repository does not solve the problem, nor does a
fresh checkout of the trunk.

but older revisions don't show any slowdown!!! I tried the following revisions:

Those old revisions seem Ok:

r896946 | srowen | 2010-01-07 19:02:41 +0100 (Thu, 07 Jan 2010) | 1 line
MAHOUT-238

r897134 | robinanil | 2010-01-08 09:23:22 +0100 (Fri, 08 Jan 2010) | 1 line
MAHOUT-221 Missed out two files while checking in FP-Bonsai

r897405 | adeneche | 2010-01-09 11:02:49 +0100 (Sat, 09 Jan 2010) | 1 line
MAHOUT-216


 The slowdowns start at this revision!!!

r897440 | srowen | 2010-01-09 13:53:25 +0100 (Sat, 09 Jan 2010) | 1 line
Code style adjustments; enabled/fixed TestSamplingIterator



On Sun, Jan 17, 2010 at 5:47 AM, deneche abdelhakim adene...@gmail.com wrote:
 [...]





Re: Welcome Benson Margulies as Mahout Committer

2010-01-13 Thread deneche abdelhakim
Welcome =D

On Wed, Jan 13, 2010 at 10:36 PM, Drew Farris drew.far...@gmail.com wrote:
 Congratulations Benson. It is wonderful to see your great work in
 mahout-math (and the future mahout-collections?) come together so quickly.

 On Wed, Jan 13, 2010 at 3:28 PM, Grant Ingersoll gsing...@apache.org wrote:

 The Lucene PMC is pleased to welcome the addition of Benson Margulies as a
 committer on Mahout.  I hope you'll join me in offering Benson a warm
 welcome.

 Benson, Lucene tradition is that new committers provide a little bit of a
 background about who they are, so feel free to step up and do so.

 Cheers,
 Grant



Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp

2010-01-08 Thread deneche abdelhakim
the build is successful, thanks =D

On Fri, Jan 8, 2010 at 9:23 AM, Robin Anil robin.a...@gmail.com wrote:
 Try Now



Re: [jira] Resolved: (MAHOUT-71) Dataset to Matrix Reader

2010-01-03 Thread deneche abdelhakim
yep :p

On Sun, Jan 3, 2010 at 4:41 PM, Sean Owen (JIRA) j...@apache.org wrote:

     [ 
 https://issues.apache.org/jira/browse/MAHOUT-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Sean Owen resolved MAHOUT-71.
 -

       Resolution: Later
    Fix Version/s:     (was: 0.3)

 Looks like this is inactive now?

 Dataset to Matrix Reader
 

                 Key: MAHOUT-71
                 URL: https://issues.apache.org/jira/browse/MAHOUT-71
             Project: Mahout
          Issue Type: New Feature
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim
            Priority: Minor

 This component should allow the input datasets to be read as Matrix Rows.
 A Map-Reduce algorithm should handle any dataset in a matrix format, where 
 the columns are the attributes (one of them being the Label) and the rows 
 are the data.
 Working with Hadoop, we'll need to pass the dataset in the mapper's input, 
 so it must be a file (or many files). We'll then need a custom InputFormat 
 to feed the mappers with the data, and here comes the lovely-named row-wise 
 splitting matrix input format.
 Now we want to be able to work with any given dataset file format (including 
 ARFF and my custom format), and thus the InputFormat needs a decoder 
 that converts the dataset lines into matrix rows.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




Re: [math] watch out for Windows

2009-12-29 Thread deneche abdelhakim
last time I tried, running Hadoop 0.20 on Windows was impossible for
me... should we still try to support Windows? I found that installing
Ubuntu on Windows using VirtualBox is the easiest way to use Hadoop
inside Windows

On Mon, Dec 28, 2009 at 8:47 PM, Benson Margulies bimargul...@gmail.com wrote:
 Robin & I just established that the new code generator isn't working
 on Windows at all. I'm in the process of repairing it.



Re: Publish code quality reports on web-site?

2009-12-03 Thread deneche abdelhakim
I'm not planning to make new changes to 'mapred'; my new code should go
to 'mapreduce'

On Thu, Dec 3, 2009 at 3:34 PM, Isabel Drost isa...@apache.org wrote:
 On Thu Sean Owen sro...@gmail.com wrote:

 I suggest our current stance be that we use 0.20.x, with the old APIs.
 When 0.21 comes out and stabilizes, we move. So I suggest keeping
 these and deleting 'mapred' at that point.

 Sounds good to me.

 Isabel



Re: Publish code quality reports on web-site?

2009-11-28 Thread deneche abdelhakim
df/mapred works with the old Hadoop API
df/mapreduce works with the Hadoop 0.20 API

On Saturday, November 28, 2009, Sean Owen sro...@gmail.com wrote:
 I'm all for generating and publishing this.


  The CPD results highlight a question I had: what's up with the amount
 of duplication between org/apache/mahout/df/mapred and
 org/apache/mahout/df/mapreduce -- what is the difference supposed to
 be?


 PMD is complaining a lot about the foo == false vs !foo style. I
 prefer the latter too but we had agreed to use the former, so we could
 disable this check if possible.


  Checkstyle: can we set it to allow a 120-character line, and adjust it
 to consider an indent to be 2 spaces? It's flagging like every line of
 code right now!
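
 [The two settings map to standard Checkstyle modules; a sketch with the
 values proposed above:]

   <module name="LineLength">
     <property name="max" value="120"/>
   </module>
   <module name="Indentation">
     <property name="basicOffset" value="2"/>
   </module>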


 On that note, if possible, I would suggest disabling the following
 FindBugs checks, as they are flagging a lot of stuff that isn't
 'wrong', to me.

 SE_NO_SERIALVERSIONID
 I completely disagree with it. serialVersionUID itself is bad
 practice, in my book.

 EI_EXPOSE_REP2
 it's a fair point but only relevant to security, and we have no such
 issue. The items it flags are done on purpose for performance, it
 looks like.

 SQL_PREPARED_STATEMENT_GENERATED_FROM_NONCONSTANT_STRING
 SQL_NONCONSTANT_STRING_PASSED_TO_EXECUTE
 It's a good point in general, but I'm the only one writing JDBC code,
 and there is actually no security issue here. It's a false positive
 and we could disable this.

 SE_BAD_FIELD
 This one is a little aggressive. It assumes that types not known to be
 Serializable must not be Serializable, which isn't true.

 RV_RETURN_VALUE_IGNORED
 It's a decent idea but flags a lot of legitimate code. For example
 it's complaining about ignoring Queue.poll(), which, like a lot of
 Collection API methods,

 UWF_FIELD_NOT_INITIALIZED_IN_CONSTRUCTOR
 I don't necessarily agree with this one, explicitly setting fields to
 null and primitives to zero? tidy but I'm not used to it.


 I didn't see anything big flagged, good, but we should all have a look
 at the results and tweak accordingly. In some cases it had a good
 small point, or I was indifferent about the approach it was suggesting
 versus what was in the code, so I changed to comply with the check.


 On Fri, Nov 27, 2009 at 8:26 PM, Isabel Drost isa...@apache.org wrote:

 Hello,

 I just ran several code analysis reports over the Mahout source code.
 Results are published at

 http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html

 It includes several reports on code quality, test coverage, java docs
 and the like. When generated regularly, say on Hudson, I think it could
 be beneficial both for us (for getting a quick impression of where
 cleanup is needed most) as well as for potential users.

 I would like to see a third tab added to our homepage that points to
 a page containing reports for each of our modules. I would try to clean up the
 generated site a little before - we certainly do not need the 'Project
 information' stuff in there, as most of this is already generated
 through Forrest. In addition I can take care of setting up a Hudson
 job to recreate the site on a regular schedule.

 Cheers,
 Isabel

 --
  |\      _,,,---,,_       Web:   http://www.isabel-drost.de
  /,`.-'`'    -.  ;-;;,_
  |,4-  ) )-,_..;\ (  `'-'
 '---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net





Re: 0.2 status

2009-11-12 Thread deneche abdelhakim
please use Decision Forests instead of Random Forests



On Thu, Nov 12, 2009 at 9:01 AM, Robin Anil robin.a...@gmail.com wrote:
 Please edit/add stuff.

 Robin


 ==

 Apache Mahout 0.2 has been released and is now available for public
 download. Apache Mahout is a subproject of Apache Lucene with the goal
 of delivering scalable machine learning algorithm implementations
 under the Apache license.
 link
 Mahout is a machine learning library meant to scale to the size of
 data we manage today. Built on top of the powerful map/reduce paradigm
 of the Apache Hadoop project, Mahout lets you run popular machine learning
 methods like clustering, collaborative filtering, and classification over
 terabytes of data across thousands of computers.

 The complete changelist can be found here:
 http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

 New Mahout 0.2 features include

 - Major performance enhancements in Collaborative Filtering,
 Classification and Clustering
 - New: Latent Dirichlet Allocation(LDA) implementation for topic modelling
 - New: Frequent Itemset Mining for mining top-k patterns from a list
 of transactions
 - New: Random Forests implementation for Decision Tree classification
 (In Memory & Partial Data)
 - New: HBase storage support for Naive Bayes model building and classification
 - New: Generation of vectors from Text documents for use with Mahout 
 Algorithms
 - Performance improvements in various Vector implementations
 - Tons of bug fixes and code cleanup



 On Thu, Nov 12, 2009 at 9:06 AM, Grant Ingersoll gsing...@apache.org wrote:

 Anyone care to writeup a release announcement?  Here's Solr's: 
 http://lucene.grantingersoll.com/2009/11/10/apache-solr-1-4-0-offically-released/

 I've cleaned up the build quite a bit and am now testing preparing the 
 artifacts w/ the much simpler build (no more installing third party libs, 
 they are all up under o.a.mahout in the Maven repo).  I'd like to have 
 everything ready to go once the artifacts are put up for a vote.

 Thanks,
 Grant



Re: [jira] Commented: (MAHOUT-184) Code tweaks for .df.* code

2009-10-02 Thread deneche abdelhakim
Sure.

On Fri, Oct 2, 2009 at 8:59 AM, Isabel Drost (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/MAHOUT-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761501#action_12761501
  ]

 Isabel Drost commented on MAHOUT-184:
 -

 Looks good to me. Deneche, could you please also have a look at the patch to 
 spot any issues early on?

 I would prefer using CLI for the job implementation 
 (core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java), 
 but that can be done in a later patch.



 Code tweaks for .df.* code
 --

                 Key: MAHOUT-184
                 URL: https://issues.apache.org/jira/browse/MAHOUT-184
             Project: Mahout
          Issue Type: Improvement
            Reporter: Sean Owen
            Assignee: Sean Owen
            Priority: Minor
             Fix For: 0.2

         Attachments: Tweaks_to__df__.patch


 This follows on my last email to the mailing list, and code inspection. It's 
 big enough I made a patch. No surprises I hope given the consensus on code 
 style and practice. Might be some good takeaways in here, or points for 
 further discussion.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




commit rights ?

2009-09-27 Thread deneche abdelhakim
I'm trying to commit [MAHOUT-122 |
https://issues.apache.org/jira/browse/MAHOUT-122], but I'm getting the
following error:

svn: Commit failed (details follow):
svn: Server sent unexpected return value (403 Forbidden) in response
to MKACTIVITY request for
'/repos/asf/!svn/act/de296129-b366-459b-b184-c95f10139e7e'

I'm using the following command:

svn commit --username adeneche --password *** --message 'MAHOUT-122
Decision Forests Reference Implementation'


Re: commit rights ?

2009-09-27 Thread deneche abdelhakim
Yes! That was it... thanks for the answer. I would have spent 99 years,
3 months and 6 days before finding the problem myself =P
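
[For anyone hitting the same 403: switching an existing working copy over
to https is enough, e.g.

  svn switch --relocate http://svn.apache.org/repos/asf/lucene/mahout/trunk \
      https://svn.apache.org/repos/asf/lucene/mahout/trunk

the trunk path above is assumed; adjust it to your own checkout.]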

On Sun, Sep 27, 2009 at 12:39 PM, Grant Ingersoll gsing...@apache.org wrote:
 Yeah, you're in the committers list, so I'd check that you are using https

 -Grant

 On Sep 27, 2009, at 6:47 AM, Simon Willnauer wrote:

  Are you committing into an http or https path? You must check out via
  https in order to commit; this has been an issue for many new
  committers.

 Simon

 On Sun, Sep 27, 2009 at 8:49 AM, deneche abdelhakim adene...@apache.org
 wrote:

 I'm trying to commit [MAHOUT-122 |
 https://issues.apache.org/jira/browse/MAHOUT-122], but I'm getting the
 following error:

 svn: Commit failed (details follow):
 svn: Server sent unexpected return value (403 Forbidden) in response
 to MKACTIVITY request for
 '/repos/asf/!svn/act/de296129-b366-459b-b184-c95f10139e7e'

 I'm using the following command:

 svn commit --username adeneche --password *** --message 'MAHOUT-122
 Decision Forests Reference Implementation'


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search




Re: svn commit: r816569 - in /lucene/mahout/trunk/examples/src: main/java/org/apache/mahout/classifier/bayes/ main/java/org/apache/mahout/clustering/meanshift/ main/java/org/apache/mahout/clustering

2009-09-21 Thread deneche abdelhakim
yes, it's meant to be run twice, once selecting the training samples
and the next time the testing samples. It assumes that the RNG will
return the exact same numbers twice.
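
[A minimal sketch of the property being relied on, assuming Uncommons Maths'
MersenneTwisterRNG, which takes a 16-byte seed:]

  import java.util.Random;
  import org.uncommons.maths.random.MersenneTwisterRNG;

  public class SameSeedDemo {
    public static void main(String[] args) {
      byte[] seed = new byte[16]; // DatasetSplit would use split.getSeed()
      Random first = new MersenneTwisterRNG(seed);
      Random second = new MersenneTwisterRNG(seed);
      for (int i = 0; i < 100; i++) {
        // identical seeds give identical sequences, so one pass can pick the
        // training samples and a second pass the complementary test samples
        if (first.nextDouble() != second.nextDouble()) {
          throw new AssertionError("RNG is not reproducible");
        }
      }
    }
  }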

On Mon, Sep 21, 2009 at 1:54 PM, Sean Owen sro...@gmail.com wrote:
 I rolled it back. So the reader depends on the seed and the exact
 behavior of the RNG? I have no doubt it is needed if intended, just
 checking that it's intended.

 (I also fixed build-reuters.sh)

 On Sun, Sep 20, 2009 at 1:55 PM, Sean Owen sro...@gmail.com wrote:
 Sorry I will investigate when back at my workstation. I remember
 something like this but thought I preserved the seed. Guess I missed
 something. My bad, I try not to ever change semantics.



Re: svn commit: r816569 - in /lucene/mahout/trunk/examples/src: main/java/org/apache/mahout/classifier/bayes/ main/java/org/apache/mahout/clustering/meanshift/ main/java/org/apache/mahout/clustering

2009-09-20 Thread deneche abdelhakim
The change in 
examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/hadoop/DatasetSplit.java
could lead to a bug. The problem is in the following modification:

-  rng = new MersenneTwisterRNG(split.getSeed());
+  rng = RandomUtils.getRandom();

rng is supposed to use the seed given by split.

I tried to correct this line myself, but I'm having problems
committing the change. I'm getting the following message from svn:

svn: Commit failed (details follow):
svn: Server sent unexpected return value (403 Forbidden) in response
to MKACTIVITY request for
'/repos/asf/!svn/act/627fc1d8-98ad-4046-ae77-41962e731928'

although I successfully committed my changes to the site.


On Fri, Sep 18, 2009 at 11:01 AM,  sro...@apache.org wrote:
 Author: srowen
 Date: Fri Sep 18 10:01:12 2009
 New Revision: 816569

 URL: http://svn.apache.org/viewvc?rev=816569&view=rev
 Log:
 Bit of cleanup and, I think, a fix to the WikipediaDatasetCreatorMapper?

 Modified:
    
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
    
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/meanshift/DisplayMeanShift.java
    
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputMapper.java
    
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/CDRule.java
    
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/hadoop/DatasetSplit.java
    
 lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/CDCrossoverTest.java
    
 lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/hadoop/CDMapperTest.java
    
 lucene/mahout/trunk/examples/src/test/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosToolTest.java

 Modified: 
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
 URL: 
 http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java?rev=816569&r1=816568&r2=816569&view=diff
 ==
 --- 
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
  (original)
 +++ 
 lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaDatasetCreatorMapper.java
  Fri Sep 18 10:01:12 2009
 @@ -42,12 +42,15 @@
 
  public class WikipediaDatasetCreatorMapper extends MapReduceBase implements
     Mapper<LongWritable, Text, Text, Text> {
 -  private static final Logger log = LoggerFactory.getLogger(WikipediaDatasetCreatorMapper.class);
 
 -  private static Set<String> inputCategories = null;
 -  private static boolean exactMatchOnly = false;
 -  private static Analyzer analyzer;
 +  private static final Logger log = LoggerFactory.getLogger(WikipediaDatasetCreatorMapper.class);
    private static final Pattern SPACE_NON_ALPHA_PATTERN = Pattern.compile("[\\s\\W]");
 +  private static final Pattern OPEN_TEXT_TAG_PATTERN = Pattern.compile("<text xml:space=\"preserve\">");
 +  private static final Pattern CLOSE_TEXT_TAG_PATTERN = Pattern.compile("</text>");
 +
 +  private Set<String> inputCategories = null;
 +  private boolean exactMatchOnly = false;
 +  private Analyzer analyzer;
 
    @Override
    public void map(LongWritable key, Text value,
 @@ -59,7 +62,7 @@
      String catMatch = findMatchingCategory(document);
 
      if(!catMatch.equals("Unknown")){
 -      document = StringEscapeUtils.unescapeHtml(document.replaceFirst("<text xml:space=\"preserve\">", "").replaceAll("</text>", ""));
 +      document = StringEscapeUtils.unescapeHtml(CLOSE_TEXT_TAG_PATTERN.matcher(OPEN_TEXT_TAG_PATTERN.matcher(document).replaceFirst("")).replaceAll(""));
        TokenStream stream = analyzer.tokenStream(catMatch, new StringReader(document));
        Token token = new Token();
        while((token = stream.next(token)) != null){
 @@ -69,18 +72,19 @@
      }
    }
 
 -  public static String findMatchingCategory(String document){
 +  private String findMatchingCategory(String document){
      int startIndex = 0;
      int categoryIndex;
 -    String match = null; // TODO this is never updated?
      while((categoryIndex = document.indexOf("[[Category:", startIndex))!=-1)
      {
        categoryIndex+=11;
        int endIndex = document.indexOf("]]", categoryIndex);
 -      if(endIndex>=document.length() || endIndex < 0) break;
 +      if (endIndex >= document.length() || endIndex < 0) {
 +        break;
 +      }
        String category = document.substring(categoryIndex, endIndex).toLowerCase().trim();
        //categories.add(category.toLowerCase());
 -      if (exactMatchOnly == true && inputCategories.contains(category)){
 +      if (exactMatchOnly && inputCategories.contains(category)){
          return category;
        } else if (exactMatchOnly == false){
          for (String

Re: Updating the Web site

2009-09-16 Thread deneche abdelhakim
forrest is installed in my home directory :(

--- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Updating the Web site
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, September 15, 2009, 2:14 PM
 Hmm, make sure you have proper permissions to write on the forrest
 install.  I believe Forrest downloads stuff to its directories.  I
 recall seeing similar things.  Very annoying.
 
 On Sep 15, 2009, at 7:12 AM, deneche abdelhakim wrote:
 
  I'm already using Java 1.5!
 
  --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote:
 
  From: Grant Ingersoll gsing...@apache.org
  Subject: Re: Updating the Web site
  To: mahout-dev@lucene.apache.org
  Date: Tuesday, September 15, 2009, 12:54 PM
  Forrest has a bug w/ JDK 1.6, just switch to 1.5 for it and it should
  work.
 
  On Sep 15, 2009, at 6:24 AM, deneche abdelhakim wrote:
 
  I followed the instructions available here:
 
  http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html
 
  in order to add my name to the committer list =P
 
  when running 'forrest run' I'm getting broken links:
 
  X [0] skin/images/current.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
  X [0] skin/images/page.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
  X [0] skin/images/chapter.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
 
  it also says that "Your site would still be generated, but some
  pages would be broken."
 
  svn status shows me that I only modified
  src/documentation/content/xdocs/whoweare.xml
 
  can I proceed anyway and copy the site to the publish directory ?
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using Solr/Lucene:
 http://www.lucidimagination.com/search
 
 





Re: Updating the Web site

2009-09-16 Thread deneche abdelhakim
 and js files to site ...
Copying 12 files to /home/hakim/mahout/site/build/site/skin
Copying 5 files to /home/hakim/mahout/site/build/site/skin

Finished copying the non-generated resources.
Now Cocoon will generate the rest.

Static site will be generated at:
/home/hakim/mahout/site/build/site

Cocoon will report the status of each document:
  - in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ).
  
 
cocoon 2.2.0-dev
Copyright (c) 1999-2005 Apache Software Foundation. All rights reserved.
Build: December 8 2005 (TargetVM=1.4, SourceVM=1.4, Debug=on, Optimize=on)
 


* [1/20][20/20]   5.635s 8.3Kb   linkmap.html
* [2/20][1/19]1.282s 6.9Kb   releases.html
* [3/21][2/22]1.022s 16.3Kb  index.html
* [4/21][1/19]0.509s 7.2Kb   developer-resources.html
* [5/20][0/0] 2.717s 2.3Kb   linkmap.pdf
* [7/18][0/0] 0.154s 4.2Kb   skin/profile.css
* [8/17][0/0] 2.909s 348b    skin/images/rc-b-l-15-1body-2menu-3menu.png
* [11/16]   [2/20]0.461s 30.3Kb  taste.html
* [13/14]   [0/0] 0.856s 32.9Kb  taste.pdf
* [14/13]   [0/0] 22.791s 33.9Kb  index.pdf
* [18/9][0/0] 0.077s 5.1Kb   developer-resources.pdf
* [19/8][0/0] 0.09s  4.4Kb   releases.pdf
* [20/8][1/19]0.327s 9.7Kb   mailinglists.html
* [21/7][0/0] 0.259s 5.5Kb   mailinglists.pdf
* [22/6][0/0] 0.511s 2.9Kb   skin/basic.css
* [23/6][1/19]0.326s 7.2Kb   whoweare.html
* [24/5][0/0] 0.103s 4.1Kb   whoweare.pdf
* [26/4][1/19]0.322s 6.7Kb   systemrequirements.html
* [27/3][0/0] 0.079s 3.3Kb   systemrequirements.pdf
* [28/15]   [13/13]   0.143s 12.4Kb  skin/screen.css
* [29/14]   [0/0] 0.035s 390b    skin/images/rc-t-r-15-1body-2menu-3menu.png
X [0] skin/images/current.gif   BROKEN: 
/home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
* [31/12]   [0/0] 0.052s 214b
skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png
* [32/11]   [0/0] 0.018s 200b
skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png
X [0] skin/images/page.gif  BROKEN: 
/home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
* [34/9][0/0] 0.019s 209b
skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png
* [35/8][0/0] 0.022s 214b
skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png
X [0] skin/images/chapter.gif   BROKEN: 
/home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
* [37/6][0/0] 0.029s 199b
skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png
* [38/5][0/0] 0.055s 215b
skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png
* [40/3][0/0] 0.049s 319b    skin/images/rc-b-r-15-1body-2menu-3menu.png
* [41/2][0/0] 0.018s 199b
skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png
* [42/1][0/0] 0.025s 1.2Kb   skin/print.css
Total time: 0 minutes 44 seconds,  Site size: 213,417 Site pages: 30
Java Result: 1

  Copying broken links file to site root.
  
Copying 1 file to /home/hakim/mahout/site/build/site

BUILD FAILED
/home/hakim/apache-forrest-0.8/main/targets/site.xml:180: Error building site.

There appears to be a problem with your site build.

Read the output above:
* Cocoon will report the status of each document:
- in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ).
* Even if only one link is broken, you will still get failed.
* Your site would still be generated, but some pages would be broken.
  - See /home/hakim/mahout/site/build/site/broken-links.xml

Total time: 1 minute 5 seconds
***

--- On Wed 9/16/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Updating the Web site
 To: mahout-dev@lucene.apache.org
 Date: Wednesday, September 16, 2009, 3:35 PM
 What's the full log say?
 
 On Sep 16, 2009, at 7:15 AM, deneche abdelhakim wrote:
 
  forrest is installed in my home directory :(
 
  --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote:
 
  From: Grant Ingersoll gsing...@apache.org
  Subject: Re: Updating the Web site
  To: mahout-dev@lucene.apache.org
  Date: Tuesday, September 15, 2009, 2:14 PM
  Hmm, make sure you have proper permissions to write on the forrest
  install.  I believe Forrest downloads stuff to its directories.  I
  recall seeing similar things.  Very annoying.
 
  On Sep 15, 2009, at 7:12 AM, deneche abdelhakim wrote:
 
  I'm already using Java 1.5!
 
  --- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote:
 
  From: Grant Ingersoll gsing...@apache.org
  Subject: Re: Updating the Web site
  To: mahout-dev@lucene.apache.org
  Date: Tuesday

Re: Updating the Web site

2009-09-16 Thread deneche abdelhakim
it's working =D

--- On Wed 9/16/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Updating the Web site
 To: mahout-dev@lucene.apache.org
 Date: Wednesday, September 16, 2009, 4:08 PM
 svn up and try again
 
 On Sep 16, 2009, at 10:00 AM, Grant Ingersoll wrote:
 
  Now when I did a forrest clean I get the same error.
  
  On Sep 16, 2009, at 9:44 AM, deneche abdelhakim wrote:
  
  'forrest site' gives me:
 
  **
  Apache Forrest.  Run 'forrest -projecthelp' to list options
  
  Buildfile: /home/hakim/apache-forrest-0.8/main/forrest.build.xml
  
  check-java-version:
  This is apache-forrest-0.8
  Using Java 1.5 from /usr/lib/jvm/java-1.5.0-sun-1.5.0.19/jre
  
  init-props:
  
  echo-settings:
  
  check-skin:
  
  init-proxy:
  
  fetch-skins-descriptors:
  
  fetch-skin:
  
  unpack-skins:
  
  init-skins:
  
  fetch-plugins-descriptors:
  Fetching plugins descriptor: http://forrest.apache.org/plugins/plugins.xml
  Getting: http://forrest.apache.org/plugins/plugins.xml
  To: /home/hakim/mahout/site/build/tmp/plugins-1.xml
  local file date : Wed Dec 03 01:37:14 CET 2008
  Not modified - so not downloaded
  Fetching plugins descriptor: http://forrest.apache.org/plugins/whiteboard-plugins.xml
  Getting: http://forrest.apache.org/plugins/whiteboard-plugins.xml
  To: /home/hakim/mahout/site/build/tmp/plugins-2.xml
  local file date : Thu Jan 15 04:07:07 CET 2009
  Not modified - so not downloaded
  Plugin list loaded from http://forrest.apache.org/plugins/plugins.xml.
  Plugin list loaded from http://forrest.apache.org/plugins/whiteboard-plugins.xml.
  
  init-plugins:
  Copying 1 file to /home/hakim/mahout/site/build/tmp
  Copying 1 file to /home/hakim/mahout/site/build/tmp
  Copying 1 file to /home/hakim/mahout/site/build/tmp
  Copying 1 file to /home/hakim/mahout/site/build/tmp
  Copying 1 file to /home/hakim/mahout/site/build/tmp
  
    --
      Installing plugin: org.apache.forrest.plugin.output.pdf
    --
  
  check-plugin:
  org.apache.forrest.plugin.output.pdf is available in the build dir. Trying to update it...
  
  init-props:
  
  echo-settings:
  
  init-proxy:
  
  fetch-plugins-descriptors:
  
  fetch-plugin:
  Trying to find the description of org.apache.forrest.plugin.output.pdf in the different descriptor files
  Using the descriptor file /home/hakim/mahout/site/build/tmp/plugins-1.xml...
  Processing /home/hakim/mahout/site/build/tmp/plugins-1.xml to /home/hakim/mahout/site/build/tmp/pluginlist2fetchbuild.xml
  Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginlist2fetch.xsl
  
  fetch-local-unversioned-plugin:
  
  get-local:
  Trying to locally get org.apache.forrest.plugin.output.pdf
  Looking in local /home/hakim/apache-forrest-0.8/plugins
  Found !
  
  init-build-compiler:
  
  echo-init:
  
  init:
  
  compile:
  
  jar:
  
  local-deploy:
  Locally deploying org.apache.forrest.plugin.output.pdf
  
  build:
  Plugin org.apache.forrest.plugin.output.pdf deployed ! Ready to configure
  
  fetch-remote-unversioned-plugin-version-forrest:
  
  fetch-remote-unversioned-plugin-unversion-forrest:
  
  has-been-downloaded:
  
  downloaded-message:
  
  uptodate-message:
  
  not-found-message:
  Fetch-plugin Ok, installing !
  
  unpack-plugin:
  
  install-plugin:
  
  configure-plugin:
  
  configure-output-plugin:
  Mounting output plugin: org.apache.forrest.plugin.output.pdf
  Processing /home/hakim/mahout/site/build/tmp/output.xmap to /home/hakim/mahout/site/build/tmp/output.xmap.new
  Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginMountSnippet.xsl
  Moving 1 file to /home/hakim/mahout/site/build/tmp
  
  configure-plugin-locationmap:
  Mounting plugin locationmap for org.apache.forrest.plugin.output.pdf
  Processing /home/hakim/mahout/site/build/tmp/locationmap.xml to /home/hakim/mahout/site/build/tmp/locationmap.xml.new
  Loading stylesheet /home/hakim/apache-forrest-0.8/main/var/pluginLmMountSnippet.xsl
  Moving 1 file to /home/hakim/mahout/site/build/tmp
  
  init:
  
  -prepare-classpath:
  
  check-contentdir:
  
  examine-proj:
  
  validation-props:
  
  validate-xdocs:
  8 file(s) have been successfully validated.
  ...validated xdocs
  
  validate-skinconf:
  1 file(s) have been successfully validated.
  ...validated skinconf
  
  validate-sitemap:
  ...validated project sitemap
  
  validate-skins-stylesheets:
  ...validated skin stylesheets
  
  validate-skins:
  
  validate-skinchoice:
  ...validated existence of skin 'lucene'
  
  validate-stylesheets:
  
  validate:
  
  site:
  
  Copying the various non-generated resources to site.
  Warnings will be issued if the optional project resources are not found.
  This is often the case, because
Re : Welcome the newest Mahouts!

2009-09-15 Thread deneche abdelhakim
Got my Apache account yesterday 8D

being a coder, I always find it difficult to write things other than code =P, so 
my biography will probably be weird:

I am an Algerian PhD student; I'm expecting to use machine learning algorithms 
(probably evolutionary computing) and distributed computing (Mahout? maybe). 
During my master's I worked on Artificial Immune Systems applied to pattern 
recognition.
I like coding, mainly in Java, but also in C# (although being at pro-noob level 
in C#).
The past two years I learned a lot with Mahout's community, and I'm looking 
forward to learning much more.



--- On Wed 8/26/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Welcome the newest Mahouts!
 To: mahout-u...@lucene.apache.org, Mahout Dev List mahout-dev@lucene.apache.org
 Date: Wednesday, August 26, 2009, 4:57 PM
 I am pleased to announce that the Lucene PMC has voted to add Deneche
 Abdelhakim, Robin Anil and David Hall as Mahout committers.  Deneche,
 Robin and David have all made significant contributions to Mahout in
 regards to classification, clustering, evolutionary programming and
 general usage and utilities.  Furthermore, all three are or have been
 pursuing studies in machine learning at University, so we look for more
 great things as well!
 
 I hope you will join me in extending them a warm welcome.  I know I
 look forward to working with them and continuing to build on Mahout's
 capabilities on our way to a 1.0 release.
 
 Also, it is customary that each new committer take the time to
 introduce themselves on the mailing list with a brief bio/background so
 we can all better get to know you.
 
 Finally, if you're interested in knowing more about what's involved in
 becoming a committer or would simply like to contribute to Mahout, see
 http://cwiki.apache.org/MAHOUT/howtocontribute.html and
 http://cwiki.apache.org/MAHOUT/howtobecomeacommitter.html.
 
 Congrats to Deneche, Robin and David!
 
 -Grant
 





Updating the Web site

2009-09-15 Thread deneche abdelhakim
I followed the instructions available here:

http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html

in order to add my name to the committer list =P

when running 'forrest run' I'm getting broken links:

X [0] skin/images/current.gif   
  BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
X [0] skin/images/page.gif  
  BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
X [0] skin/images/chapter.gif   
  BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)

it also says that "Your site would still be generated, but some pages would be 
broken."

svn status shows me that I only modified 
src/documentation/content/xdocs/whoweare.xml

can I proceed anyway and copy the site to the publish directory ? 





Re: Re : Welcome the newest Mahouts!

2009-09-15 Thread deneche abdelhakim
 Can you tell more on what you will be working on, which
 problems you
 are trying to solve?

I'm expecting to work on Discrete Tomography, probably reconstruction 
algorithms. But the final decision isn't mine, so I may end up working on 
something else =P

--- On Tue 9/15/09, Isabel Drost isa...@apache.org wrote:

 From: Isabel Drost isa...@apache.org
 Subject: Re: Re : Welcome the newest Mahouts!
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, September 15, 2009, 12:29 PM
 On Tue, 15 Sep 2009 10:11:56 + (GMT), deneche abdelhakim
 a_dene...@yahoo.fr wrote:
 
  Got my Apache account yesterday 8D
 
 Congratulations! And a warm welcome from me of course.
 
 
  I am an Algerian PhD student; I'm expecting to use machine learning
  algorithms (probably evolutionary computing) and distributed
  computing (Mahout? maybe).
 
 Can you tell more on what you will be working on, which problems you
 are trying to solve?
 
 
  The past two years I learned a lot with Mahout's community, and I'm
  looking forward to learning much more.
 
 Hope you'll enjoy your time here.
 
 Isabel
 





Re: Updating the Web site

2009-09-15 Thread deneche abdelhakim
I'm already using Java 1.5!

--- On Tue 9/15/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Updating the Web site
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, September 15, 2009, 12:54 PM
 Forrest has a bug w/ JDK 1.6, just switch to 1.5 for it and it should
 work.
 
 On Sep 15, 2009, at 6:24 AM, deneche abdelhakim wrote:
 
  I followed the instructions available here:
 
  http://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html
 
  in order to add my name to the committer list =P
 
  when running 'forrest run' I'm getting broken links:
 
  X [0] skin/images/current.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
  X [0] skin/images/page.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
  X [0] skin/images/chapter.gif
    BROKEN: /home/hakim/apache-forrest-0.8/main/webapp/. (Is a directory)
 
  it also says that "Your site would still be generated, but some
  pages would be broken."
 
  svn status shows me that I only modified
  src/documentation/content/xdocs/whoweare.xml
 
  can I proceed anyway and copy the site to the publish directory ?
 
 
 
 
 





Re: JIRA permission ?

2009-09-15 Thread deneche abdelhakim
Thanks!

--- On Tue 9/15/09, Isabel Drost isa...@apache.org wrote:

 From: Isabel Drost isa...@apache.org
 Subject: Re: JIRA permission ?
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, September 15, 2009, 5:23 PM
 On Tue, 15 Sep 2009 14:52:28 + (GMT), deneche abdelhakim
 a_dene...@yahoo.fr wrote:
 
  now that I'm a committer ( 8D ) I suppose I can assign JIRA issues to
  myself. Do I need a special permission to do that ? because I'm not
  able to find a way to do it =P
 
 I added you as committer to jira. You should be able to assign JIRA
 issues to yourself now.
 
 Isabel
 





Re : Comprehensive study on Java Memory Optimization

2009-09-14 Thread deneche abdelhakim
Thanks Robin =D

--- On Mon 9/14/09, Robin Anil robin.a...@gmail.com wrote:

 From: Robin Anil robin.a...@gmail.com
 Subject: Comprehensive study on Java Memory Optimization
 To: mahout-dev mahout-dev@lucene.apache.org
 Date: Monday, September 14, 2009, 9:08 AM
 Hope it would be useful.
 Link: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
 
 Robin
 






Re : [GSOC] Code Submissions

2009-09-08 Thread deneche abdelhakim
done.

--- On Tue 9/8/09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: [GSOC] Code Submissions
 To: Mahout Dev List mahout-dev@lucene.apache.org
 Date: Tuesday, September 8, 2009, 1:09 PM
 Hi Robin, David and Deneche,
 
 You will need to submit code samples.  Please see
 http://groups.google.com/group/google-summer-of-code-announce/web/how-to-provide-google-with-sample-code
 
 -Grant
 






Re: build failure

2009-08-26 Thread deneche abdelhakim
just got the same error; nuking .m2 AND installing Maven 2.2.1 solved the 
problem


--- On Tue 25.8.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: build failure
 To: mahout-dev@lucene.apache.org, isa...@apache.org
 Date: Tuesday, August 25, 2009, 0:58
 Tried the -U solution.  No joy.
 
 I will try nuking .m2 next.
 
 On Mon, Aug 24, 2009 at 3:33 PM, Isabel Drost isa...@apache.org
 wrote:
 
  On Sunday 23 August 2009 16:24:13 Grant Ingersoll
 wrote:
   Try deleting your ~/.m2/repository.
 
  It should be sufficient to delete the resources-plugin
 in the repo only, or
  maybe running maven with -U enabled already helps?
 
  Isabel
 
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 






class not found bug ?

2009-08-17 Thread deneche abdelhakim
I recently moved some of the Decision Forest examples from the core project 
to the examples project. In core they worked perfectly on hadoop 0.19.1 
(pseudo-distributed), but now they don't!

For example, running my org.apache.mahout.df.BuildForest gives the following 
exception:


09/08/17 12:02:36 INFO mapred.JobClient: Running job: job_200908171136_0020
09/08/17 12:02:37 INFO mapred.JobClient:  map 0% reduce 0%
09/08/17 12:02:43 INFO mapred.JobClient: Task Id : 
attempt_200908171136_0020_m_00_0, Status : FAILED
java.lang.NoClassDefFoundError: com/thoughtworks/xstream/XStream
at org.apache.mahout.utils.StringUtils.clinit(StringUtils.java:28)
at org.apache.mahout.df.mapred.Builder.getTreeBuilder(Builder.java:117)
at 
org.apache.mahout.df.mapred.MapredMapper.configure(MapredMapper.java:74)
at 
org.apache.mahout.df.mapred.partial.Step1Mapper.configure(Step1Mapper.java:75)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
...


So I tried instead to run one of Mahout's own jobs: following the wiki, kmeans 
gives me the following error:

...
09/08/17 11:59:27 INFO kmeans.KMeansDriver: Iteration 4
...
09/08/17 11:59:43 INFO kmeans.KMeansDriver: Clustering 
09/08/17 11:59:43 INFO kmeans.KMeansDriver: Running Clustering
09/08/17 11:59:43 INFO kmeans.KMeansDriver: Input: output/data Clusters In: 
output/clusters-4 Out: output/points Distance: 
org.apache.mahout.utils.EuclideanDistanceMeasure
09/08/17 11:59:43 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: 
org.apache.mahout.matrix.SparseVector
09/08/17 11:59:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
09/08/17 11:59:43 INFO mapred.FileInputFormat: Total input paths to process : 2
09/08/17 11:59:43 INFO mapred.JobClient: Running job: job_200908171136_0019
09/08/17 11:59:44 INFO mapred.JobClient:  map 0% reduce 0%
09/08/17 11:59:54 INFO mapred.JobClient: Task Id : 
attempt_200908171136_0019_m_00_0, Status : FAILED
java.lang.NoClassDefFoundError: com/google/gson/reflect/TypeToken
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:637)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
...

The problem seems related to the fact that mahout-core.jar is being packed 
inside examples.jar. So I modified maven/build.xml to pack the core classes 
instead (because they are available):

Index: maven/build.xml
===================================================================
--- maven/build.xml (revision 804891)
+++ maven/build.xml (working copy)
@@ -45,9 +45,9 @@
   includes="**/*.jar"/>
   <zipfileset dir="${core-lib}" prefix="lib"
   includes="**/*.jar" excludes="hadoop-*.jar"/>
-  <zipfileset dir="../core/target/" prefix="lib"
-  includes="apache-mahout-core-${version}.jar"/>
+  <zipfileset dir="../core/target/classes"/>
   <zipfileset dir="${dest}/dependency" prefix="lib"
-  includes="**/*.jar"/>
+  includes="**/*.jar"
+  excludes="apache-mahout-core-${version}.jar"/>
   <zipfileset dir="../core/target/dependency" prefix="lib"
   includes="**/*.jar"/>
 </jar>

This seems to solve the problem, but I didn't try it on all examples




Re: Error building Mahout

2009-07-23 Thread deneche abdelhakim
I'm getting it too when building from the base directory



- Original Message 
From: Robin Anil robin.a...@gmail.com
To: mahout-dev mahout-dev@lucene.apache.org
Sent: Wednesday, July 22, 2009, 19:15:38
Subject: Error building Mahout

I am getting this error when building mahout with mvn clean install -e; take
a look at the debug output. Since I am not very clear about how maven
plugins work, I would appreciate some insight into the same.

I believe copy resources is the stage where the jar files get copied
to the target folder

Robin

Console dump below



[INFO] Building jar:
/home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar
[INFO] [install:install]
[INFO] Installing
/home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar
to 
/home/robin/.m2/repository/org/apache/mahout/mahout-buildtools/0.2-SNAPSHOT/mahout-buildtools-0.2-SNAPSHOT.jar
[INFO] 
[INFO] Building Mahout Common Maven Parent
[INFO]task-segment: [clean, install]
[INFO] 
[INFO] [clean:clean]
[INFO] [site:attach-descriptor]
[INFO] [install:install]
[INFO] Installing /home/robin/lucene/trunk/maven/pom.xml to
/home/robin/.m2/repository/org/apache/mahout/mahout-parent/0.2-SNAPSHOT/mahout-parent-0.2-SNAPSHOT.pom
[INFO] 
[INFO] Building Mahout core
[INFO]task-segment: [clean, install]
[INFO] 
[INFO] 
[ERROR] BUILD ERROR
[INFO] 
[INFO] 'copy-resources' was specified in an execution, but not found
in the plugin
[INFO] 
[INFO] Trace
org.apache.maven.lifecycle.LifecycleExecutionException:
'copy-resources' was specified in an execution, but not found in the
plugin
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindExecutionToLifecycle(DefaultLifecycleExecutor.java:1359)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindPluginToLifecycle(DefaultLifecycleExecutor.java:1260)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.constructLifecycleMappings(DefaultLifecycleExecutor.java:1004)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:477)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:330)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:291)
at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:142)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:336)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:129)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:287)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
at org.codehaus.classworlds.Launcher.main(Launcher.java:375)






Re: Error building Mahout

2009-07-23 Thread deneche abdelhakim
maven 2.1.0

Deleting the local repository solves the problem; I just hope I won't have to 
do it often.



- Original Message 
From: Grant Ingersoll gsing...@apache.org
To: mahout-dev@lucene.apache.org
Sent: Wednesday, July 22, 2009, 19:42:04
Subject: Re: Error building Mahout

What version of Mvn?  Whenever I'm in doubt about Mvn, I delete the local 
repository (/home/robin/.m2/repository).


On Jul 22, 2009, at 2:15 PM, Robin Anil wrote:

 I am getting this error when building mahout with mvn clean install -e; take
 a look at the debug output. Since I am not very clear about how maven
 plugins work, I would appreciate some insight into the same.
 
 I believe copy resources is the stage where the jar files get copied
 to the target folder
 
 Robin
 
 Console dump below
 
 
 
 [INFO] Building jar:
 /home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar
 [INFO] [install:install]
 [INFO] Installing
 /home/robin/lucene/trunk/buildtools/target/mahout-buildtools-0.2-SNAPSHOT.jar
 to 
 /home/robin/.m2/repository/org/apache/mahout/mahout-buildtools/0.2-SNAPSHOT/mahout-buildtools-0.2-SNAPSHOT.jar
 [INFO] 
 
 [INFO] Building Mahout Common Maven Parent
 [INFO]task-segment: [clean, install]
 [INFO] 
 
 [INFO] [clean:clean]
 [INFO] [site:attach-descriptor]
 [INFO] [install:install]
 [INFO] Installing /home/robin/lucene/trunk/maven/pom.xml to
 /home/robin/.m2/repository/org/apache/mahout/mahout-parent/0.2-SNAPSHOT/mahout-parent-0.2-SNAPSHOT.pom
 [INFO] 
 
 [INFO] Building Mahout core
 [INFO]task-segment: [clean, install]
 [INFO] 
 
 [INFO] 
 
 [ERROR] BUILD ERROR
 [INFO] 
 
 [INFO] 'copy-resources' was specified in an execution, but not found
 in the plugin
 [INFO] 
 
 [INFO] Trace
 org.apache.maven.lifecycle.LifecycleExecutionException:
 'copy-resources' was specified in an execution, but not found in the
 plugin
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindExecutionToLifecycle(DefaultLifecycleExecutor.java:1359)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.bindPluginToLifecycle(DefaultLifecycleExecutor.java:1260)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.constructLifecycleMappings(DefaultLifecycleExecutor.java:1004)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:477)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:330)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:291)
 at 
 org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:142)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:336)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:129)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:287)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
 at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
 at 
 org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
 at org.codehaus.classworlds.Launcher.main(Launcher.java:375)

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search





Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-18 Thread deneche abdelhakim

Actually, I'm not using any reducer at all; the output of the mappers is 
collected and handled by the main program after the end of the job.

Running the job with 10 map tasks on a 10-instance (c1.medium) cluster takes 
0h 11m 39s 209; speculative execution is on, so 12 map tasks were launched.

Running the same job with 5x10 map tasks takes 0h 11m 54s 962; 59 map tasks 
were launched.

And running the same job again with 5x10 map tasks and the job parameter 
mapred.job.reuse.jvm.num.tasks=-1 (no limit on how many tasks to run per JVM) 
takes 0h 11m 57s 115.
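[Editor's note: for readers following along, a map-only job of the kind described here is just a normal Hadoop job with zero reduce tasks, so the mappers' output files land directly in the output path for the driver to read back. A minimal sketch against the old org.apache.hadoop.mapred API; the IdentityMapper and job name are stand-ins, not the actual MAHOUT-140 code.]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-sketch");
    conf.setMapperClass(IdentityMapper.class); // stand-in for the real tree-building mapper
    conf.setNumReduceTasks(0);                 // map-only: no shuffle, no reduce phase
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                    // blocks until the job finishes...
    // ...after which the driver can read the mappers' part-* files itself.
  }
}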

--- On Sat 18.7.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
 To: mahout-dev@lucene.apache.org
 Date: Saturday, July 18, 2009, 20:36
 This is interesting.
 
 Is the reduce trivial here? (if so, then shuffling isn't the problem, and
 you may have demonstrated this with your no-output version)
 
 What happens if you increase the number of maps to 5x the number of nodes?
 
 
 
 On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA)
 j...@apache.org wrote:
 
  It looks like building a single tree in a sequential manner is 2x faster
  than building the same tree with the cluster! I don't have a lot of
  experience with clusters; is that normal? Maybe 10 instances is just too
  small to get a good speedup, or maybe there is a bug hiding somewhere (I
  can hear it walking in the code when the moon...)
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 





Re: problems downloading lucene-analyzers

2009-07-01 Thread deneche abdelhakim

Thanks Robin for the hint about squid

 I'd be happy to lock down a specific snapshot (say last nights), but I 
 don't know the Maven syntax to do that.  If you can find how, let me know
 and I'll happily commit it.

I've searched on Google and it doesn't seem to be possible to lock a specific 
version of a snapshot:

http://stackoverflow.com/questions/986040/maven-attempts-to-use-wrong-snapshot-version

Finally I've been able to download the snapshots; from now on I'll just use the 
-o parameter to stay offline.

--- On Tue 30.6.09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: problems downloading lucene-analyzers
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, June 30, 2009, 15:20
 
 FWIW, it works for me.
 
 On Jun 30, 2009, at 6:54 AM, deneche abdelhakim wrote:
 
  
  I'm having problems with the lucene-analyzers
 (2.9-SNAPSHOT) dependency: because it's a snapshot, mvn
 install downloads a new version every day, and most of the
 time I get checksum failures! Is anybody else having the
 same problem?
  
  mvn -version :
  Maven version: 2.0.9
  Java version: 1.6.0_0
  OS name: linux version: 2.6.28-13-generic arch:
 i386 Family: unix
  
  
  
  
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem
 (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
 http://www.lucidimagination.com/search
 
 





problems downloading lucene-analyzers

2009-06-30 Thread deneche abdelhakim

I'm having problems with the lucene-analyzers (2.9-SNAPSHOT) dependency: because 
it's a snapshot, mvn install downloads a new version every day, and most of the 
time I get checksum failures! Is anybody else having the same problem?

mvn -version :
Maven version: 2.0.9
Java version: 1.6.0_0
OS name: linux version: 2.6.28-13-generic arch: i386 Family: unix






[GSOC] Accepted Students

2009-04-21 Thread deneche abdelhakim

Hi, 

=D

I've been accepted. And I'll be working on Random Forests 

=P

Given it's my second participation, I have one piece of advice: don't be shy to 
ask about anything related to your project on this list (starting from now); 
it's the fastest way to learn about Mahout.

Who else has been accepted ?

-
abdelhakim





Re: [GSOC] Accepted Students

2009-04-21 Thread deneche abdelhakim

Hi David, welcome to Mahout =)

The How To Contribute wiki page is a must-read; it gives you a quick overview 
of everything you'll need to know when contributing to Mahout.

In my own experience you'll also need to:
* know how to build the latest version of Mahout:

http://cwiki.apache.org/MAHOUT/buildingmahout.html

although, depending on your project, you may skip the Taste Web part if you're 
not working with Taste.

* know how to run an example in Hadoop, at least in pseudo-distributed:

http://hadoop.apache.org/core/docs/current/quickstart.html

--- On Tue 21.4.09, David Hall d...@cs.stanford.edu wrote:

 From: David Hall d...@cs.stanford.edu
 Subject: Re: [GSOC] Accepted Students
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, April 21, 2009, 8:30
 On Mon, Apr 20, 2009 at 11:18 PM,
 deneche abdelhakim a_dene...@yahoo.fr
 wrote:
 
  Hi,
 
  =D
 
  I've been accepted. And I'll be working on Random
 Forests
 
  =P
 
  Given it's my second participation, I have one piece of advice:
 don't be shy to ask about anything related to your project
 on this list (starting from now); it's the fastest way to
 learn about Mahout.
 
  Who else has been accepted ?
 
 I'm here. I'll be working on Latent Dirichlet Allocation.
 
 As for questions, what am I supposed to be reading during
 this
 community building period? I see:
 
 * http://cwiki.apache.org/MAHOUT/howtocontribute.html
 * http://www.apache.org/foundation/how-it-works.html
 
 plus skimming javadocs.
 
 Other suggestions? Either general, or more specific to my
 project?
 
 -- David
 
 
  -
  abdelhakim
 
 
 
 
 





Re: [gsoc] random forests

2009-03-31 Thread deneche abdelhakim

Here is a draft of my proposal

**
Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests

Student: AbdelHakim Deneche
Student e-mail: ...

Student Major: Phd in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011

Organization: The Apache Software Foundation
Assigned Mentor:


Abstract:

My goal is to add the power of random/regression forests to Mahout. At the end 
of this summer, one should be able to build random/regression forests for large, 
possibly distributed, datasets, store the forest, and reuse it to classify new 
data. In addition, a demo on EC2 is planned.


Detailed Description:

This project is all about random/regression forests. The core component is the 
tree-building algorithm, which works from a random bootstrap of the whole 
dataset. I already wrote a detailed description on the Mahout Wiki 
[RandomForests]. Given the size of the dataset, two distributed implementations 
are possible:

1. The most straightforward one deals with relatively small datasets. By small, 
I mean a dataset that can be replicated on every node of the cluster. 
Basically, each mapper has access to the whole dataset, so if the forest 
contains N trees and we have M mappers, each mapper runs the core building 
algorithm N/M times. This implementation is, relatively, easy because each 
mapper runs the basic building algorithm as it is. It is also of great 
interest if the user wants to try different parameters when building the 
forest. An out-of-core implementation is also possible to deal with datasets 
that cannot fit into the node memory.

2. The second implementation, which is the most difficult, is concerned with 
very large datasets that cannot fit in every machine of the cluster. In this 
case the mappers work differently, each mapper has access to a subset from the 
dataset, thus all the mappers collaborate to build each tree of the forest. The 
core building algorithm must thus be rewritten in a map-reduce form. This 
implementation can deal with datasets of any size, as long as they are on the 
cluster.
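
[Editor's note: to make implementation 1 concrete, the only distribution logic it needs is deciding how many of the N trees each of the M mappers builds. A minimal sketch of that split; the class name is made up for illustration, and in a real job the mapper index could come e.g. from the mapred.task.partition job property.]

public class ForestPartition {

  /** Number of trees the mapper with the given index should build. */
  static int treesForMapper(int numTrees, int numMappers, int mapperIndex) {
    int base = numTrees / numMappers;
    // spread the remainder over the first (numTrees % numMappers) mappers
    return base + (mapperIndex < numTrees % numMappers ? 1 : 0);
  }

  public static void main(String[] args) {
    int n = 100; // trees in the forest
    int m = 8;   // mappers
    int total = 0;
    for (int i = 0; i < m; i++) {
      total += treesForMapper(n, m, i);
    }
    System.out.println(total == n); // prints true: every tree is built exactly once
  }
}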

Although the first implementation is easier to build, the CPU and IO 
overhead of the out-of-core implementation is still unknown. A reference, 
non-parallel, implementation should thus be built to better understand the 
effects of the out-of-core implementation, especially for large datasets. This 
reference implementation is also useful to assess the correctness of the 
distributed implementation.


Working Plan and list of deliverables

Must-Have:
1. reference implementation of Random/Regression Forests Building Algorithm:
 . Build a forest of trees: the basic algorithm (described in the wiki) takes a 
subset of the dataset as a training set and builds a decision tree. This 
algorithm is repeated for each tree of the forest.
 . The forest is stored in a file; this way it can be re-used, at any time, to 
classify new cases.
 . At this step, the necessary changes to Mahout's Classifier interface are 
made to extend its use to more than Text datasets.

2. Study the effects of large datasets on the reference implementation
 . This step should guide our choice of the proper parallel implementation

3. Parallel implementation, choose one of the following:
 3a. Parallel implementation A
  . When the dataset can be replicated to all computing nodes.
  . Each mapper has access to the whole dataset, if the forest contains N trees 
and we have M mappers, each mapper runs the basic building algorithm N/M times. 
The mapper is also responsible for computing the out-of-bag error estimation.
  . The reducer stores the trees in the RF file and merges the oob error 
estimations.
 3b. Parallel implementation B:
 . When the dataset is so big that it can no longer fit on every computing 
node, it must be distributed over the cluster.
 . Each mapper has access to a subset from the dataset, thus all the mappers 
collaborate to build each tree of the forest.
 . In this case, the basic algorithm must be rewritten to fit in the map-reduce 
paradigm.

Should-Have:
4. Run the Random Forest with a real dataset on EC2:
 . This step is important, because running the RF on a local dual core machine 
is different from running it on a real cluster with a real dataset.
 . This can make a good demo for Mahout
 . Amazon has put some interesting datasets to play with [PublicDatasets].
   The US Census dataset comes in various sizes ranging from 2GB to 200GB, and 
should make a very good example.
 . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix 
Reader).

Wanna-Have:
5. If there is still time, implement one or two other important features of RFs 
such as Variable importance and Proximity estimation


Additional Information:
I am a PhD student at the University Mentouri of Constantine. My primary 
research goal is a framework to help build Intelligent Adaptive Systems. For 
the purpose of my Master, I worked on 

Re: [gsoc] random forests

2009-03-30 Thread deneche abdelhakim

Thank you for your answer; it just made me aware of many possible future 
problems with my implementation.

 The first is that for any given application, the odds that
 the data will not fit in a single machine are small, especially if you 
 have an out-of-core tree builder.  Really, really big datasets are
 increasingly common, but are still a small minority of all datasets.

By out-of-core, you mean the builder can fetch the data directly from a file 
instead of working from memory only?

 One question I have about your plan is whether your step (1) involves
 building trees or forests only from data held in memory or whether it 
 can be adapted to stream through the data (possibly several
 times).  If a streaming implementation is viable, then it may well be 
 that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's 
DistributedCache. I think that a streaming implementation is feasible: the 
basic tree building algorithm (described here 
http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream 
through the data (either in-memory or from a file) for each node of the tree. 
During this pass, it computes the information gain (IG) for the selected 
variables. 
This algorithm could be improved to compute the IG's for a list of nodes, thus 
reducing the total number of passes through the data. When building the forest, 
the list of nodes comes from all the trees built by the mapper.
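
[Editor's note: a rough sketch of the batched pass described above — one scan of the data updates the split statistics of every open node at once, instead of one scan per node. The Instance and Node types are hypothetical stand-ins, not Mahout classes.]

import java.util.List;

// Hypothetical types, for illustration only: an Instance is one row of the
// dataset; a Node is an open tree node waiting for its best split.
interface Instance {}

interface Node {
  boolean covers(Instance instance);  // does the instance reach this node of its tree?
  void accumulate(Instance instance); // update the IG counters of the m selected variables
  void chooseBestSplit();             // derive the best split from the accumulated counters
}

public final class BatchedPass {

  // One streaming pass over the (possibly on-disk) data feeds every open node
  // in the frontier at once, instead of one full pass per node. Fewer passes,
  // more memory: each node keeps its counters alive until the pass ends, which
  // is exactly the trade-off described above.
  static void computeInformationGains(Iterable<Instance> data, List<Node> frontier) {
    for (Instance instance : data) {
      for (Node node : frontier) {
        if (node.covers(instance)) {
          node.accumulate(instance);
        }
      }
    }
    for (Node node : frontier) {
      node.chooseBestSplit();
    }
  }
}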

 Another way to put this is that the key question is how single node
 computation scales with input size.  If the scaling is relatively linear
 with data size, then your approach (3) will work no matter the data size.
 If scaling shows an evil memory size effect, then your approach (2) 
 would be required for large data sets.

I'll have to run some tests before answering this question, but I think that 
the memory usage of the improved algorithm (described above) will mainly go to 
storing the IG computations (variable probabilities...). One way to 
limit the memory usage is to limit the number of tree-nodes computed at each 
data pass. Increasing this limit should reduce the data passes but increase the 
memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle: 
very large datasets that cannot fit on the node's hard drive, and thus must be 
distributed across the cluster.

abdelHakim
--- On Mon 30.3.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Monday, March 30, 2009, 0:59
 I have two answers for you.
 
 The first is that for any given application, the odds that
 the data will not
 fit in a single machine are small, especially if you have
 an out-of-core
 tree builder.  Really, really big datasets are
 increasingly common, but are
 still a small minority of all datasets.
 
 The second answer is that the odds that SOME mahout
 application will be too
 large for a single node are quite high.
 
 These aren't contradictory.  They just describe the
 long-tail nature of
 problem sizes.
 
 One question I have about your plan is whether your step
 (1) involves
 building trees or forests only from data held in memory or
 whether it can be
 adapted to stream through the data (possibly several
 times).  If a streaming
 implementation is viable, then it may well be that
 performance is still
 quite good for small datasets due to buffering.
 
 If streaming works, then a single node will be able to
 handle very large
 datasets but will just be kind of slow.  As you point
 out, that can be
 remedied trivially.
 
 Another way to put this is that the key question is how
 single node
 computation scales with input size.  If the scaling is
 relatively linear
 with data size, then your approach (3) will work no matter
 the data size.
 If scaling shows an evil memory size effect, then your
 approach (2) would be
 required for large data sets.
 
 On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
 
  My question is : when Mahout.RF will be used in a real
 application, what
  are the odds that the dataset will be so large that it
 can't fit on every
  machine of the cluster ?
 
  the answer to this question should help me decide
 which implementation I'll
  choose.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 
 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 www.deepdyve.com
 408-773-0110 ext. 738
 858-414-0013 (m)
 408-773-0220 (fax)
 





Re: [gsoc] random forests

2009-03-28 Thread deneche abdelhakim

In point 2a below, the sentence should read:

". This implementation is, relatively, easy given..."

--- On Sat 28.3.09, deneche abdelhakim a_dene...@yahoo.fr wrote:

 From: deneche abdelhakim a_dene...@yahoo.fr
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Saturday, March 28, 2009, 16:14
 
 I'm actually writing my working plan, and it looks like
 this:
 
 *
 1. reference implementation of Random/Regression Forests
 Building Algorithm: 
  . Build a forest of trees, the basic algorithm (described
 in the wiki) takes a subset from the dataset as a training
 set and builds a decision tree. This basic algorithm is
 repeated for each tree of the forest. 
  . The forest is stored in a file, this way it can be used
 later to classify new cases
 
 2a. distributed Implementation A: 
  . When the dataset can be replicated to all computing
 nodes.
  . Each mapper has access to the whole dataset, if the
 forest contains N trees and we have M mappers, each mapper
 runs the basic building algorithm N/M times.
  . This implementation is, relatively, given that the
 reference implementation is available, because each mapper
 runs the basic building algorithm as it is.
 
 2b. Distributed Implementation B:
  . When the dataset is so big that it can no longer fit on
 every computing node, it must be distributed over the
 cluster. 
  . Each mapper has access to a subset from the dataset,
 thus all the mappers collaborate to build each tree of the
 forest.
  . In this case, the basic algorithm must be rewritten to
 fit in the map-reduce paradigm.
 
 3. Run the Random Forest with a real dataset on EC2:
  . This step is important, because running the RF on a
 local dual core machine is way different from running it on
 a real cluster with a real dataset.
  . This can make for a good demo for Mahout
 
 4. If there is still time, implement one or two other
 important features of RFs such as Variable importance and
 Proximity estimation
 *
 
 It is clear from the plan that I won't be able to do all
 those steps, and in some way I must choose only one
 implementation (2a or 2b) to do. The first implementation
 should take less time to implement than 2b and I'm quite
 sure I can go up to the 4th step, adding other features to
 the RF. BUT the second implementation is the only one
 capable of dealing with very large distributed datasets.
 
 My question is : when Mahout.RF will be used in a real
 application, what are the odds that the dataset will be so
 large that it can't fit on every machine of the cluster ? 
 
 the answer to this question should help me decide which
 implementation I'll choose.
 
 --- On Sun 22.3.09, Ted Dunning ted.dunn...@gmail.com wrote:
 
  From: Ted Dunning ted.dunn...@gmail.com
  Subject: Re: [gsoc] random forests
  To: mahout-dev@lucene.apache.org
  Date: Sunday, March 22, 2009, 0:36
  Great expression!
  
  You may be right about the nose-bleed tendency between
 the
  two methods.
  
  On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim
 a_dene...@yahoo.frwrote:
  
   I can't find a no-nose-bleeding algorithm
  
  
  
  
  -- 
  Ted Dunning, CTO
  DeepDyve
  
 
 
 
 





Re: GSoC 2009-Discussion

2009-03-24 Thread deneche abdelhakim

talking about Random Forests, I think there are two possible ways to actually 
implement them:

The first implementation is useful when the dataset is not that big (<= 2GB 
perhaps) and thus can be distributed via Hadoop's DistributedCache. In this 
case each mapper has access to the whole dataset and builds a subset of the 
forest.
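
[Editor's note: for the first case, shipping the dataset with Hadoop's DistributedCache would look roughly like this; the dataset path is made up for illustration.]

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class DatasetShipping {

  // Driver side: register the dataset (path made up for illustration) so
  // Hadoop copies it once to the local disk of every computing node.
  public static void shipDataset(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/datasets/forest-input.data"), conf);
  }

  // Mapper side, e.g. from Mapper.configure(JobConf): locate the local copy,
  // which the mapper can then load fully into memory before any map() call.
  public static Path localDataset(JobConf conf) throws Exception {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    return localFiles[0];
  }
}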

The second one is related to large datasets, and by large I mean datasets that 
cannot fit on every computing node. In this case each mapper processes a subset 
of the dataset for all the trees.

I'm more interested in the second implementation, so maybe Samuel would be 
interested in the first... but of course the community may actually need them 
both :)

--- On Tue 24.3.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: GSoC 2009-Discussion
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, March 24, 2009, 0:07
 There are other algorithms of serious
 interest.  Bayesian Additive
 Regression Trees (BART) would make a very interesting
 complement to Random
 Forests.  I don't know how important it is to get a
 normal decision tree
 algorithm going because the cost to build these is often
 not that high.
 Boosted decision trees might be of interest, but probably
 not as much as
 BART.
 
 It might also be interesting to work with this student to
 implement some of
 the diagnostics associated with random forests.  There
 is plenty to do.
 
 
 - Original Message 
 
   From: Samuel Louvan samuel.lou...@gmail.com
 
  My questions:
   - I just noticed in the mailing archive that another student is also
   pretty serious about implementing the random forest algorithm. Should I
   select decision trees instead? (for my future GSoC proposal)
   - Actually I found it would be interesting if I could combine Apache
   Nutch and Mahout: the idea is to implement web page segmentation + a
   classifier inside a web crawler. By doing this, a crawler, for instance,
   can use the output of the classification to only follow certain links
   that lie on informative content parts.
   Is this interesting / does it make sense for you guys?
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 





Re: [gsoc] random forests

2009-03-17 Thread deneche abdelhakim

Yeah, Breiman states that 

at each node, m variables are selected at random out of the M

I modified the wiki page: in LearnUnprunedTree(X,Y), which builds the tree 
iteratively one node at a time, I added this line:

select m variables at random out of the M variables

before searching the best split

For j = 1 .. m
  ...
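
[Editor's note: for illustration, that per-node selection step (m distinct variables out of M, re-drawn at every node, not once per tree) can be done with a partial Fisher-Yates shuffle; a minimal sketch with a made-up class name.]

import java.util.Random;

public final class VariableSelection {

  // Pick m distinct variable indices out of M, re-drawn at each tree node
  // (not once per tree), via a partial Fisher-Yates shuffle.
  static int[] selectVariables(Random rng, int m, int totalVariables) {
    int[] vars = new int[totalVariables];
    for (int i = 0; i < totalVariables; i++) {
      vars[i] = i;
    }
    for (int i = 0; i < m; i++) {
      // swap a random not-yet-chosen index into position i
      int j = i + rng.nextInt(totalVariables - i);
      int tmp = vars[i]; vars[i] = vars[j]; vars[j] = tmp;
    }
    int[] selected = new int[m];
    System.arraycopy(vars, 0, selected, 0, m);
    return selected;
  }
}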

--- On Mon 16.3.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Monday, March 16, 2009, 7:26
 Nice writeup.
 
 One thing that I was confused about for a long time is
 whether the choice of
 variables to use for splits is chosen once per tree or
 again at each split.
 
 I think that the latter interpretation is actually the
 correct one.  You
 should check my thought.
 
 On Sun, Mar 15, 2009 at 1:53 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
 
  I added a page to the wiki that describes how to build
 a random forest and
  how to use it to classify new cases.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 





[gsoc] random forests

2009-03-15 Thread deneche abdelhakim

I added a page to the wiki that describes how to build a random forest and how 
to use it to classify new cases.

http://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests







Re: Mahout for 1.5 JVM

2009-03-09 Thread deneche abdelhakim

The following classes use the Deque interface, which is not available in Java 
1.5:

. org.apache.mahout.classifier.bayes.BayesClassifier
. org.apache.mahout.classifier.cbayes.CBayesClassifier
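
[Editor's note: if 1.5 compatibility were wanted, those Deque usages could be swapped for LinkedList, whose addFirst/removeFirst methods predate Java 6. A minimal sketch of the substitution; the element type is made up.]

import java.util.LinkedList;

public class DequeFree {
  public static void main(String[] args) {
    // java.util.Deque (and ArrayDeque) only appeared in Java 6; LinkedList
    // has offered the same ends-of-list operations since long before that.
    LinkedList<String> stack = new LinkedList<String>();
    stack.addFirst("first");
    stack.addFirst("second");
    System.out.println(stack.removeFirst()); // prints "second" (LIFO order)
  }
}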


--- On Mon 9.3.09, Sean Owen sro...@gmail.com wrote:

 From: Sean Owen sro...@gmail.com
 Subject: Re: Mahout for 1.5 JVM
 To: mahout-dev@lucene.apache.org
 Date: Monday, March 9, 2009, 22:17
 Yeah I don't know of anything in my
 bits that actually uses a Java
 6-only class, but could be proved wrong there. You can dig
 out my old
 build.xml file in a pinch to build just this bit -- I can
 write up a
 quick Ant build for you too for the same purpose.
 
 You do need to make sure you compile with Java 6 since I do
 surely use
 stuff like @Override on methods implementing interface
 methods which
 isn't allowed in Java 5, but which javac in Java 6 can take
 care of if
 source is 6 and target is 5.
 
 On Mon, Mar 9, 2009 at 9:13 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com
 wrote:
 
  Hm, yeah, 1.6 because of Hadoop, I forgot about that.
  I need only the Tasty part of Mahout, though, and that one
 doesn't really need to run on Hadoop.  Any way to build
 just that (for 1.5)?
 





Re: Google SoC 2009

2009-03-03 Thread deneche abdelhakim

I'm seriously considering Random Forests (RF) as my GSoC project; they seem 
interesting, and judging by how often they have been suggested, they would be 
very useful to Mahout. I found the following discussion:

http://markmail.org/message/dancn3n76ken6thb

which gives a lot of useful information about RF; Breiman's web site also 
contains a very clear description of the algorithm and its possible uses. 

A question though: the most basic use of RF is as a classifier. Does that mean 
it must implement the org.apache.mahout.common.Classifier interface? I'm not 
quite sure, but that interface seems dedicated to classifying text documents, 
whereas RF could be useful for any kind of dataset.

--- On Fri 27.2.09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Google SoC 2009
 To: mahout-dev@lucene.apache.org
 Date: Friday, February 27, 2009, 18:34
 Priority is in the eye of the beholder in Apache land, so
 scratch the itch you are most interested in.  Ultimately,
 we're interested in having a suite of ML libraries, but
 you certainly could do worse than to pick something that has
 proven to be useful, stable and well-used by lots of people
 over time.   I think several of them have been suggested on
 another related thread, but things like neural nets, linear
 regression, random forests, self organizing maps are all of
 interest.
 
 -Grant
 
 On Feb 24, 2009, at 12:04 PM, Siddharth Prakash Singh
 wrote:
 
  Hi,
  
  No, I don't have any specific interest. I would
 rather like to work on
  implementing algorithm which is of most priority.
  
  Awaiting a response.
  Siddharth
  
  On Sat, Feb 21, 2009 at 2:43 AM, Isabel Drost
 isa...@apache.org wrote:
  On Friday 20 February 2009, Siddharth Prakash
 Singh wrote:
  I wish to contribute to mahout as google soc
 participant this year. I
  am interested in implementing a Map/Reduce
 enabled machine learning
  algo.
  Any suggestions please?
  
  Welcome Siddharth. Is there anything machine
 learning specific that interests
  you in particular?
  
  You can also have a look in the Mahout Wiki as
 well as the jira to find out
  more on which algorithms are already available and
 which are still missing.
  
  Isabel
  
  
  
  
  
  
  --Siddharth Prakash Singh
  http://www.spsneo.com
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem
 (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
 http://www.lucidimagination.com/search





GSoC 2009 proposition

2009-02-26 Thread deneche abdelhakim

Hi,
I'm planning to participate in GSoC again, and I want to do it, again, with 
Mahout.
This year, let's make Mahout run on Amazon EC2. This means building the proper 
AMIs, running some Mahout projects (the GA examples) on EC2, giving feedback, 
and writing simple, clear How-Tos about running a Mahout project on EC2.

The Mahout.GA examples (TSP and CDGA) should be good real-world scenarios about 
how one may need to use Mahout.GA on EC2. The TSP example should be modified to 
be able to run on a console and to load TSPLIB benchmarks, thus we can tackle 
more challenging TSP problems with the help of EC2. The CDGA example should run 
unmodified given, of course, that Hadoop is configured correctly on EC2 and the 
benchmark is on HDFS.

These two examples will give us three use cases for Mahout on EC2:

1. TSP can be run on a single, High-CPU, EC2 instance. In this case, 
Watchmaker's ConcurrentEvolutionEngine should take care of the multi-threading 
part (or at least I hope!) and there will be no need for Hadoop;

2. TSP can also be run over multiple EC2 instances with the help of Hadoop;

3. CDGA not only needs Hadoop to run, but its data should be on HDFS.


So what do you think, is the elephant ready for a walk on EC2 ?


 


Re: GSoC 2009 proposition

2009-02-26 Thread deneche abdelhakim

Thanks for your fast answers :) I'll rethink this and post as soon as I get 
something


--- On Thu 26.2.09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: GSoC 2009 proposition
 To: mahout-dev@lucene.apache.org
 Date: Thursday, February 26, 2009, 16:20
 You might have a look at
 http://www.lucidimagination.com/search/document/5ab9ddafa19ee04b/thought_offering_ec2_s3_based_services#2d096f39b02ec289
 for some background thoughts.
 
 I think it's a nice idea and I've been meaning to
 use my Amazon credits for just such a thing for a while now,
 but not sure how high priority it is.
 
 You might consider extending/altering this thought to have
 more of a focus on developing demos (including code) of
 Mahout with real data sets on larger scale systems.  Part of
 this might involve showing people how to do this on EC2, but
 the bigger focus to me should be on demoing/documenting
 Mahout's capabilities, versus showing how to run Mahout
 on any particular platform.
 
 
 On Feb 26, 2009, at 9:58 AM, deneche abdelhakim wrote:
 
  
  Hi,
  I'm planning to participate in GSoC again, and I want
 to do it, again, with Mahout.
  This year, let's make Mahout run on Amazon EC2. This
 means building the proper AMIs, running some Mahout projects
 (the GA examples) on EC2, giving feedback, and writing simple,
 clear How-Tos about running a Mahout project on EC2.
  
  The Mahout.GA examples (TSP and CDGA) should be good
 real-world scenarios about how one may need to use Mahout.GA
 on EC2. The TSP example should be modified to be able to run
 on a console and to load TSPLIB benchmarks, thus we can
 tackle more challenging TSP problems with the help of EC2.
 The CDGA example should run unmodified given, of course,
 that Hadoop is configured correctly on EC2 and the
 benchmark is on HDFS.
  
  These two examples will give us three use cases for
 Mahout on EC2:
  
  1. TSP can be run on a single, High-CPU, EC2 instance.
 In this case, Watchmaker's ConcurrentEvolutionEngine
 should take care of the multi-threading part (or at least I
 hope!) and there will be no need for Hadoop;
  
  2. TSP can also be run over multiple EC2 instances
 with the help of Hadoop;
  
  3. CDGA not only needs Hadoop to run, but its data
 should be on HDFS.
  
  
  So what do you think, is the elephant
 ready for a walk on EC2 ?
  
  
  
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem
 (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
 http://www.lucidimagination.com/search





Re: Towards 0.1

2009-01-30 Thread deneche abdelhakim
About MAHOUT-102 (https://issues.apache.org/jira/browse/MAHOUT-102): the patch 
is already available, if someone could just commit it.

Also, I'm not able to make my patches delete files (or directories) when 
applied; is it because I'm not a committer or because I'm using TortoiseSVN?

--- On Thu 29.1.09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Towards 0.1
 To: mahout-dev@lucene.apache.org
 Date: Thursday, January 29, 2009, 2:41
 Feel free to move.  I put in a Maven one, but it sounds like
 it is already fixed.  Even building the candidate tonight,
 spawns at least 3 days for voting to take place.
 
 I definitely agree on getting 0.1 out.
 
 
 On Jan 28, 2009, at 1:16 PM, Sean Owen wrote:
 
  How about moving them all to 0.2 right now? There was
 essentially a
  consensus for this last week. I worry that we have
 been stuck in the
  current state for a while and don't want to
 continue indefinitely.
  0.1, as the value implies, can be far from perfect.
 There is a
  negative consequence right now to having some
 publicity but no
  downloadable release at all. I'd prefer to get
 that release built
  tonight (from a kind person with gpg, pass it around
 for a last look
  that everything is in there properly, and post it.
  
  On Wed, Jan 28, 2009 at 2:39 PM, Grant Ingersoll
 gsing...@apache.org wrote:
  Yep, I'm looking at trying out the stuff.  I
 think we need to go through the
  unresolved issues for 0.1 and either move them to
 0.2 or close them.





Re: Re: @Override annotations

2009-01-22 Thread deneche abdelhakim
When you say 1.5, do you mean the 1.5 JDK (or JRE, in the case of Eclipse)?

Because I just tried to compile the Mahout trunk in Eclipse using JRE 1.5.0_11 
(and the 5.0 compliance level, of course), and got 628 errors in core/src (I 
didn't check core/test yet). 

The first 100 errors are as follows:
. 98 errors related to the @Override, for example:

The method accept(File) of type BayesFileFormatter.FileProcessor must override 
a superclass method  mahout-core/src/org/apache/mahout/classifier
BayesFileFormatter.java line 160

. 2 errors related to "Deque cannot be resolved" in
  . org.apache.mahout.classifier.bayes.BayesClassifier, and
  . org.apache.mahout.classifier.cbayes.CBayesClassifier

Maybe I'm wrong, but Deque is only available in 1.6, no?
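
[Editor's note: for anyone curious why the @Override errors show up at the 5.0 compliance level, a minimal example with made-up names: in Java 5, @Override may only mark a method that overrides a superclass method, while javac with -source 1.6 also accepts it on interface-method implementations.]

interface Runner {
  void run();
}

class FastRunner implements Runner {
  // Accepted by javac with -source 1.6; a 1.5 compiler reports an error here,
  // because in Java 5 @Override may only mark a method that overrides a
  // superclass method, not one implementing an interface method.
  @Override
  public void run() {
  }
}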

--- On Thu 22.1.09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: Re: @Override annotations
 To: mahout-dev@lucene.apache.org
 Date: Thursday, January 22, 2009, 10:05
 I think mahout should compile with both 1.5 and 1.6.
 
 On Wed, Jan 21, 2009 at 11:23 PM, deneche abdelhakim
 a_dene...@yahoo.fr wrote:
 
  Last time I tried to compile the Mahout trunk, I got a
 similar problem. In
  my case, I'm using Eclipse and the errors were
 caused by the JDK Compliance
  Level (in the project properties). In short, I was
 using JVM 1.6 JRE but
  with 5.0 compliance level (forgot to change it !).
 
  I found the answer in the following link:
 
 
 http://dev.eclipse.org/newslists/news.eclipse.newcomer/msg19329.html
 
 
  --- On Thu 22.1.09, Jeff Eastman j...@windwardsolutions.com wrote:
 
   From: Jeff Eastman j...@windwardsolutions.com
   Subject: @Override annotations
   To: mahout-dev@lucene.apache.org
   Date: Thursday, January 22, 2009, 6:07
   I'm trying to compile the latest Mahout trunk
 on my
   MacBook using the JVM 1.6.0 JRE and the @Override
   annotations are causing a lot of errors. There
 must be a
   simple solution to this problem but I cannot
 recall it. Can
   somebody help?
  
   Jeff
 
 
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 4600 Bohannon Drive, Suite 220
 Menlo Park, CA 94025
 www.deepdyve.com
 650-324-0110, ext. 738
 858-414-0013 (m)





Re: @Override annotations

2009-01-21 Thread deneche abdelhakim
Last time I tried to compile the Mahout trunk, I got a similar problem. In my 
case, I'm using Eclipse, and the errors were caused by the JDK Compliance Level 
(in the project properties). In short, I was using the JVM 1.6 JRE but with the 
5.0 compliance level (I forgot to change it!).

I found the answer in the following link:

http://dev.eclipse.org/newslists/news.eclipse.newcomer/msg19329.html


--- On Thu 22.1.09, Jeff Eastman j...@windwardsolutions.com wrote:

 From: Jeff Eastman j...@windwardsolutions.com
 Subject: @Override annotations
 To: mahout-dev@lucene.apache.org
 Date: Thursday, January 22, 2009, 6:07
 I'm trying to compile the latest Mahout trunk on my
 MacBook using the JVM 1.6.0 JRE and the @Override
 annotations are causing a lot of errors. There must be a
 simple solution to this problem but I cannot recall it. Can
 somebody help?
 
 Jeff





Re: More proposed changes across code

2008-10-20 Thread deneche abdelhakim
 5. BruteForceTravellingSalesman says copyright Daniel
 Dwyer -- can
 this be replaced by the standard copyright header?

Oops, I thought I changed them all! Yes, you can replace it.



Re: More proposed changes across code

2008-10-20 Thread deneche abdelhakim



--- On Sun 19.10.08, Grant Ingersoll [EMAIL PROTECTED] wrote:

 From: Grant Ingersoll [EMAIL PROTECTED]
 Subject: Re: More proposed changes across code
 To: mahout-dev@lucene.apache.org
 Date: Sunday, October 19, 2008, 18:30
 On Oct 19, 2008, at 11:16 AM, Sean Owen wrote:
 
  On Sun, Oct 19, 2008 at 4:07 PM, Grant Ingersoll  
  [EMAIL PROTECTED] wrote:
  Doesn't the javadoc tool used @inherit to fill
 in the inherited  
  docs when
  viewing?
 
  Yes... I suppose I find that redundant. The subclass
 method gets
  documented exactly as the superclass does. It looks
 like the subclass
  had been explicitly documented, when it hadn't
 been. I think its
  intent is to copy in documentation and add to it; I am
 thinking only
  of cases where the javadoc only has a single element,
 [EMAIL PROTECTED]
 
 
  3. UpdatableFloat/Long -- just use Float[1] /
 Long[1]? these classes
  don't seem to be used.
 
  Hmmm, they were used, but sure that works too.
 
  I can't find any usages of these classes, where
 are they?
 
 Right, they aren't used any longer.  Feel free to
 remove.
 
 
 
 
  5. BruteForceTravellingSalesman says
 copyright Daniel Dwyer -- can
  this be replaced by the standard copyright
 header?
 
  No, this is in fact his code, licensed under the
 ASL.  I believe  
  the current
  way we are handling it is correct.  The original
 code is his, and  
  the mods
  are ours.
 
  Roger that, will leave it. But two notes then...
  - what about all the other code that game from
 watchmaker? all the
  classes in the package say they came from watchmaker
  - I was told that for my stuff, yeah, I still own the
 code/copyright
  but am licensing a copy to this project, and so it all
 just gets
  licensed within Mahout according to the boilerplate
 which says
  Licensed to the ASF...
 
  I'm not a lawyer and don't want to pick nits
 but I do want to take
  extra care to get licensing right.
 
 Right.  I believe the difference is you donated your code
 to the ASF,  
 Daniel has merely published his code under the ASL, but has
 not  
 donated to the ASF.  It's a subtle distinction, I
 suppose.Any of  
 the classes that came from watchmaker should say that,
 although I know  
 many were developed by Deneche for the Watchmaker API.  We
 can go  
 review them again.

In the case of the travellingSalesman example, I modified the original code to 
use Mahout where needed. My own modifications are a couple of lines in two or 
three classes; I included a readme.txt that describes the modified code and 
links to the original one. I replaced all the copyright headers with the 
standard one (I forgot BruteForceTravellingSalesman.java) and added a link to 
the original code in the class comments.
I've been reading the Apache License 2.0; I'm not a lawyer, but if I'm not 
mistaken, the travellingSalesman code included with Mahout is a Derivative 
Work of the original code, so we need to:
. point out in the modified files that they have been changed; these files are: 
StrategyPanel.java, TravellingSalesman.java and 
EvolutionaryTravellingSalesman.java.
. because the Watchmaker library contains a NOTICE.TXT file, include in Mahout 
a readable copy of the attribution notices contained within Watchmaker's 
NOTICE file.



Re: Mahout on EC2

2008-09-22 Thread deneche abdelhakim
OK, in this case it's the main program that has a Swing GUI; the Map-Reduce 
jobs have no GUIs at all. But yeah, it's always good to separate the GUI code 
from the logic.
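
[Editor's note: for what it's worth, the main program could even pick between the two paths at runtime; a minimal sketch using the standard GraphicsEnvironment.isHeadless() check — the console/GUI entry points named in the comments are hypothetical.]

import java.awt.GraphicsEnvironment;

public class GuiOrConsole {
  public static void main(String[] args) {
    // On a display-less EC2 node (no X11), take the console path instead of
    // letting Swing fail with a HeadlessException.
    if (GraphicsEnvironment.isHeadless()) {
      System.out.println("No display found; running the console version...");
      // runConsoleVersion(args);                        // hypothetical entry point
    } else {
      System.out.println("Display available; launching the Swing GUI...");
      // new TravellingSalesmanFrame().setVisible(true); // hypothetical entry point
    }
  }
}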


--- On Sun 21.9.08, Ted Dunning [EMAIL PROTECTED] wrote:

 From: Ted Dunning [EMAIL PROTECTED]
 Subject: Re: Mahout on EC2
 To: mahout-dev@lucene.apache.org
 Date: Sunday, September 21, 2008, 23:08
 For the master machine that launches the map-reduce
 computation, you can
 tunnel an X display from somewhere else to display swing
 applications.
 
 You will also need to do the separation for the reason that
 Sean says... you
 will be running on many machines.
 
 On Sat, Sep 20, 2008 at 2:34 AM, Sean Owen
 [EMAIL PROTECTED] wrote:
 
  I think you can run a program that uses Swing - unless
 I am wrong this
  no longer result in an error when running on a
 'headless' machine -
  for example a box without X11.
 
  But no I don't think there is anyway to interact
 with it, especially
  considering you might be running on many machines at
 once.
 
  But the same is true of the console - you won't be
 able to interact
  with the program that way either.
 
  It does sound good, in any event, to separate out
 Swing client code
  from the core logic.
 
  On 9/20/08, deneche abdelhakim
 [EMAIL PROTECTED] wrote:
   Sounds cool :)
  
   I'll do the TSP part, but it may take some
 time because I'm a bit busy
   (PhD's administrative stuff).
  
   There are many available large TSP benchmarks,
 and it seems that there is
  a
   common file format for them TSPLIB
   (
 
 http://www.informatik.uni-heidelberg.de/groups/comopt/software/TSPLIB95/DOC.PS
  ).
   So the TSP example should be modified to load
 those benchmark files.
  
    I have a question about EC2: can you run Java Swing programs and see
    the GUI (the TSP example has a Swing GUI), or should we make a console
    version of the example?
  
    --- On Fri 19.9.08, Grant Ingersoll [EMAIL PROTECTED] wrote:
   
    From: Grant Ingersoll [EMAIL PROTECTED]
    Subject: Mahout on EC2
    To: mahout-dev@lucene.apache.org
    Date: Friday, September 19, 2008, 17:18
   Amazon has generously donated some credits,
 so I plan on
   putting
   Mahout up and doing some testing.  Was
 wondering if people
   had
   suggestions on things they would like to see
 from Mahout.
   For
   starters, I'm going to put up a public
 image containing
   0.1 when it's
   ready, but I'd also like to wiki up some
 examples.
   I.e. go here, get
   this data, put it in this format and then do
 X.  We have
   some simple
   examples, but I think it would be cool to
 show how to do
   something a
   bit more complex, like maybe classify web
 pages according
   to DMOZ or
   to cluster on stuff, or maybe put in a large
 traveling
   salesman
   problem using the GA stuff Deneche did.
  
   Thoughts?  Anyone else interested in setting
 up some use
   cases?
  
   -Grant
  
  
  
  
 
 
 
 
 -- 
 ted





Re: Hardcoded paths in examples

2008-09-22 Thread deneche abdelhakim
 From that perspective, I guess I think it's suboptimal
 to depend on Hadoop Path objects here in the unit test, since the tests
 are not actually using Hadoop. That ought to be separated.

But even if the test code is not using Hadoop, it's still calling Hadoop code: 
Mappers, Reducers, and all the happy family :)

 Then you have test code depending on external scripts -- in
 two places. Which would lead me to the conclusion that it's
 best, overall, if these tests are self-contained and cause their 
 dependent data to be generated. I am not familiar with this code. Is 
 that easy? infeasible?

It's feasible...but not easy :(
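
[Editor's note: a minimal sketch of Karl's getResourceAsStream suggestion, which resolves the test data off the classpath so the same code works under both the Ant and Maven layouts; the file name inside the wdbc directory is made up for illustration.]

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ResourceLoading {
  public static void main(String[] args) throws Exception {
    // Instead of new Path("build/test-classes/wdbc"), resolve the test data
    // off the classpath; this works identically whether the resources were
    // copied by Ant (build/test-classes) or Maven (target/test-classes).
    InputStream in = ResourceLoading.class.getResourceAsStream("/wdbc/wdbc.data");
    if (in == null) {
      throw new IllegalStateException("wdbc data not on the classpath");
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine()); // first record of the dataset
    reader.close();
  }
}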

--- On Mon 22.9.08, Sean Owen [EMAIL PROTECTED] wrote:

 From: Sean Owen [EMAIL PROTECTED]
 Subject: Re: Hardcoded paths in examples
 To: mahout-dev@lucene.apache.org
 Date: Monday, September 22, 2008, 11:47
 I don't think that is necessary. I think it is fair to
 assume that one
 is running the tests from within the distribution directory
 and not
 have to resort to that abstraction.
 
 From that perspective, I guess I think it's suboptimal
 to depend on
 Hadoop Path objects here in the unit test, since the tests
 are not
 actually using Hadoop. That ought to be separated.
 
 But that aside, that still leaves the issue of whether one
 can depend
 on some build products existing in a test. I don't
 think it's a bad
 thing, as long as the Ant script ensures those build
 products exist.
 Then the question is, can you express the same dependency
 in Maven? I
 think you can?
 
 Then you have test code depending on external scripts -- in
 two
 places. Which would lead me to the conclusion that it's
 best, overall,
 if these tests are self-contained and cause their dependent
 data to be
 generated. I am not familiar with this code. Is that easy?
 infeasible?
 
 Sean
 
 On Mon, Sep 22, 2008 at 9:59 AM, Karl Wettin
 [EMAIL PROTECTED] wrote:
  Hmm, if this is test/resources, shouldn't they be
 accessed using
  getResourceAsStream instead? I'll see what I can
 do.
 
  22 sep 2008 kl. 10.15 skrev Sean Owen:
 
  Oh OK. Well +1 to using the same path, yes. If it
 is easier to adapt
  to Maven's location, OK.
 
  On 9/22/08, deneche abdelhakim
 [EMAIL PROTECTED] wrote:
 
  Dumb question: why does example code
 depend on test code?
  Can this be solved by severing that
 dependency?
 
  It's not from the example code but from
 the example's test code. In this
  case the example's tries to access a
 directory (wdbc) put into
  test/resources. The content of test/resources
 is automatically copied by
  ant
  into build/test-classes/
 
 
  That means that the maven test builds will
 fail unless
  the ant test was first executed. I suppose
 that's OK, but
  I'd prefere if we could come up with
 some fix. I suppose the simplest
  one
  would be to use the maven file paths
 (target/test-classes).
 
  But in this case, the ant test builds will
 probably fail unless the maven
  test was first executed ! Why not use the same
 file path for both ant and
  maven, or at least copy the content of the
 ressources in a common
  directory...
 
  --- On Sun 21.9.08, Sean Owen [EMAIL PROTECTED] wrote:
 
  From: Sean Owen [EMAIL PROTECTED]
  Subject: Re: Hardcoded paths in examples
  To: mahout-dev@lucene.apache.org
  Date: Sunday, September 21, 2008, 18:07
  Dumb question: why does example code
 depend on test code?
  Can this be
  solved by severing that dependency?
 
  On 9/21/08, Karl Wettin
 [EMAIL PROTECTED]
  wrote:
 
  There are a bunch of hardcoded paths
 in the tests of
 
  the examples
 
  module. Stuff like this:
 
  Path inpath = new Path("build/test-classes/wdbc");
 
  That means that the maven test builds
 will fail unless
 
  the ant test
 
  was first executed. I suppose
 that's OK, but
 
  I'd prefer it if we could
 
  come up with some fix. I suppose the
 simplest one
 
  would be to use the
 
  maven file paths
 (target/test-classes).
 
 
  karl
 
 
 
 
 
 
 





Re: Mahout on EC2

2008-09-20 Thread deneche abdelhakim
Sounds cool :)

I'll do the TSP part, but it may take some time because I'm a bit busy (PhD 
administrative stuff).

There are many large TSP benchmarks available, and it seems there is a common 
file format for them, TSPLIB 
(http://www.informatik.uni-heidelberg.de/groups/comopt/software/TSPLIB95/DOC.PS).
 So the TSP example should be modified to load those benchmark files.
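
As a rough sketch, loading a EUC_2D instance could look like this (TspLibReader is a made-up name, and real TSPLIB files have more header fields than this handles):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.ArrayList;
  import java.util.List;

  public class TspLibReader {
    /** Reads the node coordinates of a EUC_2D TSPLIB instance. */
    public static List<double[]> readCoords(String path) throws Exception {
      List<double[]> coords = new ArrayList<double[]>();
      BufferedReader in = new BufferedReader(new FileReader(path));
      String line;
      boolean inCoords = false;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.equals("NODE_COORD_SECTION")) { inCoords = true; continue; }
        if (line.equals("EOF")) break;
        if (inCoords && line.length() > 0) {
          // each coordinate line is: <index> <x> <y>
          String[] tokens = line.split("\\s+");
          coords.add(new double[] { Double.parseDouble(tokens[1]),
                                    Double.parseDouble(tokens[2]) });
        }
      }
      in.close();
      return coords;
    }
  }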

I have a question about EC2: can you run Java Swing programs and see the GUI 
(the TSP example has a Swing GUI), or should we make a console version of the 
example?

--- On Fri 19.9.08, Grant Ingersoll [EMAIL PROTECTED] wrote:

 From: Grant Ingersoll [EMAIL PROTECTED]
 Subject: Mahout on EC2
 To: mahout-dev@lucene.apache.org
 Date: Friday 19 September 2008, 17:18
 Amazon has generously donated some credits, so I plan on
 putting  
 Mahout up and doing some testing.  Was wondering if people
 had  
 suggestions on things they would like to see from Mahout. 
 For  
 starters, I'm going to put up a public image containing
 0.1 when it's  
 ready, but I'd also like to wiki up some examples. 
 I.e. go here, get  
 this data, put it in this format and then do X.  We have
 some simple  
 examples, but I think it would be cool to show how to do
 something a  
 bit more complex, like maybe classify web pages according
 to DMOZ or  
 to cluster on stuff, or maybe put in a large traveling
 salesman  
 problem using the GA stuff Deneche did.
 
 Thoughts?  Anyone else interested in setting up some use
 cases?
 
 -Grant





Re: FYI Cloud Computing Resources

2008-09-03 Thread deneche abdelhakim
I came across the following competition:

http://www.netflixprize.com/index


It's about recommender systems, so I think it's Taste stuff. The training 
dataset consists of more than 100M ratings.


- Original Message -
From: Josh Myer [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Wednesday, 30 July 2008, 18:19:25
Subject: Re: FYI Cloud Computing Resources

On Wed, Jul 30, 2008 at 11:26:29AM -0400, Grant Ingersoll wrote:
 http://research.yahoo.com/node/2328
 
 It _MAY_ (stressed, emphasized, etc.) be possible for Mahouters (or  
 are we just Mahouts?) to get some access to these resources.  One big  
 question is where can we get some fairly large data sets (large, but  
 not super large, I think, but am not sure)
 
 If you have ideas, etc. please let us know.
 

It's worth plugging (theinfo), http://theinfo.org/.  It's a project to
collect references to datasets, and may help here.  Unfortunately, it
seems to be laggy at the moment.  I'll poke Aaron about that =)

HtH,
-- 
Josh Myer
[EMAIL PROTECTED]






Re: Going to move us to Hadoop 0.18.0, Java 6

2008-08-31 Thread deneche abdelhakim
Go on, I will do my part, I just hope GA likes Java 6 :P



- Original Message -
From: Sean Owen [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Saturday, 30 August 2008, 21:26:45
Subject: Re: Going to move us to Hadoop 0.18.0, Java 6

So I should hold off on committing changes that use Java 6? Let me
know when you're ready, or if it's going to be difficult to move to 6
for you.

I also wasn't totally clear whether the folks doing the 0.1 release
want to stay on Java 5 for that or not.

On Tue, Aug 26, 2008 at 3:56 AM, Xiance SI(司宪策) [EMAIL PROTECTED] wrote:
 I have to get Leopard first, now using Tiger, the newest possible Java is
 5.0.



  


Re: the Job jar file doesn't contain the core jar in it.

2008-08-17 Thread deneche abdelhakim
You should run the job task in the examples directory (ant job). It will 
generate a file (in examples/build) called 
apache-mahout-examples-0.1-dev.job; this is the jar (even though it ends with 
.job) that contains both the examples and the core.
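
So something like the following should work (using the synthetic control canopy Job as just one example of a main class):

  cd examples
  ant job
  ~/hadoop-0.17.0/bin/hadoop jar build/apache-mahout-examples-0.1-dev.job \
      org.apache.mahout.clustering.syntheticcontrol.canopy.Job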


- Original Message -
From: Robin Anil [EMAIL PROTECTED]
To: mahout-dev@lucene.apache.org
Sent: Sunday, 17 August 2008, 19:55:58
Subject: the Job jar file doesn't contain the core jar in it.

Any idea how the examples should be run?

Robin



  


Mahout.GA, what comes next?

2008-07-08 Thread deneche abdelhakim
Now that the Class Discovery (CD) example is up and running, it's time to think 
about what to do next. I already have some ideas, but I want to check with the 
community first.

I see two possible ways ahead of me:

A. Enhance the (CD) example
 a1. handle categorical attributes
 a2. generate dataset info (attribute types and ranges), possibly using a small 
map-reduce program
 a3. multi-class classification, instead of binary classification

B. Investigate other distributed models, for example the island model.

Any other suggestions are appreciated.


  

Re: Problems running the examples

2008-07-01 Thread deneche abdelhakim
The tests run fine; it's the examples that didn't run correctly. But I found a 
way to run them, by playing with the HADOOP_HEAPSIZE option in 
conf/hadoop-env.sh: it defaults to 1000 MB, I just set it to 128 and now it's 
OK...
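
For reference, the change amounts to something like this in conf/hadoop-env.sh:

  # The maximum amount of heap to use, in MB (the default is 1000).
  export HADOOP_HEAPSIZE=128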

By the way, the Taste examples are missing a dependency (ejb.jar); is there a 
good reason not to include it (license issues, perhaps)?

--- On Tue 1.7.08, Jeff Eastman [EMAIL PROTECTED] wrote:
From: Jeff Eastman [EMAIL PROTECTED]
Subject: Re: Problems running the examples
To: mahout-dev@lucene.apache.org
Date: Tuesday 1 July 2008, 17:32

I had to use -Xmx256m to get the tests to run without heap problems.
Jeff


deneche abdelhakim wrote:
 I've been using Eclipse for all my testing and everything just works fine.
But now I want to build and test the examples using ant. I managed to modify
the build.xml to generate the examples job. But when I run one of the examples
(for example: ...clustering.syntheticcontrol.canopy.Job) I get the following
errors:

 $ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
 Error occurred during initialization of VM
 Could not reserve enough space for object heap
 Could not create the Java virtual machine.

 any hints on how to solve this ???




  

Re: getting started with mahout, failing tests

2008-06-21 Thread deneche abdelhakim
I just did a fresh checkout and all the tests are successful!!!


--- On Sat 21.6.08, Allen Day [EMAIL PROTECTED] wrote:

 From: Allen Day [EMAIL PROTECTED]
 Subject: getting started with mahout, failing tests
 To: mahout-dev@lucene.apache.org
 Date: Saturday 21 June 2008, 8:00
 Hi,
 
 I finally had a chance to get mahout checked out and built
 today.  I
 want to get up to speed so I can start using/contributing.
 
 I can get the compile target to build
 successfully, but I'm getting
 errors from the test target.
 
 [junit] Test
 org.apache.mahout.clustering.canopy.TestCanopyCreation
 FAILED
 [junit] Test org.apache.mahout.matrix.TestSparseMatrix
 FAILED
 [junit] Test org.apache.mahout.matrix.TestSparseVector
 FAILED
 
 Is this normal for now?
 
 -Allen


  


Re: GSOC Mahout.GA, next steps?

2008-06-09 Thread deneche abdelhakim
I found a cool introduction to evolutionary algorithms; I added it to the wiki, 
in case anyone is interested...


--- On Wed 28.5.08, Grant Ingersoll [EMAIL PROTECTED] wrote:

 From: Grant Ingersoll [EMAIL PROTECTED]
 Subject: Re: GSOC Mahout.GA, next steps?
 To: mahout-dev@lucene.apache.org
 Date: Wednesday 28 May 2008, 13:11
 This sounds good.  I don't know a lot about GAs, so if
 others have  
 insight, that would be great.  It would also be handy if
 you could put  
 up a section on the Wiki about GAs and maybe post some
 links to basic  
 papers there, so people that aren't familiar can go do
 some background  
 reading.
 
 I will try to get to MAHOUT-56 this week, but others can
 jump in and  
 review as well.
 
 -Grant
 
 On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:
 
  In a GA there are many things that can be distributed, and one
  should always start with the most compute-demanding task. This is
  very problem dependent, but in most cases the fitness evaluation
  function (FEF) is the part to distribute.

  The FEF evaluates every single individual in the population, and it
  may need some data (D) to do so. For example, in the Traveling
  Salesman Problem, the problem is defined by a set of cities and the
  distances between them; the FEF needs those distances to evaluate
  the individuals.

  I see 2 ways to distribute the FEF:

  A. If the data D is not big and can fit on each single cluster
  node, then the easiest solution is to use each Mapper to evaluate
  one individual and to pass the data D to all the mappers (using
  some Job parameter or the DistributedCache). The input of the job
  is the population of individuals. For someone used to working with
  Watchmaker, solution A is straightforward: only one line of code
  needs to change.

  B. If the data D is really big and spans multiple nodes, then the
  FEF should be written in the form of Mappers-Reducers; the
  population of individuals is passed to all the mappers (again using
  the DistributedCache or a Job parameter), and the data D is now the
  input of the Job.

  [MAHOUT-56] contains a possible implementation for solution A. Now
  I should start thinking about solution B, and all I need is a
  problem that uses very big datasets. I already proposed one in my
  GSoC proposal: it consists of using a Genetic Algorithm to find
  good binary classification rules for a given dataset. But I am open
  to any other suggestion.
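
For illustration, a minimal sketch of what a mapper for solution A above might look like with the Hadoop mapred API (the class and its evaluate() placeholder are hypothetical, not the actual MAHOUT-56 code):

  import java.io.IOException;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Each map() call evaluates one individual; the input of the job is
  // the population, one serialized individual per line.
  public class FitnessEvaluationMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    public void map(LongWritable key, Text individual,
        OutputCollector<LongWritable, DoubleWritable> output,
        Reporter reporter) throws IOException {
      double fitness = evaluate(individual.toString());
      output.collect(key, new DoubleWritable(fitness));
    }

    private double evaluate(String individual) {
      // placeholder fitness; the data D would be read in configure(),
      // e.g. from a job parameter or the DistributedCache
      return individual.length();
    }
  }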
 


OutOfMemory Exception!

2008-06-02 Thread deneche abdelhakim
I checked the latest version of Mahout (rev. 662372) and got the following 
exception in many tests (the list of these tests is at the end of this post):

java.io.IOException: Job failed! 

the following message is printed in System.err :

java.lang.OutOfMemoryError: Java heap space

I think it's somehow caused by using Hadoop 0.17.0, as my own tests run 
perfectly with Hadoop 0.16.4.

Here are the tests that don't pass:

org.apache.mahout.clustering.canopy.TestCanopyCreation:
 testCanopyGenManhattanMR
 testCanopyGenEuclideanMR
 testClusteringManhattanMR
 testClusteringEuclideanMR
 testClusteringManhattanMRWithPayload
 testClusteringEuclideanMRWithPayload
 testUserDefinedDistanceMeasure

org.apache.mahout.clustering.meanshift.TestMeanShift
 testCanopyEuclideanMRJob



Re: Gene Expression Programming in Mahout

2008-06-02 Thread deneche abdelhakim
I am working on using Hadoop to distribute the fitness evaluation of 
(hopefully) any problem written using the Watchmaker framework 
[https://watchmaker.dev.java.net/]. I already provided a patch with some code 
[http://issues.apache.org/jira/browse/MAHOUT-56] that lets you distribute the 
evaluation of the population over the cluster (each node will evaluate a subset 
of the population).

Thank you for the links. I will take a look at some papers, but in the 
meantime, could you please tell me: which part of the GEP algorithm needs to be 
distributed (I'm guessing it's the fitness evaluation part)?

--- On Mon 2.6.08, juber patel [EMAIL PROTECTED] wrote:

 From: juber patel [EMAIL PROTECTED]
 Subject: Re: Gene Expression Programming in Mahout
 To: mahout-dev@lucene.apache.org, [EMAIL PROTECTED]
 Date: Monday 2 June 2008, 19:34
 yes, GEP is related to GA and I feel it provides a more
 generic way of
 defining populations, fitness functions etc. with the
 possibility of a
 wide range of grammars for the encoding of the Individual.
 This
 flexibility can be hugely effective when we can use the
 computing
 power of clusters.
 
 here is some biblio:
 
 http://www.gene-expression-programming.com/GEPBiblio.asp
 
 
 Deneche,
 
 could you just give me an idea about your work so far?
 
 juber
 
 
 On Mon, Jun 2, 2008 at 11:48 AM, Isabel Drost
 [EMAIL PROTECTED] wrote:
  On Sunday 01 June 2008, juber patel wrote:
  I have been lurking on this list for some time
 now. I would really
  like to contribute to Mahout. As I had discussed
 earlier, I would like
  to include my code, Amiba
 (http://amiba.sourceforge.net/) in Mahout. I
  feel this is the right place for that code.
 
  Sounds great!
 
 
  It implements Gene Expression Programming but it
 is sequential. I
  would like to adapt it for Hadoop and for that I
 am reading up on
  Hadoop.
 
  If you have any questions, feel free to ask us or post
 your questions to the
  Hadoop mailinglists.
 
 
  Could you tell me again if this fits well with
 Mahout. And if you
  don't mind including it in Mahout.
 
  Sure. You might want to coordinate with Deneche
 Abdelhakim who is working in
  GA for GSoC - as I understand, Gene Expression
 Programming is related to GA?
 
 
  Isabel
 
 
 
 
 
 
 -- 
 Juber Patel http://juberpatel.googlepages.com



Re: GSOC Mahout.GA, next steps?

2008-05-28 Thread deneche abdelhakim
 Ted Dunning [EMAIL PROTECTED] wrote:

 Conceptually, at least, it would be good to have the option for fitness
 functions to be expressed as map-reduce programs.  Unfortunately, having
 mappers spawn MR programs runs the real risk of dead-lock, especially on
 less than grandiose clusters.
 
 To me, that indicates that if the fitness function is nasty enough to
 require map-reduce to compute, then either:
 
 a) the executive that manages the population and generates mutations 
 should be written in sequential form
 
 or
 
 b) the evolutionary algorithm has to be written in such a way as to be 
 able to manipulate a map-reduce program so that evolution and evaluation  
 can be merged into a single (composite) map-reduce program.
 
 I vote for (a) because if fitness computations are so complex as to need  
 MR, then the cost of sorting the population will be negligible.


(a) has another advantage too: one can start by writing one's program in 
sequential form, test it with a small dataset, and then rewrite only the 
fitness function in map-reduce form.


 This raises the question of how the population should be communicated to  
 the parallel evaluator.


I don't know if there are many ways to do it in Hadoop, but how about writing 
the population to a file and passing it with the DistributedCache?
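
A rough sketch of that idea (the file name and the line-per-individual encoding are just placeholders):

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public class PopulationDistribution {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(PopulationDistribution.class);
      FileSystem fs = FileSystem.get(conf);
      // write the population to the DFS, one individual per line
      Path popFile = new Path("population.txt");
      FSDataOutputStream out = fs.create(popFile);
      out.writeBytes("individual-1\nindividual-2\n");
      out.close();
      // make the population file available locally on every node
      DistributedCache.addCacheFile(popFile.toUri(), conf);
      // ... set the mapper/reducer classes and submit the job here ...
    }
  }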

-- 
abdelhakim



Re: Thoughts on timeline for first release?

2008-05-21 Thread deneche abdelhakim
UCI: http://archive.ics.uci.edu/ml/


--- On Wed 21.5.08, Jeff Eastman [EMAIL PROTECTED] wrote:

 From: Jeff Eastman [EMAIL PROTECTED]
 Subject: Re: Thoughts on timeline for first release?
 To: mahout-dev@lucene.apache.org
 Date: Wednesday 21 May 2008, 17:10
 Does anybody have some links to datasets we can use for
 clustering 
 examples? I'm thinking we could publish an EC2 AMI that
 includes Hadoop 
 and Mahout, along with a script to deploy it on a cluster,
 upload the 
 examples and run clustering on it. Is that too ambitious?
 I'm kinda 
 hoping that we can use 0.17 which advertises simpler EC2
 deployment than 
 0.16. If that won't meet our schedule then maybe I
 should work through 
 the 0.16 deployment.
 
 Jeff
 
 Grant Ingersoll wrote:
  I was thinking we should get the Taste stuff in (seems
 to be pretty 
  close to done) and I would like to get Mahout-9 (Naive
 Bayes) in.  
  This would give us a pretty nice release, I think. 
 Namely, a couple 
  of clustering implementations, a classifier, and, of
 course, Taste.  I 
  think I can finish up my part in the next week or so. 
 Then, we will 
  need to start to figure out all the fun of releases
 (signatures, 
  notices.txt, etc.)  I'd also like to see us have
 an easy to use demo 
  of the clustering stuff, but it is all right if we
 don't.
 
  -Grant
 
  On May 21, 2008, at 1:23 AM, Sean Owen wrote:
 
  Just curious, what are people thinking about the
 timeline for a first,
  very early release, like an 0.1 release? any open
 tasks that I could
  pick up to help?
 
  Without rushing anything, I'm keen to retire
 my current project site
  and forward everybody that's interested to
 Mahout. As long as there's
  a .jar distro someone can pick up and use,
 that's cool.
 
  Sean
 
 
 



About Contributing

2008-05-18 Thread deneche abdelhakim
As part of my GSoC project I started adapting one of the Watchmaker examples 
(TSP) for use with Mahout. I believe the next step is to open a Jira issue and 
post an svn patch, isn't it?

I also did a fresh checkout of Mahout, ran ant test in the core
directory, and got a wonderful Tests failed even before I added my own
code :(  The test that fails is the following:

org.apache.mahout.cf.taste.impl.LoadTest.testItemLoad()

It seems that it takes more than 120 sec (the allowedTimeSec specified in the 
test) to load.

I wonder, before I start hitting the keyboard with my head, whether it is just 
normal that this test doesn't pass!!!



RE: Google Summer of Code

2008-04-21 Thread deneche abdelhakim
Hi Robin, 

I am very happy that I've been accepted, thanks to the Mahout community who 
kindly commented on my draft.

So we are four students; that's cool. I wish us good work and great fun this 
summer.

Hakim


Robin Anil [EMAIL PROTECTED] wrote: Hi Everyone,
  This is one of those days where I wake up and see that I
have got accepted to GSoC with Mahout (:32-all-out:). I am really excited
to kick-start the work. I know I have a lot to understand in terms of coding
practices and the whole workflow/process. And I would like to congratulate and
say hi to my fellow GSoC'ers Farid, Yun and Abdel, hi to my mentor Ian
Holsman, and to the rest of the community.

I am usually online on Google Talk; if you use it, do add me:
[EMAIL PROTECTED]

Cheers and Good Day
Robin



RE: About the Mahout.GA Comment

2008-04-12 Thread deneche abdelhakim
The number of running algorithms doesn't depend on the number of processors; in 
fact, this kind of algorithm is used even when there is only a single processor, 
because of its good search properties. You can imagine it as a single big GA 
with a distributed population, where each individual can have its own set of 
operators.

Abdel Hakim

Ted Dunning wrote:
  
   I think it is a very bad idea to tie the algorithm to the number of
   processors being used in this way.  A program should produce identical
   results on any machine, subject only to PRNG seeding issues.
On 4/11/08 8:52 PM, deneche abdelhakim  wrote:

 And there are other reasons to distribute a GA: for example, you may want to
 run a different version of the algorithm (a different population and perhaps a
 different set of operators) in each computing node, and from time to time some
 individuals will migrate from one node to another...this kind of distribution
 has proven to be more effective because it searches a larger space.



   

RE: About the Mahout.GA Comment

2008-04-12 Thread deneche abdelhakim
I don't know the exact term, but maybe I should have said computing process, 
so each processor (or computing node) can run many computing processes...

Ted Dunning [EMAIL PROTECTED] wrote: 
How is computing node not a processor?


On 4/12/08 9:26 PM, deneche abdelhakim  wrote:

 The number of running algorithms don't depend on the number of processors, in
 fact this kind of algorithms is used even if there is only one single
 processor because of its good search properties. You can imagine it as a
 single big GA with a distributed population and each individual can have its
 own set of operators.
 
 Abdel Hakim
 
 Ted Dunning wrote :
 
 I think it is a very bad idea to tie the algorithm to the number of
 processors being used in this way.  A program should produce identical
 results on any machine, subject only to PRNG seeding issues.
 On 4/11/08 8:52 PM, deneche abdelhakim  wrote:
 
 And there are other reasons to distribute a GA: for example, you may want to
 run a different version of the algorithm (a different population and perhaps
 a
 different set of operators) in each computing node, and from time to time
 some
 individuals will migrate from one node to another...this kind of distribution
  has proven to be more effective because it searches a larger space.
 
 
 




   

About the Mahout.GA Comment

2008-04-11 Thread deneche abdelhakim
Hi Grant, 

You wrote the following comment on my GSoC proposal:

 Could someone w/ a little more GA knowledge comment on the use of 
 WatchMaker?  What I wonder is if it is possible to distribute some of the 
 watchmaker functionality? 

Do you want to know if there are other ways to distribute a GA?

 May not be needed for this proposal, but I am curious as to how much work is 
done in Watchmaker vs. the actual fitness function.

I don't understand...

Abdel Hakim Deneche

   

GSoC Evolutionary Algorithm Proposal

2008-03-27 Thread deneche abdelhakim
I've written my proposal, and because I can no longer change it after I submit 
it to GSoC, I am posting it here first.
If anyone has suggestions, you are welcome.
I will wait until Saturday morning to submit it to GSoC.

**
Application for Summer of Code 2008: Mahout Project

Deneche Abdel Hakim

Codename Mahout.GA


I. Synopsis

I will add a genetic algorithm (GA) for binary classification on large datasets 
to the Mahout project. To save time I will use an existing framework for 
genetic algorithms, WatchMaker [WatchMaker], which has an Apache Software 
License. I will also add a parallelized measure that indicates the quality of 
classification rules on a given dataset; this measure will be available 
independently of the GA. And if I have enough time, I will make the GA more 
generic and apply it to a different problem (multiclass classification).


II. Project

A GA works by evolving a population of individuals toward a desired goal. To 
get a satisfying solution, the GA needs to run thousands of iterations with 
hundreds of individuals. For each iteration and each individual, a fitness is 
calculated; it indicates how close that individual is to the desired solution. 
The main advantage of GAs is their ability to find solutions to problems given 
only a fitness measure (and, of course, sufficient CPU power); this is 
particularly helpful when the problem is complex and no mathematical solution 
is available.

My primary goal is to implement the GA described in [GA]. It uses a fitness 
function that is easy to implement and can benefit from the Map-Reduce strategy 
to exploit distributed computing (when the training dataset is very large). It 
will be available as a ready-to-use tool (Mahout.GA) that discovers binary 
classification rules for any given dataset. Concretely, the main program will 
launch the GA using WatchMaker; each time the GA needs to evaluate the fitness 
of the population, it calls a specific class given by us, and this class will 
configure and launch a Hadoop Job on a distributed cluster.

My secondary goal is to make Mahout.GA problem-independent, thus allowing us to 
use it for different problems such as multiclass classification, optimization, 
or clustering. This will be done by implementing a ready-to-use generic fitness 
function for WatchMaker that internally calls Hadoop. As a proof of concept I 
will use it for multiclass classification (if I don't run out of time of 
course!).
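
To make the WatchMaker side of this concrete, the glue class could look roughly like this (a sketch only: HadoopFitnessEvaluator and runHadoopJob() are hypothetical names, and the actual job-launching code is elided):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.uncommons.watchmaker.framework.FitnessEvaluator;

  public class HadoopFitnessEvaluator implements FitnessEvaluator<String> {

    private final Map<String, Double> cache = new HashMap<String, Double>();

    public double getFitness(String candidate,
        List<? extends String> population) {
      // evaluate the whole generation in one Hadoop job, then serve
      // the cached result for each individual candidate
      if (!cache.containsKey(candidate)) {
        cache.clear();
        cache.putAll(runHadoopJob(population));
      }
      return cache.get(candidate);
    }

    public boolean isNatural() {
      return true; // higher fitness is better
    }

    private Map<String, Double> runHadoopJob(
        List<? extends String> population) {
      // hypothetical: write the population to the DFS, configure and
      // launch the job, and read back one fitness per individual
      Map<String, Double> results = new HashMap<String, Double>();
      for (String individual : population) {
        results.put(individual, 0.0); // dummy values in this sketch
      }
      return results;
    }
  }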


III. Profit for Mahout

1. The GA will be integrated into Mahout as a ready-to-use rule-discovery tool 
for binary classification;
2. Explore the integration of existing frameworks with Mahout, for example how 
to design the program in a way that the framework libraries will not be needed 
on the slave nodes (technically it's feasible, but I still need to learn how to 
do it);
3. The parallelized fitness function can be used independently of Mahout.GA. 
It's a good measure of the quality of binary classification rules;
4. Simplify the process of using Mahout.GA for other problems. The user will 
still need to design the solution representation and to implement a fitness 
function, but all the Hadoop stuff should be hidden or at least made simpler;
5. Apply the generalized Mahout.GA to multiclass classification and write a 
corresponding tutorial that explains how to use Mahout.GA to solve new problems.


IV. Success Criteria

Main goals
  1. Implement the parallelized fitness function described in [GA] and validate 
its results on a small dataset;
  2. Implement Mahout.GA for binary classification rule discovery. A simpler 
(not parallelized) version of this algorithm should also be implemented to 
validate the results of Mahout.GA;

Secondary goals
  1. Allow the parallelized fitness function to be used independently of 
Mahout.GA;
  2. Use Mahout.GA on a different problem (multiclass classification) and write 
a corresponding tutorial.


V. Roadmap

[April 14: accepted students known]
  1. Familiarize myself with Hadoop
     Modify one of the examples of Hadoop to simulate an iterative process. For 
each iteration, a new Job is executed with different parameters, and its 
results are imported back by the program.
  2. Implement the GA without parallelism
     a. Start by implementing the tutorial example that comes with WatchMaker;
     b. Implement my own Individual and Fitness function classes;
     c. Validate the algorithm using a small dataset, and find the parameters 
that will give acceptable results.
  3. Prepare whatever I may need in the development period
[May 26: coding starts]
  4. Implement the parallelized fitness function
     a. Use Hadoop Map-Reduce to implement it [2 weeks];
     b. Validate it on a small dataset [1 week].
  5. Implement Mahout.GA
     a. Write an intermediary component between WatchMaker and the parallelized 
fitness function. This component takes a population, configures and launches a 
Job, waits for its 

GSoC Evolutionary Algorithm Idea

2008-03-25 Thread deneche abdelhakim
Hi

I'm a PhD student in AI and adaptive systems; I have been working on 
evolutionary algorithms for the last 4 years. I implemented my own Artificial 
Immune System in Matlab and as a Java extension to Yale, and I also worked with 
a C++ framework for multi-objective optimization.

My project is to build a classification genetic algorithm in Mahout.

I've already done some research and found the following paper:

Discovering Comprehensible Classification Rules with a Genetic Algorithm

It's a genetic algorithm for binary classification. The fitness function (which 
iterates over the whole training dataset) can benefit from the Map-Reduce model 
of Hadoop.
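
As a rough sketch of how that could be laid out as a map-reduce (hypothetical class names, and a dummy predicate in place of real rule matching):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // The training set is the job input; each mapper applies every
  // candidate rule to its share of the records, and the reducer sums
  // the per-rule counts from which the fitness is computed.
  public class RuleFitnessJob {

    public static class RuleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, IntWritable> {

      // the candidate rules would be loaded in configure(), e.g. from
      // the DistributedCache; hard-coded here to keep the sketch short
      private final String[] rules = { "rule0", "rule1" };

      public void map(LongWritable key, Text record,
          OutputCollector<IntWritable, IntWritable> output,
          Reporter reporter) throws IOException {
        for (int i = 0; i < rules.length; i++) {
          int hit = matches(rules[i], record.toString()) ? 1 : 0;
          output.collect(new IntWritable(i), new IntWritable(hit));
        }
      }

      private boolean matches(String rule, String record) {
        return record.length() % 2 == 0; // dummy predicate
      }
    }

    public static class SumReducer extends MapReduceBase
        implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

      public void reduce(IntWritable ruleId, Iterator<IntWritable> counts,
          OutputCollector<IntWritable, IntWritable> output,
          Reporter reporter) throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
          sum += counts.next().get();
        }
        output.collect(ruleId, new IntWritable(sum));
      }
    }
  }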

I plan to use an existing open source framework for the genetic algorithm; the 
framework should take care of all the GA stuff, and I will be left with:
. the representation of individuals, as described in the article
. the fitness function that uses Hadoop

This algorithm can also be adapted to work with more than two classes... but 
that's another story

What do you think about it ?


Abdel Hakim Deneche
Mentouri University of Constantine, Algeria




  