Re: GSoC 2014 mentor request

2014-03-21 Thread Tommaso Teofili
Thanks all, just subscribed to the mentors list.
Regards,
Tommaso


2014-03-21 10:23 GMT+01:00 Michael McCandless luc...@mikemccandless.com:

 ACK from Lucene PMC.

 I'm also CC'ing ment...@community.apache.org (Tommaso, you should
 subscribe if you haven't already).

 Thanks Tommaso!  Sad to have too many students/proposals and too few
 mentors ...

 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Mar 21, 2014 at 3:43 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Dear Lucene PMC,
 
  please acknowledge my request to become a mentor for Google Summer of
  Code 2014 projects for Apache Lucene.
 
  My Melange username is tommaso.
 
  Thanks and regards,
  Tommaso



Re: GSoC 2014 mentor request

2014-03-21 Thread Michael McCandless
You should also subscribe to code-awards@a.o.

See http://community.apache.org/gsoc.html for details ...

Thanks for being a mentor!  We have far too few mentors in Lucene/Solr
unfortunately.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Mar 21, 2014 at 6:23 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 Thanks all, just subscribed to the mentors list.
 Regards,
 Tommaso


 2014-03-21 10:23 GMT+01:00 Michael McCandless luc...@mikemccandless.com:

 ACK from Lucene PMC.

 I'm also CC'ing ment...@community.apache.org (Tommaso, you should
 subscribe if you haven't already).

 Thanks Tommaso!  Sad to have too many students/proposals and too few
 mentors ...

 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Mar 21, 2014 at 3:43 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Dear Lucene PMC,
 
  please acknowledge my request to become a mentor for Google Summer of
  Code 2014 projects for Apache Lucene.
 
  My Melange username is tommaso.
 
  Thanks and regards,
  Tommaso



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC 2014 mentor request

2014-03-21 Thread Tommaso Teofili
2014-03-21 11:35 GMT+01:00 Michael McCandless luc...@mikemccandless.com:

 You should also subscribe to code-awards@a.o.


strangely this resulted in the qmail-send program replying:

 code-awards-subscr...@apache.org:
 This mailing list has moved to mentors at community.apache.org.

so I guess mentors@ is enough.



 See http://community.apache.org/gsoc.html for details ...

 Thanks for being a mentor!  We have far too few mentors in Lucene/Solr
 unfortunately.


right, if I read Jira correctly we have more than 20 proposals!

Thanks,
Tommaso



 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Mar 21, 2014 at 6:23 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Thanks all, just subscribed to the mentors list.
  Regards,
  Tommaso
 
 
  2014-03-21 10:23 GMT+01:00 Michael McCandless luc...@mikemccandless.com
 :
 
  ACK from Lucene PMC.
 
  I'm also CC'ing ment...@community.apache.org (Tommaso, you should
  subscribe if you haven't already).
 
  Thanks Tommaso!  Sad to have too many students/proposals and too few
  mentors ...
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 
  On Fri, Mar 21, 2014 at 3:43 AM, Tommaso Teofili
  tommaso.teof...@gmail.com wrote:
   Dear Lucene PMC,
  
   please acknowledge my request to become a mentor for Google Summer of
   Code 2014 projects for Apache Lucene.
  
   My Melange username is tommaso.
  
   Thanks and regards,
   Tommaso
 
 



Re: GSoC 2014 mentor request

2014-03-21 Thread Michael McCandless
Ahh... the list must have moved.  Good to know :)

Mike McCandless

http://blog.mikemccandless.com


On Fri, Mar 21, 2014 at 7:04 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:

 2014-03-21 11:35 GMT+01:00 Michael McCandless luc...@mikemccandless.com:

 You should also subscribe to code-awards@a.o.


 strangely this resulted in the qmail-send program replying:

 code-awards-subscr...@apache.org:
 This mailing list has moved to mentors at community.apache.org.

 so I guess mentors@ is enough.



 See http://community.apache.org/gsoc.html for details ...

 Thanks for being a mentor!  We have far too few mentors in Lucene/Solr
 unfortunately.


 right, if I read Jira correctly we have more than 20 proposals!

 Thanks,
 Tommaso



 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Mar 21, 2014 at 6:23 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Thanks all, just subscribed to the mentors list.
  Regards,
  Tommaso
 
 
  2014-03-21 10:23 GMT+01:00 Michael McCandless
  luc...@mikemccandless.com:
 
  ACK from Lucene PMC.
 
  I'm also CC'ing ment...@community.apache.org (Tommaso, you should
  subscribe if you haven't already).
 
  Thanks Tommaso!  Sad to have too many students/proposals and too few
  mentors ...
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 
  On Fri, Mar 21, 2014 at 3:43 AM, Tommaso Teofili
  tommaso.teof...@gmail.com wrote:
   Dear Lucene PMC,
  
   please acknowledge my request to become a mentor for Google Summer of
   Code 2014 projects for Apache Lucene.
  
   My Melange username is tommaso.
  
   Thanks and regards,
   Tommaso
 
 






Re: GSoC

2014-03-12 Thread Michael McCandless
Hi Ivan,

It's best to just add a comment onto LUCENE-466 with your
ideas/questions specific to that issue; other more general questions
should be sent to this dev list.

Since the big part of that issue (supporting minShouldMatch in
BooleanQuery) was already done, I think fixing query parsers to handle
it is important but isn't an entire GSoC project?  Or, perhaps it is
(we have quite a few query parsers now...).  But I think doing another
improvement in addition would be the right amount...

The mentor assignment is somewhat ad-hoc, sort of like dating ;)  You
should add comments to the issue, adding ideas, asking for
suggestions, asking if anyone will mentor, and then see if any
possible mentors respond.  I'm not sure why the issue is assigned to
Yonik; I don't think he's actually working on it.

You could try looking at past GSoC proposals at Apache Lucene to get an idea?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Mar 12, 2014 at 10:40 AM, Ivan Biggs
ivan.c.bi...@vanderbilt.edu wrote:
 Hello,
 My name is Ivan Biggs and I'm very interested in working with Lucene for my
 Google Summer of Code Project. I've read a lot of the relevant documentation and
 currently have my eye on the issue found here:
 https://issues.apache.org/jira/browse/LUCENE-466?filter=12326260jql=labels%20%3D%20gsoc2014%20AND%20status%20%3D%20Open

 My only concern is that I want to be sure that this issue would be
 considered adequate work for a project in and of itself or if I should plan on
 tackling perhaps two of these types of issues. Furthermore, if anyone could
 point me in the direction of a possible future mentor, it'd be much
 appreciated as I'm not quite sure why this particular issue has an assignee
 listed. Also, since Apache doesn't have any sort of template or similar
 guidelines for proposal submissions available, any general help or advice as
 to what sort of standards I should be adhering to would be great too!

 Thanks,
 Ivan




Re: GSoC

2014-03-12 Thread Ivan Biggs
First, thanks so much for getting me pointed in the right direction! I
assume you mean straight on Jira? Also do you have any clue where one would
be able to find past proposals for Lucene?
Thanks,
Ivan


On Wed, Mar 12, 2014 at 12:08 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Hi Ivan,

 It's best to just add a comment onto LUCENE-466 with your
 ideas/questions specific to that issue; other more general questions
 should be sent to this dev list.

 Since the big part of that issue (supporting minShouldMatch in
 BooleanQuery) was already done, I think fixing query parsers to handle
 it is important but isn't an entire GSoC project?  Or, perhaps it is
 (we have quite a few query parsers now...).  But I think doing another
 improvement in addition would be the right amount...

 The mentor assignment is somewhat ad-hoc, sort of like dating ;)  You
 should add comments to the issue, adding ideas, asking for
 suggestions, asking if anyone will mentor, and then see if any
 possible mentors respond.  I'm not sure why the issue is assigned to
 Yonik; I don't think he's actually working on it.

 You could try looking at past GSoC proposals at Apache Lucene to get an
 idea?

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Mar 12, 2014 at 10:40 AM, Ivan Biggs
 ivan.c.bi...@vanderbilt.edu wrote:
  Hello,
  My name is Ivan Biggs and I'm very interested in working with Lucene for
 my
  Google Summer of Code Project. I've read a lot of the relevant documentation
 and
  currently have my eye on the issue found here:
 
 https://issues.apache.org/jira/browse/LUCENE-466?filter=12326260jql=labels%20%3D%20gsoc2014%20AND%20status%20%3D%20Open
 
  My only concern is that I want to be sure that this issue would be
  considered adequate work for a project in and of itself or if I should plan
 on
  tackling perhaps two of these types of issues. Furthermore, if anyone
 could
  point me in the direction of a possible future mentor, it'd be much
  appreciated as I'm not quite sure why this particular issue has an
 assignee
  listed. Also, since Apache doesn't have any sort of template or similar
  guidelines for proposal submissions available, any general help or
 advice as
  to what sort of standards I should be adhering to would be great too!
 
  Thanks,
  Ivan






Re: GSoC

2014-03-12 Thread Michael McCandless
Sorry, yes, please add comments/ideas straight on the Jira issue, i.e.
https://issues.apache.org/jira/browse/LUCENE-466 in this case.

Hmm, I'm not sure how to find past proposals.  The links to these
proposals, e.g. from my past blog post, and from past Jira issues,
seem to be broken now.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Mar 12, 2014 at 1:25 PM, Ivan Biggs ivan.c.bi...@vanderbilt.edu wrote:
 First, thanks so much for getting me pointed in the right direction! I
 assume you mean straight on Jira? Also do you have any clue where one would
 be able to find past proposals for Lucene?
 Thanks,
 Ivan


 On Wed, Mar 12, 2014 at 12:08 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Hi Ivan,

 It's best to just add a comment onto LUCENE-466 with your
 ideas/questions specific to that issue; other more general questions
 should be sent to this dev list.

 Since the big part of that issue (supporting minShouldMatch in
 BooleanQuery) was already done, I think fixing query parsers to handle
 it is important but isn't an entire GSoC project?  Or, perhaps it is
 (we have quite a few query parsers now...).  But I think doing another
 improvement in addition would be the right amount...

 The mentor assignment is somewhat ad-hoc, sort of like dating ;)  You
 should add comments to the issue, adding ideas, asking for
 suggestions, asking if anyone will mentor, and then see if any
 possible mentors respond.  I'm not sure why the issue is assigned to
 Yonik; I don't think he's actually working on it.

 You could try looking at past GSoC proposals at Apache Lucene to get an
 idea?

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Mar 12, 2014 at 10:40 AM, Ivan Biggs
 ivan.c.bi...@vanderbilt.edu wrote:
  Hello,
  My name is Ivan Biggs and I'm very interested in working with Lucene for
  my
  Google Summer of Code Project. I've read a lot of the relevant documentation
  and
  currently have my eye on the issue found here:
 
  https://issues.apache.org/jira/browse/LUCENE-466?filter=12326260jql=labels%20%3D%20gsoc2014%20AND%20status%20%3D%20Open
 
  My only concern is that I want to be sure that this issue would be
  considered adequate work for a project in and of itself or if I should plan
  on
  tackling perhaps two of these types of issues. Furthermore, if anyone
  could
  point me in the direction of a possible future mentor, it'd be much
  appreciated as I'm not quite sure why this particular issue has an
  assignee
  listed. Also, since Apache doesn't have any sort of template or similar
  guidelines for proposal submissions available, any general help or
  advice as
  to what sort of standards I should be adhering to would be great too!
 
  Thanks,
  Ivan








Re: GSoC 2014 on LUCENE-466: Need QueryParser support for BooleanQuery.minNrShouldMatch

2014-02-28 Thread Michael McCandless
I think a good place to start is on the issue itself.

E.g. add a comment expressing that you're interested in this issue,
maybe summarize roughly what's entailed.  E.g., that issue is quite
old, and the first part of it (supporting minShouldMatch in BQ) has
already been done, so all that remains is fixing QueryParsers to
accept it, if they don't already?  I'm not sure, but just this part
may be too little for a whole summer?
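
For readers new to the issue: the minShouldMatch semantics being discussed can be sketched without Lucene at all. A document matches a boolean query of purely optional (SHOULD) clauses only if at least N of the clauses match. A minimal stdlib-only Java sketch (class and method names are mine, not Lucene's API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinShouldMatchSketch {

    // A document matches a purely-optional (SHOULD) boolean query
    // only if at least minShouldMatch of the query terms occur in it.
    static boolean matches(Set<String> docTerms, Set<String> shouldTerms,
                           int minShouldMatch) {
        int hits = 0;
        for (String t : shouldTerms) {
            if (docTerms.contains(t)) {
                hits++;
            }
        }
        return hits >= minShouldMatch;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("lucene", "query", "parser"));
        Set<String> q = new HashSet<>(Arrays.asList("lucene", "solr", "parser"));
        System.out.println(matches(doc, q, 2)); // true: "lucene" and "parser" match
        System.out.println(matches(doc, q, 3)); // false: only 2 of 3 terms match
    }
}
```

The remaining QueryParser work is then about exposing the N in query syntax, not about this matching logic itself.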


Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 27, 2014 at 10:16 PM, Tao Lin taolin.bn...@gmail.com wrote:
 Hello,

 My name is Tao Lin, a Chinese student from Beijing Normal University Zhuhai
 Campus. It's great to see that Han Jiang (also a Chinese student) has
 already contributed to Lucene in GSoC 2012 and 2013. Likewise, I'd like to
 participate in GSoC 2014, working on LUCENE-466 [1] (Need QueryParser
 support for BooleanQuery.minNrShouldMatch). Is this lucene dev mailing list
 the place to discuss gsoc projects? Who will be the mentor(s) for this
 project? I see the Assignee of LUCENE-466 is Yonik Seeley. How can I get in
 touch with him? Is LUCENE-466 still available as a GSoC 2014 student
 project?

 For a brief self-introduction, I've successfully completed 2 open source
 GSoC projects:
 - In GSoC 2011, I worked for Languagetool [2] to develop a Lucene-based
 indexing tool that makes it possible to run proof-reading rules against a
 large amount of text.
 - In GSoC 2012, I added the RDFa metadata support for Apache ODF Toolkit
 [3].

 Yours,
 Tao Lin

 [1] https://issues.apache.org/jira/browse/LUCENE-466

 [2] http://www.languagetool.org/gsoc2011/

 [3] https://issues.apache.org/jira/browse/ODFTOOLKIT-50







Re: GSOC 2013

2013-03-30 Thread Michael McCandless
Thanks Adrien!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Mar 29, 2013 at 1:49 PM, Adrien Grand jpou...@gmail.com wrote:
 Hi,

 Although I probably won't be able to mentor students next summer, I
 think it would be great to have students this year too. I modified
 open JIRA issues from last year's GSOC to add the gsoc2013 label so
 that students can find our project ideas.

 https://issues.apache.org/jira/issues/?jql=(project%20%3D%20%22Lucene%20-%20Core%22%20OR%20project%20%3D%20Solr)%20AND%20labels%20%3D%20gsoc2013

 --
 Adrien






Re: GSoC 2013

2013-03-20 Thread Tommaso Teofili
Hello Raimon,

depending on the focus of your master's thesis, Lucene / Solr may or may
not be the right project.
Basically, if your sentiment analysis topic is tied to information
retrieval (a very simple example: making a search engine which scores
documents by boosting positive ones) then it could be ok (in this case you
could leverage some classification capabilities Lucene has [1]); otherwise,
if your task is more focused on the extraction of such sentiments, then
other projects may fit better; see for example OpenNLP or Mahout or UIMA.

My 2 cents,
Tommaso

[1] :
http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr




2013/3/19 Raimon Bosch raimon.bo...@gmail.com

 Anyone interested?


 2013/3/18 Raimon Bosch raimon.bo...@gmail.com


 Hi all,

 I would be interested in doing a Google Summer of Code this year with
 Lucene or Solr. My master's thesis topic is about sentiment analysis; is there
 any research in this direction inside Solr and Lucene? If there is any
 other interesting topic, I would be open to discussing it.

 Thanks in advance,
 Raimon Bosch.





Re: GSoC 2013

2013-03-20 Thread Raimon Bosch
Hi Tommaso,

Yes, I agree. To use Lucene in this kind of project we would need to focus
on creating a sentiment ranking or improving the text classification
capabilities of Lucene. Integration with other projects might be interesting, too.

Thanks,
Raimon Bosch.

2013/3/20 Tommaso Teofili tommaso.teof...@gmail.com

 Hello Raimon,

 depending on the focus of your master's thesis, Lucene / Solr may or may
 not be the right project.
 Basically, if your sentiment analysis topic is tied to information
 retrieval (a very simple example: making a search engine which scores
 documents by boosting positive ones) then it could be ok (in this case you
 could leverage some classification capabilities Lucene has [1]); otherwise,
 if your task is more focused on the extraction of such sentiments, then
 other projects may fit better; see for example OpenNLP or Mahout or UIMA.

 My 2 cents,
 Tommaso

 [1] :
 http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr




 2013/3/19 Raimon Bosch raimon.bo...@gmail.com

 Anyone interested?


 2013/3/18 Raimon Bosch raimon.bo...@gmail.com


 Hi all,

 I would be interested in doing a Google Summer of Code this year with
 Lucene or Solr. My master's thesis topic is about sentiment analysis; is there
 any research in this direction inside Solr and Lucene? If there is any
 other interesting topic, I would be open to discussing it.

 Thanks in advance,
 Raimon Bosch.






Re: GSoC 2013

2013-03-19 Thread Raimon Bosch
Anyone interested?

2013/3/18 Raimon Bosch raimon.bo...@gmail.com


 Hi all,

 I would be interested in doing a Google Summer of Code this year with
 Lucene or Solr. My master's thesis topic is about sentiment analysis; is there
 any research in this direction inside Solr and Lucene? If there is any
 other interesting topic, I would be open to discussing it.

 Thanks in advance,
 Raimon Bosch.



Re: [GSoC] codec not registered?

2012-04-30 Thread Robert Muir
Since your test uses PerFieldPostingsFormat, it's going to write the
name of your format PForDelta into the index and expects to be able
to load it via the SPI mechanism.

So I think you should register your PForDeltaPostingsFormat in
lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
so that the SPI mechanism is able to look it up by name.
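
Concretely, the registration is a single line naming the implementation class in that services file. The package and class name below are placeholders for wherever the new format actually lives:

```
# lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
org.apache.lucene.codecs.pfordelta.PForDeltaPostingsFormat
```

The SPI lookup then resolves the short name ("PForDelta" here) that was written into the index back to this class.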

On Mon, Apr 30, 2012 at 2:39 PM, Han Jiang jiangha...@gmail.com wrote:
 Hi,

 I just imitated the MockFixedIntBlock and wrote a simple postings format,
 but when I tried to use ant test, it told me that:
  A SPI class of type org.apache.lucene.codecs.PostingsFormat with name
 'PForDelta' does not exist.
 Details are here: http://pastebin.com/EQDLwrn2

 To reproduce the error, you can use the patch and run mytest-min under
 trunk/lucene.

 It is strange that the error happens when calling writer.close(), and no
 error will occur if I change to an existing postings format. What did I
 miss?

 Billy

 --
 Han Jiang

 EECS, Peking University, China
 Every Effort Creates Smile

 Senior Student






-- 
lucidimagination.com




Re: [GSoC] codec not registered?

2012-04-30 Thread Han Jiang
Ah, I see. Thank you, Robert!

On Tue, May 1, 2012 at 2:46 AM, Robert Muir rcm...@gmail.com wrote:

 Since your test uses PerFieldPostingsFormat, it's going to write the
 name of your format PForDelta into the index and expects to be able
 to load it via the SPI mechanism.

 So I think you should register your PForDeltaPostingsFormat in

 lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
 so that the SPI mechanism is able to look it up by name.

 On Mon, Apr 30, 2012 at 2:39 PM, Han Jiang jiangha...@gmail.com wrote:
  Hi,
 
  I just imitated the MockFixedIntBlock and wrote a simple postings
 format,
  but when I tried to use ant test, it told me that:
   A SPI class of type org.apache.lucene.codecs.PostingsFormat with name
  'PForDelta' does not exist.
  Details are here: http://pastebin.com/EQDLwrn2
 
  To reproduce the error, you can use the patch and run mytest-min under
  trunk/lucene.
 
  It is strange that the error happens when calling writer.close(), and
 no
  error will occur if I change to an existing postings format. What did I
  miss?
 
  Billy
 
  --
  Han Jiang
 
  EECS, Peking University, China
  Every Effort Creates Smile
 
  Senior Student
 
 
 



 --
 lucidimagination.com





-- 
Han Jiang

EECS, Peking University, China
Every Effort Creates Smile

Senior Student


Re: GSoC 2012 - Refactoring IndexWriter (LUCENE-2026)

2012-04-05 Thread Timur Achmetow
Hi,

here's my first suggestion for the Refactoring steps:

The IW class is very big by now, and I would try to reduce the code
by delegating special functions to new components (pattern: SRP).
The IndexWriter would then keep most of its APIs and only delegate.

I would try to extract the internals from the following methods into new
components; for example, it could look like this:

   - addDocument: component SegmentWriter
   - addIndexes:  component IndexbasedWriter

What you think? Other ideas / suggestions / tips?
Should I send this mail to the lucene mailing list?

Thx for the feedback
Tim


Re: GSoC - Refactoring IndexWriter

2012-04-04 Thread Achmetow (Google)
Hey Simon, 

thx for your fast response!

 to begin with make sure you read this: 
http://wiki.apache.org/lucene-java/SummerOfCode2012
http://wiki.apache.org/lucene-java/HowToContribute

Okay, I read the documentation.

 Yeah we have multiple tests for IndexWriter (IW in short); they are all
basically in /lucene/core/src/test/org/apache/lucene/index
there is a bunch of them, but those are only the tests that test the IW
directly. Lots of other tests are involved. Whatever you do you should
run all core tests. The ones with NRT and Threads in the name are the
most evil :)

What does NRT mean?
Okay, I will now check out the trunk and run all the unit tests.

 I use Eclipse, but you can use the tool you like / know 
Ok, Thx. 

Re: GSoC - Refactoring IndexWriter

2012-04-03 Thread Simon Willnauer
Hey Tim, great to have you!
to begin with make sure you read this:
http://wiki.apache.org/lucene-java/SummerOfCode2012

On Wed, Apr 4, 2012 at 12:20 AM, Achmetow (Google)
achmeto...@googlemail.com wrote:
 Hi,

 I am a student from Germany and would like to contribute to the ASF Lucene
 project.

great! I am excited!

 In the ideas list I have found the following interesting project:
 Refactoring IndexWriter
 (https://issues.apache.org/jira/browse/LUCENE-2026)

 Now I have some questions to this project:

 1. Exist unit tests for this code (IndexWriter.java)?

Yeah we have multiple tests for IndexWriter (IW in short); they are all
basically in /lucene/core/src/test/org/apache/lucene/index

there is a bunch of them, but those are only the tests that test the IW
directly. Lots of other tests are involved. Whatever you do you should
run all core tests. The ones with NRT and Threads in the name are the
most evil :)

simonw$ find . -name TestIndexWriter*
./core/src/test/org/apache/lucene/index/TestIndexWriter.java
./core/src/test/org/apache/lucene/index/TestIndexWriterCommit.java
./core/src/test/org/apache/lucene/index/TestIndexWriterConfig.java
./core/src/test/org/apache/lucene/index/TestIndexWriterDelete.java
./core/src/test/org/apache/lucene/index/TestIndexWriterExceptions.java
./core/src/test/org/apache/lucene/index/TestIndexWriterForceMerge.java
./core/src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
./core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
./core/src/test/org/apache/lucene/index/TestIndexWriterMerging.java
./core/src/test/org/apache/lucene/index/TestIndexWriterNRTIsCurrent.java
./core/src/test/org/apache/lucene/index/TestIndexWriterOnDiskFull.java
./core/src/test/org/apache/lucene/index/TestIndexWriterOnJRECrash.java
./core/src/test/org/apache/lucene/index/TestIndexWriterReader.java
./core/src/test/org/apache/lucene/index/TestIndexWriterUnicode.java
./core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java

 2. Where I can find the code/software btw. component? (svn, git etc.)

here is a good guideline for getting started
http://wiki.apache.org/lucene-java/HowToContribute

 3. Which IDE I can use for this project? Your Suggestions (Eclipse)?

I use Eclipse, but you can use the tool you like / know
 4. What's about coding style guides in the ASF?

We have a code style in lucene which basically follows the sun
guidelines. I think there are templates for eclipse and intellij on
the contribution wiki.

hope that gets you started!

simon



 Thanks and Greetings

 Tim






Re: [GSoC] About how flexible indexing works in lucene 4.0

2012-03-28 Thread Michael McCandless
On Mon, Mar 26, 2012 at 6:59 PM, Han Jiang jiangha...@gmail.com wrote:
 Hi all,

 I was trying to figure out the control flow of IndexWriter and
 IndexSearcher, in order to get a better understanding of the idea behind
 Codec implementation.

 However, there seem to be some questions related to the code, which I just
 find inconvenient to discuss here.

 Maybe it is better to explain how much I understand, and ask for your
 comments?
 Here is what I understand:

 Index time:
 --First of all, IndexWriter should get a Codec configuration from an
 IndexWriterConfig.
 --When IndexWriter.addDocument is called, an instance of
 DocumentsWriterPerThread will be created,
 --It then passes the codec information through the indexing chain, and makes an
 instance of FreqProxTermsWriterPerField to call flush().
 --Then, based on the codec information, we create an instance of
 TermsConsumer; after this, we iterate over each termID, get the corresponding
 PostingConsumer, and save the information of each document.
 --Here, by inheriting TermsConsumer and PostingConsumer, we have
 IndexWriter create an index with new posting formats.

That sounds about right!

But, it's best to think of FreqProxTermsWriter/PerField as having its
own private in-memory postings format, and then, on flush, it
re-parses its in-memory postings and feeds them to the codec
(Fields/Terms/PostingsConsumer) for writing to the index.

 Query time:
 --Now, let's take Phrase Search as an example.
 --When IndexSearcher.search(phraseQuery,topN) is called, an instance of
 PhraseWeight will be created to wrap the query terms,
 --Then, IndexSearcher will create tasks to call method
 PhraseWeight.scorer(), inside which two instances: Terms and TermsEnum will
 be fetched from corresponding AtomicReader,
 --With the help of TermsEnum, for every phrase word, related docs and
 positions will be fetched through a DocsAndPositionsEnum, and results are thus
 generated.
 --Here, by inheriting TermsEnum and related *Enum classes, we have
 IndexSearcher (or IndexReader) understand our posting formats.

Sounds right!

 And, here I have some questions:

 1. Will multiple AtomicReaders be created if I run a search on an index with
 several segments? If not, when will there be multiple AtomicReaders? And to
 further the question, what is the idea behind introducing AtomicReader and
 CompositeReader into lucene 4?

Right, it's one atomic reader (SegmentReader) per segment.

We split composite/atomic readers in 4.0 so they'd be strongly typed
(they have different methods and before the split they'd throw
UnsupportedOperationExceptions from a number of methods, which was
messy).

 2. I must have missed something during query time, since the subtype of
 PostingsReaderBase is just absent from what I explained. Is it created when
 an instance of AtomicReader is fetched from the context? Where can I find the
 related code?

PostingsWriter/ReaderBase is what our default terms dictionaries
(Block/TreeTermsWriter/Reader) interact with.

So, eg the Lucene40PostingsWriter/Reader subclass PostingsWriter/ReaderBase.

 3. The wiki page here says we should provide an arbitrary skipDocs bit set
 during enumeration. Then, does the posting list itself remain unchanged, even if
 I call deleteDocuments()? Will deleted documents still remain in the
 postings file, even after segments get merged?

Deleted docs are simply marked in a bit set (the liveDocs bits), and
the postings files themselves are unchanged.

So when the postings reader enumerates the postings, it must check
the provided live docs (if it's not null) to confirm the doc is not
deleted.
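
The check described here can be modeled with the JDK alone; below, a java.util.BitSet stands in for the liveDocs bits and an int array for a posting list (the names are illustrative, not Lucene's actual API):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocsSketch {

    // Enumerate a posting list, skipping docs that are not marked live.
    // liveDocs == null means "no deletions", matching Lucene's convention.
    static List<Integer> liveHits(int[] postings, BitSet liveDocs) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (liveDocs == null || liveDocs.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] postings = {0, 3, 7};  // unchanged on disk even after deletes
        BitSet live = new BitSet();
        live.set(0);
        live.set(7);                 // doc 3 has been deleted
        System.out.println(liveHits(postings, live)); // [0, 7]
        System.out.println(liveHits(postings, null)); // [0, 3, 7]
    }
}
```

The key point is that deletion never rewrites the posting list; the filter is applied at enumeration time.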

Mike




Re: [GSoC] Question about LUCENE-3892

2012-03-23 Thread Michael McCandless
Hello,

One quick question up front: are you subscribed to the dev list?  If
not, you may have missed my response to your last email with GSoC
questions:


http://lucene.markmail.org/thread/lqv6lyql2nlagv7f#query:+page:1+mid:ubjsvvfviuaexqlo+state:results

Answers below:

On Fri, Mar 23, 2012 at 2:09 PM, Han Jiang jiangha...@gmail.com wrote:

 I scanned through some discussions and code around PForDelta, like
 LUCENE-1410, LUCENE-2903, ConversationBetweenMichaelAndLiLi. It is great to
 see so much information, and PForDelta seems to be a promising target. But
 as I look into the code in branch-bulkpostings, it seems that most of the
 algorithms have already been implemented. Then, what is required for
 LUCENE-3892: is the main target performance improvement, integration with the
 trunk version, or another implementation from the bottom up?

We can work out the scope... but I think success would be a useful
codec committed to 4.0?  Ideally, and I think likely, it shows faster
performance than our current default codec, in which case we may want
to change our default, depending on other factors...

Ie, you'd need to bring forward those old patches/branches to the
current codec APIs, do performance testing to understand where they do
well / poorly, whether more disk space is used, etc.  Perhaps iterate
on their implementations to improve performance...

If the project succeeds in building a committable PForDelta codec that
would be awesome!

If that somehow winds up being too little, you can explore other
intblock codecs as well...
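
As background on why these int-block codecs help at all: postings store strictly increasing doc IDs, so they are first delta-encoded, and PForDelta then bit-packs the small gaps, patching outliers as exceptions. The delta step alone, in stdlib Java (a conceptual sketch, not the bulkpostings code):

```java
public class DeltaSketch {

    // Doc IDs are strictly increasing, so the gaps between them are
    // small positive ints that pack into far fewer bits than raw IDs.
    static int[] encode(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }

    static int[] decode(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int prev = 0;
        for (int i = 0; i < gaps.length; i++) {
            prev += gaps[i];
            docIds[i] = prev;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] ids = {5, 8, 9, 40};
        System.out.println(java.util.Arrays.toString(encode(ids)));         // [5, 3, 1, 31]
        System.out.println(java.util.Arrays.toString(decode(encode(ids)))); // [5, 8, 9, 40]
    }
}
```

The performance work in the issue is about doing the packing/unpacking of these gaps in bulk, which is where the codec APIs come in.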

 And another question about development. I am quite curious why some classes
 such as StandardAnalyzer are not found in the trunk or branch-bulkpostings,
 but are replaced with Mock ones. Then how can I test my old code, if I want to
 integrate these classes with the trunk library?

We've moved all real analyzers to the module/analysis... what's in
trunk are test analyzers, which you should use for new tests since
they have more thorough checks.

Mike McCandless

http://blog.mikemccandless.com




Re: [GSoC]About some general information

2012-03-21 Thread Michael McCandless
Hello!  Answers below...:

On Wed, Mar 21, 2012 at 11:03 AM, Han Jiang jiangha...@gmail.com wrote:
 Hi All,

 I'm Billy, a senior undergraduate student at Peking University. I'm working
 in the area of Information Retrieval and Web Mining. Going through the
 idea list, I felt quite interested in LUCENE-3892 and LUCENE-3069. I am
 very proficient in Java, and have been using Lucene for about one year. I am
 looking forward to making a contribution to this project.

Awesome.

 Here, I have a few questions about lucene:

 First of all,  which version of lucene shall we use as a start point? The
 trunk or 3.5?

Both of these issues will be trunk only I think: they both are far
easier to do with the Codec API in 4.0.

 Is there any demo codes to show the idea of Codecs?

Maybe the simplest demo would be to look at the SimpleText codec?  It
roughly tries to have simple source code as well as a simple (text
only, human readable) on-disk format.

 How many posting formats are supposed to be implemented for
 LUCENE-3892?

This can be worked out when scoping the project... but I think getting
one postings format working well would be awesome :)  If somehow
that's too easy, then add more!

 Is there any further documentation for LUCENE-3069 ?

Not that I know of... but I suspect the approach can be very similar
to the MemoryPostingsFormat we already have, just that it'd only be
the terms data stored in the FST, while the postings
(docs/freqs/positions/offsets) are written to a file.

Ideally, it would just act like a different terms dictionary
implementation, ie so that we can then plug in any PostingsBaseFormat
(even the one from LUCENE-3892!).
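
The shape of that idea can be sketched in a few lines. This is a hypothetical illustration only, not Lucene's actual API: a real implementation would store the terms in an FST, but a sorted map stands in here to show the terms-dictionary/postings-file split.

```java
import java.util.TreeMap;

// Hypothetical sketch of the LUCENE-3069 idea: only the terms live in a
// compact in-memory structure (a real implementation would use Lucene's
// FST; a TreeMap stands in here), and each term maps to the offset in the
// postings file where its docs/freqs/positions/offsets were written.
class TermsDictSketch {
    // term -> byte offset of that term's postings in the postings file
    private final TreeMap<String, Long> termToPostingsOffset = new TreeMap<>();

    void addTerm(String term, long postingsFilePointer) {
        termToPostingsOffset.put(term, postingsFilePointer);
    }

    // Seek a term: resolve it to where its postings start on disk; the
    // pluggable postings format would then decode from that offset.
    Long seek(String term) {
        return termToPostingsOffset.get(term);
    }
}
```

Since the terms structure only resolves terms to file pointers, any PostingsBaseFormat could in principle be plugged in behind it, which is exactly the decoupling described above.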

 Thank you!

You're welcome, and welcome to Lucene/Solr!

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSOC 2012?

2012-03-10 Thread Simon Willnauer
Mark, can you open an issue for this and label it as:

gsoc2012
lucene-gsoc-12
mentor

just like this one https://issues.apache.org/jira/browse/LUCENE-2562

thanks,

simon

On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote:
Does anyone have any ideas?

 A framework for match metadata?

 Similar to the way tokenization was changed to allow tokenizers to enrich 
 a stream of tokens with arbitrary attributes, Scorers could provide 
 MatchAttributes to provide arbitrary metadata about the stream of matches 
 they produce.
 Same model is used - callers decide in advance which attribute decorations 
 they want to consume and Scorers modify a singleton object which can be 
 cloned if multiple attributes need to be retained by the caller.

 Helps support highlighting, explain and enables communication of added 
 information between query objects in the tree.
 LUCENE-1999 was an example of a horrible work-around where additional match 
 information that was required was smuggled through by bit-twiddling the score 
  - this is because score is the only bit of match context we currently pass 
 in Lucene APIs.

 Cheers
 Mark




 
 From: Robert Muir rcm...@gmail.com
 To: dev@lucene.apache.org
 Sent: Friday, 2 March 2012, 10:30
 Subject: GSOC 2012?

 Hello,

 I was asked by a student if we are participating in GSOC this year. I
 hope the answer is yes?

 If we are planning to, I think it would be good if we came up with a
 list on the wiki of potential tasks. Does anyone have any ideas?

 One suggested idea I had (similar to LUCENE-2959 last year) would be
 to add a flexible query expansion framework.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSOC 2012?

2012-03-02 Thread Simon Willnauer
On Fri, Mar 2, 2012 at 11:30 AM, Robert Muir rcm...@gmail.com wrote:
 Hello,

 I was asked by a student if we are participating in GSOC this year. I
 hope the answer is yes?

 If we are planning to, I think it would be good if we came up with a
 list on the wiki of potential tasks. Does anyone have any ideas?

 One suggested idea I had (similar to LUCENE-2959 last year) would be
 to add a flexible query expansion framework.


+1 I'd love to help somebody to get PositionIterators in!!!

simon
 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSOC 2012?

2012-03-02 Thread mark harwood
Does anyone have any ideas?

A framework for match metadata?

Similar to the way tokenization was changed to allow tokenizers to enrich a 
stream of tokens with arbitrary attributes, Scorers could provide 
MatchAttributes to provide arbitrary metadata about the stream of matches 
they produce.
Same model is used - callers decide in advance which attribute decorations they 
want to consume and Scorers modify a singleton object which can be cloned if 
multiple attributes need to be retained by the caller.

Helps support highlighting, explain and enables communication of added 
information between query objects in the tree.
LUCENE-1999 was an example of a horrible work-around where additional match 
information that was required was smuggled through by bit-twiddling the score  
- this is because score is the only bit of match context we currently pass in 
Lucene APIs.
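
As a rough sketch of the pattern being proposed (all names here, MatchAttribute, MatchAttributeSource, and ProximityAttribute, are invented for illustration and are not part of any Lucene API), the caller registers the attribute classes it wants up front and the scorer mutates the shared singleton instance per match:

```java
import java.util.HashMap;
import java.util.Map;

// Marker interface for per-match metadata a Scorer could expose.
interface MatchAttribute {}

// One example: a proximity measure for the current match.
class ProximityAttribute implements MatchAttribute {
    float proximity;
}

// Mirrors the analysis AttributeSource pattern: one singleton instance per
// attribute class, created on first request and mutated in place per match.
class MatchAttributeSource {
    private final Map<Class<? extends MatchAttribute>, MatchAttribute> attrs =
        new HashMap<>();

    @SuppressWarnings("unchecked")
    <T extends MatchAttribute> T addAttribute(Class<T> clazz) {
        return (T) attrs.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(c + " needs a no-arg constructor", e);
            }
        });
    }
}
```

A Scorer advancing over matches would set `proximity` on the shared instance each time it positions on a new doc; a caller that wants to retain values across matches clones the attribute, exactly as token attributes work today.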

Cheers
Mark





From: Robert Muir rcm...@gmail.com
To: dev@lucene.apache.org 
Sent: Friday, 2 March 2012, 10:30
Subject: GSOC 2012?

Hello,

I was asked by a student if we are participating in GSOC this year. I
hope the answer is yes?

If we are planning to, I think it would be good if we came up with a
list on the wiki of potential tasks. Does anyone have any ideas?

One suggested idea I had (similar to LUCENE-2959 last year) would be
to add a flexible query expansion framework.

-- 
lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSOC 2012?

2012-03-02 Thread Simon Willnauer
I created an initial GSOC 2012 page here:
http://wiki.apache.org/lucene-java/SummerOfCode2012

simon

On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote:
Does anyone have any ideas?

 A framework for match metadata?

 Similar to the way tokenization was changed to allow tokenizers to enrich 
 a stream of tokens with arbitrary attributes, Scorers could provide 
 MatchAttributes to provide arbitrary metadata about the stream of matches 
 they produce.
 Same model is used - callers decide in advance which attribute decorations 
 they want to consume and Scorers modify a singleton object which can be 
 cloned if multiple attributes need to be retained by the caller.

 Helps support highlighting, explain and enables communication of added 
 information between query objects in the tree.
 LUCENE-1999 was an example of a horrible work-around where additional match 
 information that was required was smuggled through by bit-twiddling the score 
  - this is because score is the only bit of match context we currently pass 
 in Lucene APIs.

 Cheers
 Mark




 
 From: Robert Muir rcm...@gmail.com
 To: dev@lucene.apache.org
 Sent: Friday, 2 March 2012, 10:30
 Subject: GSOC 2012?

 Hello,

 I was asked by a student if we are participating in GSOC this year. I
 hope the answer is yes?

 If we are planning to, I think it would be good if we came up with a
 list on the wiki of potential tasks. Does anyone have any ideas?

 One suggested idea I had (similar to LUCENE-2959 last year) would be
 to add a flexible query expansion framework.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSOC 2012?

2012-03-02 Thread Robert Muir
Thanks for helping to get this started Simon and Mark!

On Fri, Mar 2, 2012 at 7:10 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 I created an initial GSOC 2012 page here:
 http://wiki.apache.org/lucene-java/SummerOfCode2012

 simon

 On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote:
Does anyone have any ideas?

 A framework for match metadata?

 Similar to the way tokenization was changed to allow tokenizers to enrich 
 a stream of tokens with arbitrary attributes, Scorers could provide 
 MatchAttributes to provide arbitrary metadata about the stream of matches 
 they produce.
 Same model is used - callers decide in advance which attribute decorations 
 they want to consume and Scorers modify a singleton object which can be 
 cloned if multiple attributes need to be retained by the caller.

 Helps support highlighting, explain and enables communication of added 
 information between query objects in the tree.
 LUCENE-1999 was an example of a horrible work-around where additional match 
 information that was required was smuggled through by bit-twiddling the 
 score  - this is because score is the only bit of match context we currently 
 pass in Lucene APIs.

 Cheers
 Mark




 
 From: Robert Muir rcm...@gmail.com
 To: dev@lucene.apache.org
 Sent: Friday, 2 March 2012, 10:30
 Subject: GSOC 2012?

 Hello,

 I was asked by a student if we are participating in GSOC this year. I
 hope the answer is yes?

 If we are planning to, I think it would be good if we came up with a
 list on the wiki of potential tasks. Does anyone have any ideas?

 One suggested idea I had (similar to LUCENE-2959 last year) would be
 to add a flexible query expansion framework.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC: LUCENE-2308: Separately specify a field's type

2011-05-13 Thread Nikola Tanković
2011/5/12 Michael McCandless luc...@mikemccandless.com

 2011/5/9 Nikola Tanković nikola.tanko...@gmail.com:

   Introduction of an FieldType class that will hold all the extra
   properties
   now stored inside Field instance other than field value itself.
 
  Seems like this is an easy first baby step -- leave current Field
  class, but break out the type details into a separate class that can
  be shared across Field instances.
 
  Yes, I agree, this could be a good first step. Mike submitted a patch on
  issue #2308. I think it's a solid base for this.

 Make that Chris.


Ouch, sorry!



   New FieldTypeAttribute interface will be added to handle extension
 with
   new
   field properties inspired by IndexWriterConfig.
 
  How would this work?  What's an example compelling usage?  An app
  could use this for extensibility, and then make a matching codec that
  picks up this attr?  EG, say, maybe for marking that a field is a
  primary key field and then codec could optimize accordingly...?
 
  Well, that could be a very interesting scenario. It didn't ring a bell for me
  as a possible codec usage, but it seems very reasonable. Attributes otherwise
  don't make much sense, unless properly used in custom codecs.
 
  How will we ensure attribute and codec compatibility?

 I'm just thinking we should have concrete reasons in mind for cutting
 over to attributes here... I'd rather see a fixed, well thought out
 concrete FieldType hierarchy first...


Yes, I couldn't agree more, and I also think Chris has some great ideas in
this area, given his work on spatial indexing, which tends to make use of
these additional attributes.



   Refactoring and dividing of settings for term frequency and
 positioning
   can
   also be done (LUCENE-2048)
 
  Ahh great!  So we can omit-positions-but-not-TF.
 
   Discuss possible effects of completion of LUCENE-2310 on this project
 
  This one is badly needed... but we should keep your project focused.
 
 
  We'll tackle this one afterwards.

 Good.

   Adequate Factory class for easier configuration of new Field instances
   together with manually added new FieldTypeAttributes
   FieldType, once instantiated is read-only. Only fields value can be
   changed.
 
  OK.
 
   Simple hierarchy of Field classes with core properties logically
   predefaulted. E.g.:
  
   NumberField,
 
  Can't this just be our existing NumericField?
 
  Yes, this is classic NumericField with changes proposed in LUCENE-2310.
 Tim
  Smith mentioned that Fieldable class should be kept for custom
  implementations to reduce number of setters (for defaults).
  Chris Male suggested new CoreFieldTypeAttribute interface, so maybe it
  should be implemented instead of Fieldable for custom implementations, so
  both Fieldable and AbstractField are not needed anymore.
  In my opinion Field should become abstract and be extended by the others.
  Another proposal: how about keeping only Field (with no hierarchy) and
  moving the hierarchy to FieldType, such as NumericFieldType and
  StringFieldType, since this hierarchy concerns type information only?

 I think hierarchy of both types and the value containers that hold
 the corresponding values could make sense?


Hmm, I think we should get more opinions on this one also.



  e.g. Usage:
  FieldType number = new NumericFieldType();
  Field price = new Field();
  price.setType(number);
  // but this is much cleaner...
  Field price = new NumericField();
  so maybe we should have parallel XYZField and XYZFieldType classes...
  Am I overcomplicating this?
 
   StringField,
 
  This would be like NOT_ANALYZED?
 
  Yes, strings are often one word only. Or maybe we can name it NameField,
  NonAnalyzedField or something.

 StringField sounds good actually...

   TextField,
 
  This would be ANALYZED?
 
  Yes.
 

 OK.

   What is the best way to break this into small baby steps?
 
  Hopefully this becomes clearer as we iterate.
 
  Well, we know the first step: moving type details into FieldType class.

 Yes!

 Somehow tying into this as well is a stronger decoupling of the
 indexer from analysis/document.  Ie, what indexer needs of a document
 is very minimal -- just an iterable over indexed & stored values.
 Separately we can still provide a full featured Document class w/
 add, get, remove, etc., but that's outside of the indexer.


I'll get back to this one after additional research. Maybe we should do a
couple more iterations, then I'll summarize the conclusions.



 Mike

 http://blog.mikemccandless.com


Nikola


Re: GSoC: LUCENE-2308: Separately specify a field's type

2011-04-14 Thread Michael McCandless
2011/4/13 Nikola Tanković nikola.tanko...@gmail.com:
 Hi all,
 if everything goes well I'll be delighted to be part of this project this
 summer together with my assigned mentor Mike. My task will be to introduce
 new classes to the Lucene core which will make it possible to separate a
 Field's Lucene properties from its value
 (https://issues.apache.org/jira/browse/LUCENE-2308).

Welcome Nikola!

 Changes will include:

 Introduction of a FieldType class that will hold all the extra properties
 now stored inside a Field instance, other than the field value itself.

Seems like this is an easy first baby step -- leave current Field
class, but break out the type details into a separate class that can
be shared across Field instances.
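
A minimal sketch of that baby step (illustrative names only, not the final Lucene API): the type object is immutable and shared, so many Field instances point at one FieldType and only the value varies.

```java
// Illustrative sketch of LUCENE-2308's first step: an immutable, shareable
// type object, with Field keeping only its name, value, and a type reference.
class FieldType {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldType(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

class Field {
    final String name;
    final FieldType type; // shared, read-only type details
    String value;         // only the value remains mutable

    Field(String name, FieldType type, String value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }
}
```

Two fields built with the same FieldType instance now share one set of type details, instead of each Field carrying its own copies of the stored/indexed/tokenized flags.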

 New FieldTypeAttribute interface will be added to handle extension with new
 field properties inspired by IndexWriterConfig.

How would this work?  What's an example compelling usage?  An app
could use this for extensibility, and then make a matching codec that
picks up this attr?  EG, say, maybe for marking that a field is a
primary key field and then codec could optimize accordingly...?

 Refactoring and dividing of settings for term frequency and positioning can
 also be done (LUCENE-2048)

Ahh great!  So we can omit-positions-but-not-TF.

 Discuss possible effects of completion of LUCENE-2310 on this project

This one is badly needed... but we should keep your project focused.

 Adequate Factory class for easier configuration of new Field instances
 together with manually added new FieldTypeAttributes
 FieldType, once instantiated is read-only. Only fields value can be changed.

OK.

 Simple hierarchy of Field classes with core properties logically
 predefaulted. E.g.:

 NumberField,

Can't this just be our existing NumericField?

 StringField,

This would be like NOT_ANALYZED?

 TextField,

This would be ANALYZED?

 NonIndexedField,

This would be only stored?

 My questions and issues:

 Backward compatibility? Will this go to Lucene 3.0?

Maybe focus on 4.0 for starters and then if there's a nice backport we
can do that...?

 What is the best way to break this into small baby steps?

Hopefully this becomes clearer as we iterate.

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC Lucene proposals

2011-04-06 Thread Vinicius Barrox
Done!

--- Em qua, 6/4/11, Adriano Crestani adrianocrest...@apache.org escreveu:

De: Adriano Crestani adrianocrest...@apache.org
Assunto: GSoC Lucene proposals
Para: dev@lucene.apache.org
Data: Quarta-feira, 6 de Abril de 2011, 22:43

Hi students,
We are receiving very good proposals this year, I am sure mentors are very 
happy :)
I have one suggestion to make our (the mentors') lives easier. Please add the JIRA 
identifier to your proposal's title, for example: "LUCENE-2883: Consolidate Solr & 
Lucene FunctionQuery into modules". This will let mentors quickly search for 
Lucene and Solr proposals, as all Apache proposals are mixed together and there 
is no way to sort by project.


Thanks!
--Adriano Crestani


Re: GSoC 2011

2011-03-24 Thread Adriano Crestani
Hi Phillipe,

You could start taking a look at these projects:

LUCENE-2979 https://issues.apache.org/jira/browse/LUCENE-2979
LUCENE-2309 https://issues.apache.org/jira/browse/LUCENE-2309
LUCENE-2450 https://issues.apache.org/jira/browse/LUCENE-2450
LUCENE-1768 https://issues.apache.org/jira/browse/LUCENE-1768

These ones are either related to analyzers/attributes or the query parser.

I hope this helps you to decide ;)

On Thu, Mar 24, 2011 at 1:09 AM, Phillipe Ramalho 
phillipe.rama...@gmail.com wrote:

 Hello,

 I am planning to submit a project proposal to GSoC 2011 and Lucene seems to
 have a lot of GSoC projects this year. Last year I did a GSoC project using
 Lucene for PhotArk project. This year, instead of just using Lucene, I am
 planning to contribute code to it.

 My experience with Lucene is just as a regular user, the only code I have
 changed/extended so far was token streams/analyzers and query parser, so I
 have more knowledge on this part of the code. Based on that, I'm planning to
 focus on query parser and analyzer/token stream projects. Does that sound
 reasonable?

 I will be studying the code and planning the proposal(s), so you should
 start seeing more posts from me in the next few days.

 --
 Phillipe Ramalho



Re: [GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]

2011-03-23 Thread David Nemeskey
Hey Simon and all,

May we get an update on this? I understand that Google has published the list 
of accepted organizations, which -- not surprisingly -- includes the ASF. Is 
there any information on how many slots Apache got, and which issues will be 
selected?

The student application period opens on the 28th, so I'm just wondering if I 
should go ahead and apply or wait for the decision.

Thanks,
David

On 2011 March 11, Friday 17:23:58 Simon Willnauer wrote:
 Hey folks,
 
 Google Summer of Code 2011 is very close and the Project Applications
 Period has started recently. Now it's time to get some excited students
 on board for this year's GSoC.
 
 I encourage students to submit an application to the Google Summer of Code
 web-application. Lucene & Solr are amazing projects and GSoC is an
 incredible opportunity to join the community and push the project
 forward.
 
 If you are a student and you are interested spending some time on a
 great open source project while getting paid for it, you should submit
 your application from March 28 - April 8, 2011. There are only 3
 weeks until this process starts!
 
  Quote from the GSoC website: "We hear almost universally from our
  mentoring organizations that the best applications they receive are
  from students who took the time to interact and discuss their ideas
  before submitting an application, so make sure to check out each
  organization's Ideas list to get to know a particular open source
  organization better."
 
  So if you have any ideas what Lucene & Solr should have, or if you
 find any of the GSoC pre-selected projects [1] interesting, please
 join us on dev@lucene.apache.org [2].  Since you as a student must
 apply for a certain project via the GSoC website [3], it's a good idea
 to work on it ahead of time and include the community and possible
 mentors as soon as possible.
 
 Open source development here at the Apache Software
 Foundation happens almost exclusively in the public and I encourage you to
 follow this. Don't mail folks privately; please use the mailing list to
 get the best possible visibility and attract interested community
 members and push your idea forward. As always, it's the idea that
 counts not the person!
 
  That said, please do not underestimate the complexity of even small
  GSoC projects. Don't try to rewrite Lucene or Solr!  A project
  usually gains more from a smaller, well discussed and carefully
  crafted & tested feature than from a half-baked monster change that's
  too large to work with.
 
 Once your proposal has been accepted and you begin work, you should
 give the community the opportunity to iterate with you.  We prefer
 progress over perfection so don't hesitate to describe your overall
 vision, but when the rubber meets the road let's take it in small
  steps.  A code patch of 20 KB is likely to be reviewed very quickly so
  you get fast feedback, while a patch even 60 KB in size can take very
 long. So try to break up your vision and the community will work with
 you to get things done!
 
  On behalf of the Lucene & Solr community,
 
 Go! join the mailing list and apply for GSoC 2011,
 
 Simon
 
 [1]
  https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11
  [2]
 http://lucene.apache.org/java/docs/mailinglists.html
 [3] http://www.google-melange.com
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]

2011-03-23 Thread Simon Willnauer
On Wed, Mar 23, 2011 at 9:37 AM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Hey Simon and all,

 May we get an update on this? I understand that Google has published the list
 of accepted organizations, which -- not surprisingly -- includes the ASF. Is
 there any information on how many slots Apache got, and which issues will be
 selected?

 The student application period opens on the 28th, so I'm just wondering if I
 should go ahead and apply or wait for the decision.

David,

you should go ahead and apply via the GSoC website and reference the
issue there; this is how I understand it works.
We will later rate the proposals from the GSoC website and decide
which we choose. This is also when slots get assigned.

simon

 Thanks,
 David

 On 2011 March 11, Friday 17:23:58 Simon Willnauer wrote:
 Hey folks,

 Google Summer of Code 2011 is very close and the Project Applications
 Period has started recently. Now it's time to get some excited students
 on board for this year's GSoC.

 I encourage students to submit an application to the Google Summer of Code
 web-application. Lucene & Solr are amazing projects and GSoC is an
 incredible opportunity to join the community and push the project
 forward.

 If you are a student and you are interested spending some time on a
 great open source project while getting paid for it, you should submit
 your application from March 28 - April 8, 2011. There are only 3
 weeks until this process starts!

  Quote from the GSoC website: "We hear almost universally from our
  mentoring organizations that the best applications they receive are
  from students who took the time to interact and discuss their ideas
  before submitting an application, so make sure to check out each
  organization's Ideas list to get to know a particular open source
  organization better."

  So if you have any ideas what Lucene & Solr should have, or if you
 find any of the GSoC pre-selected projects [1] interesting, please
 join us on dev@lucene.apache.org [2].  Since you as a student must
 apply for a certain project via the GSoC website [3], it's a good idea
 to work on it ahead of time and include the community and possible
 mentors as soon as possible.

 Open source development here at the Apache Software
 Foundation happens almost exclusively in the public and I encourage you to
 follow this. Don't mail folks privately; please use the mailing list to
 get the best possible visibility and attract interested community
 members and push your idea forward. As always, it's the idea that
 counts not the person!

  That said, please do not underestimate the complexity of even small
  GSoC projects. Don't try to rewrite Lucene or Solr!  A project
  usually gains more from a smaller, well discussed and carefully
  crafted & tested feature than from a half-baked monster change that's
  too large to work with.

 Once your proposal has been accepted and you begin work, you should
 give the community the opportunity to iterate with you.  We prefer
 progress over perfection so don't hesitate to describe your overall
 vision, but when the rubber meets the road let's take it in small
  steps.  A code patch of 20 KB is likely to be reviewed very quickly so
  you get fast feedback, while a patch even 60 KB in size can take very
 long. So try to break up your vision and the community will work with
 you to get things done!

  On behalf of the Lucene & Solr community,

 Go! join the mailing list and apply for GSoC 2011,

 Simon

 [1]
  https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11
  [2]
 http://lucene.apache.org/java/docs/mailinglists.html
 [3] http://www.google-melange.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-03-10 Thread David Nemeskey
Ok, I have created a new issue, LUCENE-2959 for this project. I have uploaded 
the pdfs and added the gsoc2011 and lucene-gsoc-2011 labels as well.

David

On 2011 March 09, Wednesday 21:58:53 Simon Willnauer wrote:
 On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote:
  I think we, Lucene committers, need to identify who is willing to mentor.
 In my experience, it is less than 5 hours a week.  Most of the work
  is done as part of the community.  Sometimes you have to be tough and
  fail someone (I did last year) but most of the time, if you take the
  time to interview the candidates up front, it is a good experience for
  everyone.
 
 count me in
 
  I'd add it would be useful to have everyone put the lucene-gsoc-11 label
  on their issues too, that way we can quickly find the Lucene ones.
 
 done on at least one ;)
 
 simon
 
  Also, feel free to label existing bugs.
  
  On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote:
  Hey David and all others who want to contribute to GSoC,
  
  the ASF has applied for GSoC 2011 as a mentoring organization. As a
  ASF project we don't need to apply directly though but we need to
  register our ideas now. This works like almost anything in the ASF
  through JIRA. All ideas should be recorded as JIRA tickets  labeled
  with gsoc2011. Once this is done it will show up here:
  http://s.apache.org/gsoc2011tasks
  
  Everybody who is interested in GSoC as a mentor or student should now
  read this too http://community.apache.org/gsoc.html
  
  
  Thanks,
  
  Simon
  
  
  
  
  On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey
  
  nemeskey.da...@sztaki.hu wrote:
   Please find the implementation plan attached. The word "soon" gets a
   new meaning when power outages are taken into account. :)
  
  As before, comments are welcome.
  
  David
  
  On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote:
  I think that is good for now. I should get started on codeawards and
  wrap up our proposals. I hope I can do that this week.
  
  simon
  
  On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey
  
  nemeskey.da...@sztaki.hu wrote:
  Hey,
  
  I have written the proposal. Please let me know if you want more /
  less of certain parts. Should I upload it somewhere?
  
  Implementation plan soon to follow.
  
  Sorry for the late reply; I have been rather busy these past few
  weeks.
  
  David
  
  On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
  Hey David,
  
  I saw that you added a tiny line to the GSoC Lucene wiki - thanks
  for that.
  
  On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
  
  nemeskey.da...@sztaki.hu wrote:
  Hi guys,
  
  Mark, Robert, Simon: thanks for the support! I really hope we can
  work together this summer (and before that, obviously).
  
  Same here!
  
  According to http://www.google-
  melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
  there's still some time until the application period. So let me use
  this week to finish my PhD research plan, and get back to you next
  week.
  
  I am not really familiar with how the program works, i.e. how
  detailed the application description should be, when mentorship is
  decided, etc. so I guess we will have a lot to talk about. :)
  
  so from a 1ft view it work like this:
  
   1. Write up a short proposal of what your idea is about
   2. make it public! and publish an implementation plan - how you would
   want to realize your proposal. If you don't follow that 100% in the
   actual implementation, don't worry. It's just meant to give us an idea
   that you know what you are doing and where you want to go - something
   like a rough one-page (A4) design doc.
   3. give other people the chance to apply for the same suggestion
   (this is how it works though)
   4. Let the ASF / us assign one or more possible mentors to it
   5. let us apply for a slot in GSoC (those are limited per organization)
   6. get accepted
   7. rock it!
  
  (Actually, should we move this discussion private?)
  
  no - we usually do everything in public, except for discussions within
  the PMC that are meant to be private for legal or similar reasons.
  Let's stick to the mailing list for all communication unless you have
  something that clearly should not be public. This also gives other
  contributors a chance to help and get interested in your work!!
  
  simon
  
  David
  
  Hi David, honestly this sounds fantastic.
  
  It would be great to have someone to work with us on this issue!
  
  To date, progress is pretty slow-going (minor improvements,
  cleanups, additional stats here and there)... but we really need
  all the help we can get, especially from people who have a really
  good understanding of the various models.
  
  In case you are interested, here are some references to
  discussions about adding more flexibility (with some prototypes
  etc):
  http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
  

Re: GSoC

2011-03-10 Thread Simon Willnauer
awesome thanks!

simon

On Thu, Mar 10, 2011 at 11:54 AM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Ok, I have created a new issue, LUCENE-2959 for this project. I have uploaded
 the pdfs and added the gsoc2011 and lucene-gsoc-2011 labels as well.

 David

 On 2011 March 09, Wednesday 21:58:53 Simon Willnauer wrote:
 On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote:
  I think we, Lucene committers, need to identify who is willing to mentor.
     In my experience, it is less than 5 hours a week.  Most of the work
  is done as part of the community.  Sometimes you have to be tough and
  fail someone (I did last year) but most of the time, if you take the
  time to interview the candidates up front, it is a good experience for
  everyone.

 count me in

  I'd add it would be useful to have everyone put the lucene-gsoc-11 label
  on their issues too, that way we can quickly find the Lucene ones.

 done on at least one ;)

 simon

  Also, feel free to label existing bugs.
 
  On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote:
  Hey David and all others who want to contribute to GSoC,
 
  the ASF has applied for GSoC 2011 as a mentoring organization. As an
  ASF project we don't need to apply directly, but we do need to
  register our ideas now. This works, like almost anything in the ASF,
  through JIRA. All ideas should be recorded as JIRA tickets labeled
  with gsoc2011. Once this is done it will show up here:
  http://s.apache.org/gsoc2011tasks
 
  Everybody who is interested in GSoC as a mentor or student should also
  read this: http://community.apache.org/gsoc.html
 
 
  Thanks,
 
  Simon
 
 
 
 
  On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
  Please find the implementation plan attached. The word "soon" gets a
  new meaning when power outages are taken into account. :)
 
  As before, comments are welcome.
 
  David
 
  On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote:
  I think that is good for now. I should get started on codeawards and
  wrap up our proposals. I hope I can do that this week.
 
  simon
 
  On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
  Hey,
 
  I have written the proposal. Please let me know if you want more /
  less of certain parts. Should I upload it somewhere?
 
  Implementation plan soon to follow.
 
  Sorry for the late reply; I have been rather busy these past few
  weeks.
 
  David
 
  On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
  Hey David,
 
  I saw that you added a tiny line to the GSoC Lucene wiki - thanks
  for that.
 
  On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
  Hi guys,
 
  Mark, Robert, Simon: thanks for the support! I really hope we can
  work together this summer (and before that, obviously).
 
  Same here!
 
  According to http://www.google-
  melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
  there's still some time until the application period. So let me use
  this week to finish my PhD research plan, and get back to you next
  week.
 
  I am not really familiar with how the program works, i.e. how
  detailed the application description should be, when mentorship is
  decided, etc. so I guess we will have a lot to talk about. :)
 
  so from a 1ft view it works like this:
  
  1. Write up a short proposal about what your idea is.
  2. Make it public, and publish an implementation plan - how you would
  want to realize your proposal. If you don't follow that 100% in the
  actual implementation, don't worry. It's just meant to give us an idea
  that you know what you are doing and where you want to go - something
  like a rough one-page (A4) design doc.
  3. Give other people the chance to apply for the same suggestion
  (this is how it works, though).
  4. Let the ASF / us assign one or more possible mentors to it.
  5. Let us apply for a slot in GSoC (those are limited per
  organization).
  6. Get accepted.
  7. Rock it!
 
  (Actually, should we move this discussion private?)
 
  no - we usually do everything in public, except for discussions within
  the PMC that are meant to be private for legal or similar reasons.
  Let's stick to the mailing list for all communication unless you have
  something that clearly should not be public. This also gives other
  contributors a chance to help and get interested in your work!!
 
  simon
 
  David
 
  Hi David, honestly this sounds fantastic.
 
  It would be great to have someone to work with us on this issue!
 
  To date, progress is pretty slow-going (minor improvements,
  cleanups, additional stats here and there)... but we really need
  all the help we can get, especially from people who have a really
  good understanding of the various models.
 
  In case you are interested, here are some references to
  discussions about adding more flexibility (with some prototypes
  etc):
  http://www.lucidimagination.com/search/document/72787e0e54f798e4/
 

Re: GSoC

2011-03-10 Thread Michael McCandless
On Wed, Mar 9, 2011 at 3:58 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote:
 I think we, Lucene committers, need to identify who is willing to mentor.    
 In my experience, it is less than 5 hours a week.  Most of the work is done 
 as part of the community.  Sometimes you have to be tough and fail someone 
 (I did last year) but most of the time, if you take the time to interview 
 the candidates up front, it is a good experience for everyone.

 count me in

I'll also be a GSOC mentor!

-- 
Mike

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-03-09 Thread Grant Ingersoll
I think we, Lucene committers, need to identify who is willing to mentor. In 
my experience, it is less than 5 hours a week.  Most of the work is done as 
part of the community.  Sometimes you have to be tough and fail someone (I did 
last year) but most of the time, if you take the time to interview the 
candidates up front, it is a good experience for everyone.

I'd add it would be useful to have everyone put the lucene-gsoc-11 label on 
their issues too, that way we can quickly find the Lucene ones.

Also, feel free to label existing bugs.


On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote:

 Hey David and all others who want to contribute to GSoC,
 
 the ASF has applied for GSoC 2011 as a mentoring organization. As an
 ASF project we don't need to apply directly, but we do need to
 register our ideas now. This works, like almost anything in the ASF,
 through JIRA. All ideas should be recorded as JIRA tickets labeled
 with gsoc2011. Once this is done it will show up here:
 http://s.apache.org/gsoc2011tasks
 
 Everybody who is interested in GSoC as a mentor or student should also
 read this: http://community.apache.org/gsoc.html
 
 
 Thanks,
 
 Simon
 
 
 
 
 On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey
 nemeskey.da...@sztaki.hu wrote:
 Please find the implementation plan attached. The word "soon" gets a new
 meaning when power outages are taken into account. :)
 
 As before, comments are welcome.
 
 David
 
 On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote:
 I think that is good for now. I should get started on codeawards and
 wrap up our proposals. I hope I can do that this week.
 
 simon
 
 On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey
 
 nemeskey.da...@sztaki.hu wrote:
 Hey,
 
 I have written the proposal. Please let me know if you want more / less
 of certain parts. Should I upload it somewhere?
 
 Implementation plan soon to follow.
 
 Sorry for the late reply; I have been rather busy these past few weeks.
 
 David
 
 On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
 Hey David,
 
 I saw that you added a tiny line to the GSoC Lucene wiki - thanks for
 that.
 
 On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
 
 nemeskey.da...@sztaki.hu wrote:
 Hi guys,
 
 Mark, Robert, Simon: thanks for the support! I really hope we can work
 together this summer (and before that, obviously).
 
 Same here!
 
 According to http://www.google-
 melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
 there's still some time until the application period. So let me use
 this week to finish my PhD research plan, and get back to you next
 week.
 
 I am not really familiar with how the program works, i.e. how detailed
 the application description should be, when mentorship is decided,
 etc. so I guess we will have a lot to talk about. :)
 
 so from a 1ft view it works like this:
 
 1. Write up a short proposal about what your idea is.
 2. Make it public, and publish an implementation plan - how you would
 want to realize your proposal. If you don't follow that 100% in the
 actual implementation, don't worry. It's just meant to give us an idea
 that you know what you are doing and where you want to go - something
 like a rough one-page (A4) design doc.
 3. Give other people the chance to apply for the same suggestion
 (this is how it works, though).
 4. Let the ASF / us assign one or more possible mentors to it.
 5. Let us apply for a slot in GSoC (those are limited per organization).
 6. Get accepted.
 7. Rock it!
 
 (Actually, should we move this discussion private?)
 
 no - we usually do everything in public, except for discussions within
 the PMC that are meant to be private for legal or similar reasons.
 Let's stick to the mailing list for all communication unless you have
 something that clearly should not be public. This also gives other
 contributors a chance to help and get interested in your work!!
 
 simon
 
 David
 
 Hi David, honestly this sounds fantastic.
 
 It would be great to have someone to work with us on this issue!
 
 To date, progress is pretty slow-going (minor improvements, cleanups,
 additional stats here and there)... but we really need all the help
 we can get, especially from people who have a really good
 understanding of the various models.
 
 In case you are interested, here are some references to discussions
 about adding more flexibility (with some prototypes etc):
 http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
 https://issues.apache.org/jira/browse/LUCENE-2392
 
 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
 
 nemeskey.da...@sztaki.hu wrote:
 Hi all,
 
 I have already sent this mail to Simon Willnauer, and he suggested
 me to post it here for discussion.
 
 I am David Nemeskey, a PhD student at the Eotvos Lorand University,
 Budapest, Hungary. I am doing IR-related research, and we have
 considered using Lucene as our search engine. We were quite
 satisfied with the speed and ease 

Re: GSoC

2011-03-09 Thread Simon Willnauer
On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote:
 I think we, Lucene committers, need to identify who is willing to mentor.    
 In my experience, it is less than 5 hours a week.  Most of the work is done 
 as part of the community.  Sometimes you have to be tough and fail someone (I 
 did last year) but most of the time, if you take the time to interview the 
 candidates up front, it is a good experience for everyone.

count me in


 I'd add it would be useful to have everyone put the lucene-gsoc-11 label on 
 their issues too, that way we can quickly find the Lucene ones.

done on at least one ;)

simon

 Also, feel free to label existing bugs.


 On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote:

 Hey David and all others who want to contribute to GSoC,

 the ASF has applied for GSoC 2011 as a mentoring organization. As an
 ASF project we don't need to apply directly, but we do need to
 register our ideas now. This works, like almost anything in the ASF,
 through JIRA. All ideas should be recorded as JIRA tickets labeled
 with gsoc2011. Once this is done it will show up here:
 http://s.apache.org/gsoc2011tasks
 
 Everybody who is interested in GSoC as a mentor or student should also
 read this: http://community.apache.org/gsoc.html


 Thanks,

 Simon




 On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey
 nemeskey.da...@sztaki.hu wrote:
 Please find the implementation plan attached. The word "soon" gets a new
 meaning when power outages are taken into account. :)

 As before, comments are welcome.

 David

 On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote:
 I think that is good for now. I should get started on codeawards and
 wrap up our proposals. I hope I can do that this week.

 simon

 On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
 Hey,

 I have written the proposal. Please let me know if you want more / less
 of certain parts. Should I upload it somewhere?

 Implementation plan soon to follow.

 Sorry for the late reply; I have been rather busy these past few weeks.

 David

 On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
 Hey David,

 I saw that you added a tiny line to the GSoC Lucene wiki - thanks for
 that.

 On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
 Hi guys,

 Mark, Robert, Simon: thanks for the support! I really hope we can work
 together this summer (and before that, obviously).

 Same here!

 According to http://www.google-
 melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
 there's still some time until the application period. So let me use
 this week to finish my PhD research plan, and get back to you next
 week.

 I am not really familiar with how the program works, i.e. how detailed
 the application description should be, when mentorship is decided,
 etc. so I guess we will have a lot to talk about. :)

 so from a 1ft view it works like this:
 
 1. Write up a short proposal about what your idea is.
 2. Make it public, and publish an implementation plan - how you would
 want to realize your proposal. If you don't follow that 100% in the
 actual implementation, don't worry. It's just meant to give us an idea
 that you know what you are doing and where you want to go - something
 like a rough one-page (A4) design doc.
 3. Give other people the chance to apply for the same suggestion
 (this is how it works, though).
 4. Let the ASF / us assign one or more possible mentors to it.
 5. Let us apply for a slot in GSoC (those are limited per organization).
 6. Get accepted.
 7. Rock it!

 (Actually, should we move this discussion private?)

 no - we usually do everything in public, except for discussions within
 the PMC that are meant to be private for legal or similar reasons.
 Let's stick to the mailing list for all communication unless you have
 something that clearly should not be public. This also gives other
 contributors a chance to help and get interested in your work!!

 simon

 David

 Hi David, honestly this sounds fantastic.

 It would be great to have someone to work with us on this issue!

 To date, progress is pretty slow-going (minor improvements, cleanups,
 additional stats here and there)... but we really need all the help
 we can get, especially from people who have a really good
 understanding of the various models.

 In case you are interested, here are some references to discussions
 about adding more flexibility (with some prototypes etc):
 http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
 https://issues.apache.org/jira/browse/LUCENE-2392

 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
 Hi all,

 I have already sent this mail to Simon Willnauer, and he suggested
 me to post it here for discussion.

 I am David Nemeskey, a PhD student at the Eotvos Lorand University,
 Budapest, Hungary. I am doing IR-related research, and we have
 

Re: GSoC

2011-03-08 Thread Simon Willnauer
Hey David and all others who want to contribute to GSoC,

the ASF has applied for GSoC 2011 as a mentoring organization. As an
ASF project we don't need to apply directly, but we do need to
register our ideas now. This works, like almost anything in the ASF,
through JIRA. All ideas should be recorded as JIRA tickets labeled
with gsoc2011. Once this is done it will show up here:
http://s.apache.org/gsoc2011tasks

Everybody who is interested in GSoC as a mentor or student should also
read this: http://community.apache.org/gsoc.html


Thanks,

Simon




On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Please find the implementation plan attached. The word "soon" gets a new
 meaning when power outages are taken into account. :)

 As before, comments are welcome.

 David

 On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote:
 I think that is good for now. I should get started on codeawards and
 wrap up our proposals. I hope I can do that this week.

 simon

 On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
  Hey,
 
  I have written the proposal. Please let me know if you want more / less
  of certain parts. Should I upload it somewhere?
 
  Implementation plan soon to follow.
 
  Sorry for the late reply; I have been rather busy these past few weeks.
 
  David
 
  On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
  Hey David,
 
  I saw that you added a tiny line to the GSoC Lucene wiki - thanks for
  that.
 
  On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
   Hi guys,
  
   Mark, Robert, Simon: thanks for the support! I really hope we can work
   together this summer (and before that, obviously).
 
  Same here!
 
   According to http://www.google-
   melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
   there's still some time until the application period. So let me use
   this week to finish my PhD research plan, and get back to you next
   week.
  
   I am not really familiar with how the program works, i.e. how detailed
   the application description should be, when mentorship is decided,
   etc. so I guess we will have a lot to talk about. :)
 
  so from a 1ft view it works like this:
  
  1. Write up a short proposal about what your idea is.
  2. Make it public, and publish an implementation plan - how you would
  want to realize your proposal. If you don't follow that 100% in the
  actual implementation, don't worry. It's just meant to give us an idea
  that you know what you are doing and where you want to go - something
  like a rough one-page (A4) design doc.
  3. Give other people the chance to apply for the same suggestion
  (this is how it works, though).
  4. Let the ASF / us assign one or more possible mentors to it.
  5. Let us apply for a slot in GSoC (those are limited per organization).
  6. Get accepted.
  7. Rock it!
 
   (Actually, should we move this discussion private?)
 
  no - we usually do everything in public, except for discussions within
  the PMC that are meant to be private for legal or similar reasons.
  Let's stick to the mailing list for all communication unless you have
  something that clearly should not be public. This also gives other
  contributors a chance to help and get interested in your work!!
 
  simon
 
   David
  
   Hi David, honestly this sounds fantastic.
  
   It would be great to have someone to work with us on this issue!
  
   To date, progress is pretty slow-going (minor improvements, cleanups,
   additional stats here and there)... but we really need all the help
   we can get, especially from people who have a really good
   understanding of the various models.
  
   In case you are interested, here are some references to discussions
   about adding more flexibility (with some prototypes etc):
   http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
   https://issues.apache.org/jira/browse/LUCENE-2392
  
   On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
  
   nemeskey.da...@sztaki.hu wrote:
Hi all,
   
I have already sent this mail to Simon Willnauer, and he suggested
me to post it here for discussion.
   
I am David Nemeskey, a PhD student at the Eotvos Lorand University,
Budapest, Hungary. I am doing IR-related research, and we have
considered using Lucene as our search engine. We were quite
satisfied with the speed and ease of use. However, we would like
to experiment with different ranking algorithms, and this is where
problems arise. Lucene only supports the VSM, and unfortunately
the ranking architecture seems to be tailored specifically to its
needs.
   
I would be very much interested in revamping the ranking component
as a GSoC project. The following modifications should be doable in
the allocated time frame:
- a new ranking class hierarchy, which is generic enough to allow
easy 

Re: GSoC

2011-02-22 Thread Simon Willnauer
I think that is good for now. I should get started on codeawards and
wrap up our proposals. I hope I can do that this week.

simon

On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Hey,

 I have written the proposal. Please let me know if you want more / less of
 certain parts. Should I upload it somewhere?

 Implementation plan soon to follow.

 Sorry for the late reply; I have been rather busy these past few weeks.

 David

 On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
 Hey David,

 I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

 On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
  Hi guys,
 
  Mark, Robert, Simon: thanks for the support! I really hope we can work
  together this summer (and before that, obviously).

 Same here!

  According to http://www.google-
  melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's
  still some time until the application period. So let me use this week to
  finish my PhD research plan, and get back to you next week.
 
  I am not really familiar with how the program works, i.e. how detailed
  the application description should be, when mentorship is decided, etc.
  so I guess we will have a lot to talk about. :)

 so from a 1ft view it works like this:
 
 1. Write up a short proposal about what your idea is.
 2. Make it public, and publish an implementation plan - how you would
 want to realize your proposal. If you don't follow that 100% in the
 actual implementation, don't worry. It's just meant to give us an idea
 that you know what you are doing and where you want to go - something
 like a rough one-page (A4) design doc.
 3. Give other people the chance to apply for the same suggestion
 (this is how it works, though).
 4. Let the ASF / us assign one or more possible mentors to it.
 5. Let us apply for a slot in GSoC (those are limited per organization).
 6. Get accepted.
 7. Rock it!

  (Actually, should we move this discussion private?)

 no - we usually do everything in public, except for discussions within
 the PMC that are meant to be private for legal or similar reasons.
 Let's stick to the mailing list for all communication unless you have
 something that clearly should not be public. This also gives other
 contributors a chance to help and get interested in your work!!

 simon

  David
 
  Hi David, honestly this sounds fantastic.
 
  It would be great to have someone to work with us on this issue!
 
  To date, progress is pretty slow-going (minor improvements, cleanups,
  additional stats here and there)... but we really need all the help we
  can get, especially from people who have a really good understanding
  of the various models.
 
  In case you are interested, here are some references to discussions
  about adding more flexibility (with some prototypes etc):
  http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
  https://issues.apache.org/jira/browse/LUCENE-2392
 
  On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
   Hi all,
  
   I have already sent this mail to Simon Willnauer, and he suggested me
   to post it here for discussion.
  
   I am David Nemeskey, a PhD student at the Eotvos Lorand University,
   Budapest, Hungary. I am doing IR-related research, and we have
   considered using Lucene as our search engine. We were quite satisfied
   with the speed and ease of use. However, we would like to experiment
   with different ranking algorithms, and this is where problems arise.
   Lucene only supports the VSM, and unfortunately the ranking
   architecture seems to be tailored specifically to its needs.
  
   I would be very much interested in revamping the ranking component as
   a GSoC project. The following modifications should be doable in the
   allocated time frame:
   - a new ranking class hierarchy, which is generic enough to allow easy
   implementation of new weighting schemes (at least bag-of-words ones),
   - addition of state-of-the-art ranking methods, such as Okapi BM25,
   proximity and DFR models,
   - configuration for ranking selection, with the old method as default.
  
   I believe all users of Lucene would profit from such a project. It
   would provide the scientific community with an even more useful
   research aid, while regular users could benefit from superior ranking
   results.
  
   Please let me know your opinion about this proposal.
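
For context, the Okapi BM25 weighting named in the proposal above can be sketched in a few lines. This is a minimal, self-contained illustration of the textbook formula (the `k1` and `b` parameter names follow the usual convention), not Lucene code:

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 weight of a single term in a single document."""
    # Inverse document frequency; the +1 inside the log keeps it non-negative
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term frequency, normalized by relative document length
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A term occurring 3 times in an average-length document,
# found in 100 out of 10,000 documents:
score = bm25_score(tf=3, doc_len=100, avg_doc_len=100,
                   num_docs=10_000, doc_freq=100)
```

`k1` controls how quickly repeated occurrences of a term saturate and `b` how strongly long documents are penalized - exactly the kind of knobs a pluggable ranking class hierarchy would expose per weighting scheme.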
 



Re: GSoC

2011-02-22 Thread Fernando Wasylyszyn
This also give other contributors a chance to help and get interested in your 
work!!
I really would love to contribute to this project!

Regards.
Fernando.





De: Simon Willnauer simon.willna...@googlemail.com
Para: dev@lucene.apache.org
CC: David Nemeskey nemeskey.da...@sztaki.hu
Enviado: martes, 22 de febrero, 2011 11:22:57
Asunto: Re: GSoC

I think that is good for now. I should get started on codeawards and
wrap up our proposals. I hope I can do that this week.

simon

On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Hey,

 I have written the proposal. Please let me know if you want more / less of
 certain parts. Should I upload it somewhere?

 Implementation plan soon to follow.

 Sorry for the late reply; I have been rather busy these past few weeks.

 David

 On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote:
 Hey David,

 I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

 On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
  Hi guys,
 
  Mark, Robert, Simon: thanks for the support! I really hope we can work
  together this summer (and before that, obviously).

 Same here!

  According to http://www.google-
  melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's
  still some time until the application period. So let me use this week to
  finish my PhD research plan, and get back to you next week.
 
  I am not really familiar with how the program works, i.e. how detailed
  the application description should be, when mentorship is decided, etc.
  so I guess we will have a lot to talk about. :)

 so from a 1ft view it works like this:
 
 1. Write up a short proposal about what your idea is.
 2. Make it public, and publish an implementation plan - how you would
 want to realize your proposal. If you don't follow that 100% in the
 actual implementation, don't worry. It's just meant to give us an idea
 that you know what you are doing and where you want to go - something
 like a rough one-page (A4) design doc.
 3. Give other people the chance to apply for the same suggestion
 (this is how it works, though).
 4. Let the ASF / us assign one or more possible mentors to it.
 5. Let us apply for a slot in GSoC (those are limited per organization).
 6. Get accepted.
 7. Rock it!

  (Actually, should we move this discussion private?)

 no - we usually do everything in public, except for discussions within
 the PMC that are meant to be private for legal or similar reasons.
 Let's stick to the mailing list for all communication unless you have
 something that clearly should not be public. This also gives other
 contributors a chance to help and get interested in your work!!

 simon

  David
 
  Hi David, honestly this sounds fantastic.
 
  It would be great to have someone to work with us on this issue!
 
  To date, progress is pretty slow-going (minor improvements, cleanups,
  additional stats here and there)... but we really need all the help we
  can get, especially from people who have a really good understanding
  of the various models.
 
  In case you are interested, here are some references to discussions
  about adding more flexibility (with some prototypes etc):
  http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
  https://issues.apache.org/jira/browse/LUCENE-2392
 
  On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
 
  nemeskey.da...@sztaki.hu wrote:
   Hi all,
  
   I have already sent this mail to Simon Willnauer, and he suggested me
   to post it here for discussion.
  
   I am David Nemeskey, a PhD student at the Eotvos Lorand University,
   Budapest, Hungary. I am doing IR-related research, and we have
   considered using Lucene as our search engine. We were quite satisfied
   with the speed and ease of use. However, we would like to experiment
   with different ranking algorithms, and this is where problems arise.
   Lucene only supports the VSM, and unfortunately the ranking
   architecture seems to be tailored specifically to its needs.
  
   I would be very much interested in revamping the ranking component as
   a GSoC project. The following modifications should be doable in the
   allocated time frame:
   - a new ranking class hierarchy, which is generic enough to allow easy
   implementation of new weighting schemes (at least bag-of-words ones),
   - addition of state-of-the-art ranking methods, such as Okapi BM25,
   proximity and DFR models,
   - configuration for ranking selection, with the old method as default.
  
   I believe all users of Lucene would profit from such a project. It
   would provide the scientific community with an even more useful
   research aid, while regular users could benefit from superior ranking
   results.
  
   Please let me know your opinion about this proposal.
 

Re: GSoC

2011-02-02 Thread David Nemeskey
Hi guys,

Mark, Robert, Simon: thanks for the support! I really hope we can work 
together this summer (and before that, obviously).

According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
there's still some time until the application period. So let me use this week to finish
my PhD research plan, and get back to you next week.

I am not really familiar with how the program works, i.e. how detailed the 
application description should be, when mentorship is decided, etc. so I guess 
we will have a lot to talk about. :)

(Actually, should we move this discussion private?)

David

 Hi David, honestly this sounds fantastic.
 
 It would be great to have someone to work with us on this issue!
 
 To date, progress is pretty slow-going (minor improvements, cleanups,
 additional stats here and there)... but we really need all the help we
 can get, especially from people who have a really good understanding
 of the various models.
 
 In case you are interested, here are some references to discussions
 about adding more flexibility (with some prototypes etc):
 http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
 https://issues.apache.org/jira/browse/LUCENE-2392

 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
 
 nemeskey.da...@sztaki.hu wrote:
  Hi all,
  
  I have already sent this mail to Simon Willnauer, and he suggested me to
  post it here for discussion.
  
  I am David Nemeskey, a PhD student at the Eotvos Lorand University,
  Budapest, Hungary. I am doing an IR-related research, and we have
  considered using Lucene as our search engine. We were quite satisfied
  with the speed and ease of use. However, we would like to experiment
  with different ranking algorithms, and this is where problems arise.
  Lucene only supports the VSM, and unfortunately the ranking architecture
  seems to be tailored specifically to its needs.
  
  I would be very much interested in revamping the ranking component as a
  GSoC project. The following modifications should be doable in the
  allocated time frame:
  - a new ranking class hierarchy, which is generic enough to allow easy
  implementation of new weighting schemes (at least bag-of-words ones),
  - addition of state-of-the-art ranking methods, such as Okapi BM25,
  proximity and DFR models,
  - configuration for ranking selection, with the old method as default.
  
  I believe all users of Lucene would profit from such a project. It would
  provide the scientific community with an even more useful research aid,
  while regular users could benefit from superior ranking results.
  
  Please let me know your opinion about this proposal.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-02-02 Thread Simon Willnauer
Hey David,

I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Hi guys,

 Mark, Robert, Simon: thanks for the support! I really hope we can work
 together this summer (and before that, obviously).
Same here!

 According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
 there's still some time until the application period. So let me use this week to finish
 my PhD research plan, and get back to you next week.

 I am not really familiar with how the program works, i.e. how detailed the
 application description should be, when mentorship is decided, etc. so I guess
 we will have a lot to talk about. :)

so from a 1ft view it works like this:

1. Write up a short proposal of what your idea is about.
2. Make it public! And publish an implementation plan - how you would
want to realize your proposal. If you don't follow that 100% in the
actual impl. don't worry. It's just meant to give us an idea that you
know what you are doing and where you want to go. Something like a
rough one-page (A4) design doc.
3. Give other people the chance to apply for the same suggestion (this
is how it works, though).
4. Let the ASF / us assign one or more possible mentors to it.
5. Let us apply for a slot in GSoC (those are limited for organizations).
6. Get accepted.
7. Rock it!


 (Actually, should we move this discussion private?)
no - we usually do everything in public except for discussions within
the PMC that are meant to be private for legal reasons or similar
things. Let's stick to the mailing list for all communication unless
you have something that should clearly not be public. This also gives
other contributors a chance to help and get interested in your work!!

simon

 David

 Hi David, honestly this sounds fantastic.

 It would be great to have someone to work with us on this issue!

 To date, progress is pretty slow-going (minor improvements, cleanups,
 additional stats here and there)... but we really need all the help we
 can get, especially from people who have a really good understanding
 of the various models.

 In case you are interested, here are some references to discussions
 about adding more flexibility (with some prototypes etc):
 http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
 https://issues.apache.org/jira/browse/LUCENE-2392

 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey

 nemeskey.da...@sztaki.hu wrote:
  Hi all,
 
  I have already sent this mail to Simon Willnauer, and he suggested me to
  post it here for discussion.
 
  I am David Nemeskey, a PhD student at the Eotvos Lorand University,
  Budapest, Hungary. I am doing an IR-related research, and we have
  considered using Lucene as our search engine. We were quite satisfied
  with the speed and ease of use. However, we would like to experiment
  with different ranking algorithms, and this is where problems arise.
  Lucene only supports the VSM, and unfortunately the ranking architecture
  seems to be tailored specifically to its needs.
 
  I would be very much interested in revamping the ranking component as a
  GSoC project. The following modifications should be doable in the
  allocated time frame:
  - a new ranking class hierarchy, which is generic enough to allow easy
  implementation of new weighting schemes (at least bag-of-words ones),
  - addition of state-of-the-art ranking methods, such as Okapi BM25,
  proximity and DFR models,
  - configuration for ranking selection, with the old method as default.
 
  I believe all users of Lucene would profit from such a project. It would
  provide the scientific community with an even more useful research aid,
  while regular users could benefit from superior ranking results.
 
  Please let me know your opinion about this proposal.




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-02-02 Thread Grant Ingersoll

On Feb 2, 2011, at 4:10 AM, David Nemeskey wrote:

 Hi guys,
 
 Mark, Robert, Simon: thanks for the support! I really hope we can work 
 together this summer (and before that, obviously).

Sounds like a great idea.  Looking forward to the proposal.

 
 According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline ,
 there's still some time until the application period. So let me use this week to finish
 my PhD research plan, and get back to you next week.
 
 I am not really familiar with how the program works, i.e. how detailed the
 application description should be, when mentorship is decided, etc. so I guess
 we will have a lot to talk about. :)

It's pretty competitive, especially since you are not only competing against 
others for Lucene slots, but you are competing against other ASF projects.  I 
highly recommend you, as well as interested mentors, look through Mahout's past 
GSOC projects: http://www.lucidimagination.com/search/?q=GSOC#/p:mahout and 
http://www.lucidimagination.com/search/document/2acd6fd380feec3/thoughts_on_gsoc
 and https://cwiki.apache.org/confluence/display/MAHOUT/GSOC

 
 (Actually, should we move this discussion private?)

No, you shouldn't and it would be to your detriment come the ranking process 
since people won't have a track record of what you've done as it relates to 
your proposal.  The goal of GSOC is to learn how Open Source works.  Even 
though you have a mentor, that person is there to help you navigate the 
community, not to be a private tutor on technical details.   I routinely tell 
all my students that I will help them w/ personal issues (vacation, 
emergencies, etc.) but that all technical stuff must be done on list (JIRA, 
IRC, dev@, patches, etc.)

 
 David
 
 Hi David, honestly this sounds fantastic.
 
 It would be great to have someone to work with us on this issue!
 
 To date, progress is pretty slow-going (minor improvements, cleanups,
 additional stats here and there)... but we really need all the help we
 can get, especially from people who have a really good understanding
 of the various models.
 
 In case you are interested, here are some references to discussions
 about adding more flexibility (with some prototypes etc):
 http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
 https://issues.apache.org/jira/browse/LUCENE-2392
 
 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
 
 nemeskey.da...@sztaki.hu wrote:
 Hi all,
 
 I have already sent this mail to Simon Willnauer, and he suggested me to
 post it here for discussion.
 
 I am David Nemeskey, a PhD student at the Eotvos Lorand University,
 Budapest, Hungary. I am doing an IR-related research, and we have
 considered using Lucene as our search engine. We were quite satisfied
 with the speed and ease of use. However, we would like to experiment
 with different ranking algorithms, and this is where problems arise.
 Lucene only supports the VSM, and unfortunately the ranking architecture
 seems to be tailored specifically to its needs.
 
 I would be very much interested in revamping the ranking component as a
 GSoC project. The following modifications should be doable in the
 allocated time frame:
 - a new ranking class hierarchy, which is generic enough to allow easy
 implementation of new weighting schemes (at least bag-of-words ones),
 - addition of state-of-the-art ranking methods, such as Okapi BM25,
 proximity and DFR models,
 - configuration for ranking selection, with the old method as default.
 
 I believe all users of Lucene would profit from such a project. It would
 provide the scientific community with an even more useful research aid,
 while regular users could benefit from superior ranking results.
 
 Please let me know your opinion about this proposal.
 
 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-01-28 Thread Mark Miller
+1 the proposal. We already have a committer digging into this area - he would 
make a perfect GSoC mentor! And would likely love the help.

His response likely to follow...

- Mark

On Jan 28, 2011, at 11:32 AM, David Nemeskey wrote:

 Hi all,
 
 I have already sent this mail to Simon Willnauer, and he suggested me to post 
 it here for discussion.
 
 I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest,
 Hungary. I am doing an IR-related research, and we have considered using
 Lucene as our search engine. We were quite satisfied with the speed and ease of
 use. However, we would like to experiment with different ranking algorithms,
 and this is where problems arise. Lucene only supports the VSM, and
 unfortunately the ranking architecture seems to be tailored specifically to its
 needs.
 
 I would be very much interested in revamping the ranking component as a GSoC 
 project. The following modifications should be doable in the allocated time 
 frame:
 - a new ranking class hierarchy, which is generic enough to allow easy 
 implementation of new weighting schemes (at least bag-of-words ones),
 - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity 
 and DFR models,
 - configuration for ranking selection, with the old method as default.
 
 I believe all users of Lucene would profit from such a project. It would 
 provide the scientific community with an even more useful research aid, while 
 regular users could benefit from superior ranking results.
 
 Please let me know your opinion about this proposal.
 
 Thank you very much,
 David Nemeskey
 
 

- Mark Miller
lucidimagination.com





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-01-28 Thread Simon Willnauer
On Fri, Jan 28, 2011 at 5:42 PM, Mark Miller markrmil...@gmail.com wrote:
 +1 the proposal. We already have a committer digging into this area - he 
 would make a perfect GSoC mentor! And would likely love the help.

same here +1 - if there is mentoring needed I will be there too.
Robert, I already recommended you when David contacted me in the first
place :)

it's all yours :)

simon

 His response likely to follow...

 - Mark

 On Jan 28, 2011, at 11:32 AM, David Nemeskey wrote:

 Hi all,

 I have already sent this mail to Simon Willnauer, and he suggested me to post
 it here for discussion.

 I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest,
 Hungary. I am doing an IR-related research, and we have considered using
 Lucene as our search engine. We were quite satisfied with the speed and ease of
 use. However, we would like to experiment with different ranking algorithms,
 and this is where problems arise. Lucene only supports the VSM, and
 unfortunately the ranking architecture seems to be tailored specifically to its
 needs.

 I would be very much interested in revamping the ranking component as a GSoC
 project. The following modifications should be doable in the allocated time
 frame:
 - a new ranking class hierarchy, which is generic enough to allow easy
 implementation of new weighting schemes (at least bag-of-words ones),
 - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity
 and DFR models,
 - configuration for ranking selection, with the old method as default.

 I believe all users of Lucene would profit from such a project. It would
 provide the scientific community with an even more useful research aid, while
 regular users could benefit from superior ranking results.

 Please let me know your opinion about this proposal.

 Thank you very much,
 David Nemeskey



 - Mark Miller
 lucidimagination.com








-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-01-28 Thread Robert Muir
On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
nemeskey.da...@sztaki.hu wrote:
 Hi all,

 I have already sent this mail to Simon Willnauer, and he suggested me to post
 it here for discussion.

 I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest,
 Hungary. I am doing an IR-related research, and we have considered using
 Lucene as our search engine. We were quite satisfied with the speed and ease of
 use. However, we would like to experiment with different ranking algorithms,
 and this is where problems arise. Lucene only supports the VSM, and
 unfortunately the ranking architecture seems to be tailored specifically to its
 needs.

 I would be very much interested in revamping the ranking component as a GSoC
 project. The following modifications should be doable in the allocated time
 frame:
 - a new ranking class hierarchy, which is generic enough to allow easy
 implementation of new weighting schemes (at least bag-of-words ones),
 - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity
 and DFR models,
 - configuration for ranking selection, with the old method as default.

 I believe all users of Lucene would profit from such a project. It would
 provide the scientific community with an even more useful research aid, while
 regular users could benefit from superior ranking results.

 Please let me know your opinion about this proposal.


Hi David, honestly this sounds fantastic.

It would be great to have someone to work with us on this issue!

To date, progress is pretty slow-going (minor improvements, cleanups,
additional stats here and there)... but we really need all the help we
can get, especially from people who have a really good understanding
of the various models.

In case you are interested, here are some references to discussions
about adding more flexibility (with some prototypes etc):
http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
https://issues.apache.org/jira/browse/LUCENE-2392

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [GSOC] Congrats to all students

2010-04-27 Thread Richard Simon Just
Thanks guys! So happy to get it, and really excited that Mahout got 5 slots.

@Robin: I'm totally up for a shared blog, was planning on blogging about
it anyway.


Robin Anil wrote:
 Congrats everyone. And a special thanks to Benson for helping us get the
 slots to 5 this year :)

 For students that do not get accepted into Google Summer of Code and are
 still ready to work on their proposal: the ASF has a formalized process by
 which you can work on it if you get a willing mentor from the community. It
 will be a great learning experience and you will get a certification on
 successful completion from Apache. Do take a look. Also, it's open for
 everyone, not just students.
 http://community.apache.org/mentoringprogramme.html

 @FamousFive(The selected students :P) Would you guys be interested in
 keeping track of your experiences via a shared blog. I am thinking of
 setting up one for Mahout along with the website change.

 Robin
 Congrats again.

   


Re: [GSOC] Congrats to all students

2010-04-27 Thread zhao zhendong
Thanks everyone! I am so excited to be accepted, and I will do my best to
finish my proposal in time.

A shared blog sounds great to me. GSoC feels like a training programme; we
should share the experience with everyone interested in the Mahout project.

Cheers,
Zhendong

On Tue, Apr 27, 2010 at 3:22 PM, Robin Anil robin.a...@gmail.com wrote:

 Congrats everyone.And a special thanks to Benson for helping us get the
 slots to 5 this year :)

 For students that do not get accepted into Google Summer of Code and still
 ready to work on your proposal. ASF has a formalized process by which you
 can work on it if you get a willing mentor from the community. It will be a
 great learning experience and you will get a certification on successful
 completion from Apache. Do take a look. Also its open for everyone not just
 for students.
 http://community.apache.org/mentoringprogramme.html

 @FamousFive(The selected students :P) Would you guys be interested in
 keeping track of your experiences via a shared blog. I am thinking of
 setting up one for Mahout along with the website change.

 Robin
 Congrats again.




-- 
-

Zhen-Dong Zhao (Maxim)



Department of Computer Science
School of Computing
National University of Singapore




Re: [GSOC] Congrats to all students

2010-04-27 Thread Sisir Koppaka
+1 for shared blog!


Re: [GSOC] Congrats to all students

2010-04-27 Thread Zaid Md Abdul Wahab Sheikh
Thanks. It's great to finally have the chance to be a part of Apache
Mahout. Congratulations to everyone who got selected!

+1 for the shared blog idea!





On Tue, Apr 27, 2010 at 12:52 PM, Robin Anil robin.a...@gmail.com wrote:
 Congrats everyone.And a special thanks to Benson for helping us get the
 slots to 5 this year :)

 For students that do not get accepted into Google Summer of Code and still
 ready to work on your proposal. ASF has a formalized process by which you
 can work on it if you get a willing mentor from the community. It will be a
 great learning experience and you will get a certification on successful
 completion from Apache. Do take a look. Also its open for everyone not just
 for students.
 http://community.apache.org/mentoringprogramme.html

 @FamousFive(The selected students :P) Would you guys be interested in
 keeping track of your experiences via a shared blog. I am thinking of
 setting up one for Mahout along with the website change.

 Robin
 Congrats again.




-- 
Zaid Md. Abdul Wahab Sheikh
Senior Undergraduate
B.Tech Computer Science and Engineering
NIT Allahabad (MNNIT)


Re: [GSOC] Congrats to all students

2010-04-26 Thread Sisir Koppaka
Thanks everyone!

This is a fantastic opportunity, and I'll try to make the best of this for
myself, as well as Mahout. Hopefully, we'll have a great compilation of deep
learning networks within the next few releases.

BTW, congrats to everyone on Mahout becoming a TLP!

On Tue, Apr 27, 2010 at 1:13 AM, Grant Ingersoll gsing...@apache.org wrote:

 Looks like student GSOC announcements are up (
 http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010).
  Mahout got quite a few projects (5) accepted this year, which is a true
 credit to the ASF, Mahout, the mentors, and most of all the students!  We
 had a good number of very high quality student proposals for Mahout this
 year and it was very difficult to choose.  Of the ones selected, I think
 they all bode well for the future of Mahout and the students.

 For those who didn't make the cut, I know it's small consolation, but I
 would encourage you all to stay involved in open source, if not Mahout
 specifically.  We'd certainly love to see you contributing here as many of
 you had very good ideas.

 At any rate, for everyone, keep an eye out on the Mahout project, as you
 should be seeing lots of exciting features coming to Mahout soon in the form
 of scalable Neural Networks, Restricted Boltzmann Machines (recommenders),
 SVD-based recommenders, EigenCuts Spectral Clustering and Support Vector
 Machines (SVM)!

 Should be an exciting summer!

 -Grant




-- 
SK


Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost

Timeline including Apache internal deadlines:

http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC

Mentors, please also follow the link to the ranking explanation [1]
for more information on how to rank student proposals.

Isabel

[1] 
http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process


signature.asc
Description: This is a digitally signed message part.


Re: [GSOC] Wiki Page Added

2010-03-31 Thread zhao zhendong
Hi Grant,

Could you please give us the link of this page?

Cheers,
Zhendong

On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.org wrote:

 I created a Wiki page on GSOC.  I hope everyone considering GSOC reads it.
  Mentors, please add as you see fit.  It would be good to get a Mahout FAQ
 going, too.  Perhaps Robin, Deneche and David would consider adding their
 past year proposals up there as examples, too.

 Cheers,
 Grant




-- 
-

Zhen-Dong Zhao (Maxim)



Department of Computer Science
School of Computing
National University of Singapore




Re: [GSOC] Wiki Page Added

2010-03-31 Thread Grant Ingersoll
D'oh!  My bad: http://cwiki.apache.org/MAHOUT/gsoc.html.  It's linked from the 
front wiki page under community.

-Grant

On Mar 31, 2010, at 9:11 AM, zhao zhendong wrote:

 Hi Grant,
 
 Could you please give us the link of this page?
 
 Cheers,
 Zhendong
 
 On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 I created a Wiki page on GSOC.  I hope everyone considering GSOC reads it.
 Mentors, please add as you see fit.  Would be good to get a Mahout FAQ
 going to.  Perhaps, Robin, Deneche and David would consider adding their
 past year proposals up there as examples, too.
 
 Cheers,
 Grant
 
 
 
 
 -- 
 -
 
 Zhen-Dong Zhao (Maxim)
 
 
 
 Department of Computer Science
 School of Computing
 National University of Singapore
 
 




Re: [GSOC] Wiki Page Added

2010-03-31 Thread zhao zhendong
Ha, thanks.

On Wed, Mar 31, 2010 at 9:29 PM, Grant Ingersoll gsing...@apache.org wrote:

 D'oh!  My bad: http://cwiki.apache.org/MAHOUT/gsoc.html.  It's linked from
 the front wiki page under community.

 -Grant

 On Mar 31, 2010, at 9:11 AM, zhao zhendong wrote:

  Hi Grant,
 
  Could you please give us the link of this page?
 
  Cheers,
  Zhendong
 
  On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
  I created a Wiki page on GSOC.  I hope everyone considering GSOC reads
 it.
  Mentors, please add as you see fit.  Would be good to get a Mahout FAQ
  going to.  Perhaps, Robin, Deneche and David would consider adding their
  past year proposals up there as examples, too.
 
  Cheers,
  Grant
 
 
 
 
  --
  -
 
  Zhen-Dong Zhao (Maxim)
 
  
 
  Department of Computer Science
  School of Computing
  National University of Singapore
 
  





-- 
-

Zhen-Dong Zhao (Maxim)



Department of Computer Science
School of Computing
National University of Singapore




Re: GSOC 2010

2010-03-31 Thread Robin Anil
Hi Tanya,
 MAHOUT-328 is just a general stub. There is no detailed project
description other than what is given there. The idea is we let you propose
to implement a clustering algorithm in Mahout. Start here
http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the Wiki. Look at
what mahout has at the moment http://cwiki.apache.org/MAHOUT/algorithms.html.
There are a couple of algorithms missing from Mahout, like min-hash,
hierarchical clustering, or even a generic EM framework. I would suggest you
read carefully through the discussions on the mailing list using the
archives, zero in on the algorithm you would want to implement, and
then propose to implement it.

Robin
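Of the missing algorithms Robin mentions, min-hash is compact enough to sketch inline. The following is a hypothetical, self-contained illustration of the core idea only (random linear hash functions, a signature made of per-hash minima, and the fraction of matching signature slots as a Jaccard-similarity estimate); it is not Mahout code, and the class and parameter names are invented for the example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashDemo {
    // Large prime modulus for the family of hash functions h(x) = (a*x + b) % PRIME.
    static final int PRIME = 2147483647;

    // One min-hash value per (a[i], b[i]) pair; assumes non-negative item ids.
    static int[] signature(Set<Integer> items, int[] a, int[] b) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int x : items)
            for (int i = 0; i < a.length; i++) {
                int h = (int) (((long) a[i] * x + b[i]) % PRIME);
                if (h < sig[i]) sig[i] = h;
            }
        return sig;
    }

    // Fraction of matching signature slots estimates Jaccard similarity.
    static double estimateJaccard(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 200; // more hash functions -> lower estimation variance
        int[] a = new int[n], b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextInt(PRIME - 1) + 1;
            b[i] = rnd.nextInt(PRIME);
        }

        Set<Integer> s1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> s2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 9, 10));
        // True Jaccard = 6 shared / 10 total = 0.6; the estimate should be close.
        double est = estimateJaccard(signature(s1, a, b), signature(s2, a, b));
        System.out.println(Math.abs(est - 0.6) < 0.2);
    }
}
```

The appeal for a map-reduce setting is that signatures are tiny and mergeable, so candidate near-duplicate pairs can be found without comparing full item sets.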


On Wed, Mar 31, 2010 at 10:27 PM, Tanya Gupta gtany...@gmail.com wrote:

 Hi

 I would like a detailed project description for MAHOUT-328.

 Thanking You
 Tanya Gupta



Re: GSOC 2010 is here

2010-02-02 Thread Isabel Drost
On Mon Robin Anil robin.a...@gmail.com wrote:
 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks
 are taking in GSOC students)

I guess one could easily split this one in two:

a) Using UIMA (whole pipeline or just the analysers if that is possible)
for data pre-processing before Mahout algorithms are run.

b) Making it easy to integrate Mahout algorithms (classification models
etc.) as UIMA annotators.

Isabel


Re: GSOC 2010 is here

2010-02-01 Thread Isabel Drost
On Wed Robin Anil robin.a...@gmail.com wrote:
 Greetings! Fellow GSOC alums, administrators and dear mentors, the
 next edition is right here. Details are given in the link below.
 
 https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f

Some additional notes to committers: 

First of all mentoring a GSoC student is a great experience, so if
you do have some cycles left, I would highly recommend participating in
GSoC as a mentor (thanks Grant for convincing me last year...).

We had several successful students here at Mahout in past GSoC years.
Each year there were strong proposals for projects within Mahout. As a
results projects usually turn out to be interesting for both, mentor
and student.

One final note: If there is anyone on this list who might be interested
in helping with general ASF GSoC logistics and administration tasks,
please have a look at the newly founded community development project
(d...@community.apache.org)

 
 Maybe we could identify key areas in Mahout which we need to develop
 apart from the ML implementations and list it down for students to
 see before they start trickling in.

And motivate students to come up with their own ideas and discuss them
on-list before submitting their submission.


 Some ideas:
 Benchmarking Framework with EC2 wrappers

+1 I would love to see that.


 Commandline Console+Launcher like Hbase and hadoop

+1


 Online Tool/Query UI for Algorithms in Mahout(like CF)
 
 
 Possible ideas(I have no idea what i am talking here but there are
 nice problems to solve)
 Improvements in Math?
 How to tackle management of datasets?
 Error Recovery if a job fails?

How to tackle management of learned classification models?

Better tooling for Mahout integration? (Lucene for tokenization and
analysers?, data import and export?)



Isabel


Re: GSOC 2010 is here

2010-02-01 Thread Robin Anil
Some more Wild and Wacky Ideas. They might be out of scope for GSOC, but they
are nice-to-have features for Mahout. I would like to encourage all of you to
put down your ideas here.

1. Data visualization tool backed with HDFS/HBase for inspecting clusters,
topic models, etc.
  - It could have many map/reduce jobs which transform the clustering
output, aggregate things, and produce interesting stats or visualizations of
the data
2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are
taking in GSOC students)



Robin




On Mon, Feb 1, 2010 at 6:17 PM, Isabel Drost isa...@apache.org wrote:

 On Wed Robin Anil robin.a...@gmail.com wrote:
  Greetings! Fellow GSOC alums, administrators and dear mentors, the
  next edition is right here. Details are given in the link below.
 
 
 https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f

 Some additional notes to committers:

 First of all mentoring a GSoC student is a great experience, so if
 you do have some cycles left, I would highly recommend participating in
 GSoC as a mentor (thanks Grant for convincing myself last year...).

 We had several successful students here at Mahout in past GSoC years.
 Each year there were strong proposals for projects within Mahout. As a
 results projects usually turn out to be interesting for both, mentor
 and student.

 One final note: If there is anyone on this list who might be interested
 in helping with general ASF GSoC logistics and administration tasks,
 please have a look at the newly founded community development project
 (d...@community.apache.org)


  Maybe we could identify key areas in Mahout which we need to develop
  apart from the ML implementations and list it down for students to
  see before they start trickling in.

 And motivate students to come up with their own ideas and discuss them
 on-list before submitting their submission.


  Some ideas:
  Benchmarking Framework with EC2 wrappers

 +1 I would love to see that.


  Commandline Console+Launcher like Hbase and hadoop

 +1


  Online Tool/Query UI for Algorithms in Mahout(like CF)
 
 
  Possible ideas(I have no idea what i am talking here but there are
  nice problems to solve)
  Improvements in Math?
  How to tackle management of datasets?
  Error Recovery if a job fails?

 How to tackle managment of learned classification models?

 Better tooling for Mahout integration? (Lucene for tokenization and
 analysers?, data import and export?)



 Isabel



Re : [GSOC] Code Submissions

2009-09-08 Thread deneche abdelhakim
done.

--- On Tue, 8.9.09, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: [GSOC] Code Submissions
 To: Mahout Dev List mahout-dev@lucene.apache.org
 Date: Tuesday, September 8, 2009, 13:09
 Hi Robin, David and Deneche,
 
 You will need to submit code samples.  Please see 
 http://groups.google.com/group/google-summer-of-code-announce/web/how-to-provide-google-with-sample-code
 
 -Grant
 






Re: Re : [GSOC] July 6 is mid-term evaluations

2009-07-07 Thread Ted Dunning
I filled out one for Deneche.

On Tue, Jul 7, 2009 at 9:32 AM, deneche abdelhakim a_dene...@yahoo.frwrote:


 The students' mid-term survey is available online. I'm posting this because
 I almost forgot it =P

 --- On Wed, 6/17/09, Grant Ingersoll gsing...@apache.org wrote:

  From: Grant Ingersoll gsing...@apache.org
  Subject: [GSOC] July 6 is mid-term evaluations
  To: mahout-dev@lucene.apache.org
  Date: Wednesday, 17 June 2009, 3:54 PM
  Just a reminder to GSOC students that
  July 6 is mid-term evaluation.
 
 
 http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline
 






-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)


Re: [GSOC] July 6 is mid-term evaluations

2009-07-07 Thread Isabel Drost
On Tuesday 07 July 2009 20:34:09 Ted Dunning wrote:
 I filled out one for Deneche.

I submitted the one for Robin yesterday evening.

Isabel


-- 
QOTD: Products developed for every kind of idiot * Printed on the bottom of a
Tesco tiramisu dessert: ``Do not turn upside down.''
Web: http://www.isabel-drost.de
IM:  xmpp://main...@spaceboyz.net





Re: [GSOC] Thoughts about Random forests map-reduce implementation

2009-06-18 Thread Ted Dunning
Very similar, but I was talking about building trees on each split of the
data (a la map reduce split).

That would give many small splits and would thus give very different results
from bagging because the splits would be small and contiguous rather than
large and random.


On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim a_dene...@yahoo.frwrote:

 build multiple trees for different portions of the data

 What's the difference with the basic bagging algorithm, which builds 'each
 tree' using a different portion (about 2/3) of the data ?
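
The contrast Ted draws here, bagging's bootstrap samples versus small contiguous
map-reduce splits, can be sketched in a few lines (an illustrative Python sketch,
not Mahout code; all names are made up):

```python
import random

def bootstrap_sample(data, rng):
    # Bagging: each tree trains on a sample drawn WITH replacement from
    # the whole dataset (about 63% unique records on average).
    return [data[rng.randrange(len(data))] for _ in range(len(data))]

def contiguous_splits(data, n_mappers):
    # Map-reduce style: each tree sees only one small contiguous block
    # and never observes records outside it.
    size = len(data) // n_mappers
    return [data[i * size:(i + 1) * size] for i in range(n_mappers)]

rng = random.Random(42)
data = list(range(1000))

bag = bootstrap_sample(data, rng)
blocks = contiguous_splits(data, 10)

# A bootstrap sample spans the whole range of record ids; a contiguous
# split covers only its own block.
print(min(bag), max(bag))              # typically close to 0 and 999
print(min(blocks[0]), max(blocks[0]))  # exactly 0 and 99
```

This is why per-split trees behave differently from bagged trees: each split is
both small and biased toward whatever locality the input ordering carries.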


Re: [GSOC] GSOC Start time nearing

2009-05-14 Thread Isabel Drost
On Tuesday 12 May 2009 19:50:21 Grant Ingersoll wrote:
 http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline

 May 23.  Hope all of our students and mentors are ready to go.

I certainly am*.

Isabel

* Might be a bit distracted on that exact day though: It's my birthday ;)





Re: [GSOC] Accepted Students

2009-04-23 Thread Grant Ingersoll
It's also helpful to get yourself a Wiki account and a JIRA account if  
you don't already have them.  Small patches to the existing docs/code  
can also help you figure out the process.



On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote:


On Tuesday 21 April 2009 08:30:34 David Hall wrote:

As for questions, what am I supposed to be reading during this
community building period? I see:

* http://cwiki.apache.org/MAHOUT/howtocontribute.html
* http://www.apache.org/foundation/how-it-works.html

plus skimming javadocs.


These are certainly of interest.

In addition you can check out and have a look at the code. Try to get a rough
idea of where your contribution would fit best. Please share your ideas with
the community to get feedback early on.



Re: [GSOC] Accepted Students

2009-04-23 Thread David Hall
Thanks everyone!

-- David

On Thu, Apr 23, 2009 at 12:53 PM, Grant Ingersoll gsing...@apache.org wrote:
 It's also helpful to get yourself a Wiki account and a JIRA account if you
 don't already have them.  Small patches to the existing docs/code can also
 help you figure out the process


 On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote:

 On Tuesday 21 April 2009 08:30:34 David Hall wrote:

 As for questions, what am I supposed to be reading during this
 community building period? I see:

 * http://cwiki.apache.org/MAHOUT/howtocontribute.html
 * http://www.apache.org/foundation/how-it-works.html

 plus skimming javadocs.

 These are certainly of interest.

 In addition you can checkout and have a look at the code. Try to get a
 rough
 idea of where your contribution would fit best. Please share your ideas
 with
 the community to get feedback early on.




Re: [GSOC] Accepted Students

2009-04-21 Thread deneche abdelhakim

Hi David, welcome to Mahout =)

The How To Contribute Wiki page is a must read; it gives you a quick overview 
of all you'll need to do when contributing to Mahout.

In my own experience you'll also need to:
* know how to build the latest version of Mahout:

http://cwiki.apache.org/MAHOUT/buildingmahout.html

although, depending on your project you may skip the Taste Web part if you're 
not working with Taste.

* know how to run an example in Hadoop, at least in pseudo-distributed:

http://hadoop.apache.org/core/docs/current/quickstart.html

--- On Tue, 4/21/09, David Hall d...@cs.stanford.edu wrote:

 From: David Hall d...@cs.stanford.edu
 Subject: Re: [GSOC] Accepted Students
 To: mahout-dev@lucene.apache.org
 Date: Tuesday, 21 April 2009, 8:30 AM
 On Mon, Apr 20, 2009 at 11:18 PM,
 deneche abdelhakim a_dene...@yahoo.fr
 wrote:
 
  Hi,
 
  =D
 
  I've been accepted. And I'll be working on Random
 Forests
 
  =P
 
   Given it's my second participation, I have one piece of advice:
  don't be shy to ask about anything related to your project
  on this list (starting from now); it's the fastest way to
  learn about Mahout.
 
  Who else has been accepted ?
 
 I'm here. I'll be working on Latent Dirichlet Allocation.
 
 As for questions, what am I supposed to be reading during
 this
 community building period? I see:
 
 * http://cwiki.apache.org/MAHOUT/howtocontribute.html
 * http://www.apache.org/foundation/how-it-works.html
 
 plus skimming javadocs.
 
 Other suggestions? Either general, or more specific to my
 project?
 
 -- David
 
 
  -
  abdelhakim
 
 
 
 
 





Re: [GSOC] Accepted Students

2009-04-21 Thread Joe Kumar
Deneche / David / Robin,

Congrats on getting selected for Mahout project.
Have fun coding...

Best regards,
Joe.

On Tue, Apr 21, 2009 at 7:53 AM, Robin Anil robin.a...@gmail.com wrote:

  Hi, seems like I am the last one to know :) Hoping for a great Summer
 of
 Code ahead.

 Robin

 PS: Trying hard to survive a heatwave of 45C
 http://www.iitkgp.ac.in/topfiles/wgraph.php

 On Tue, Apr 21, 2009 at 1:51 PM, deneche abdelhakim a_dene...@yahoo.fr
 wrote:

 
   Hi David, welcome to Mahout =)
  
   The How To Contribute Wiki page is a must read; it gives you a quick
   overview of all you'll need to do when contributing to Mahout.
 
  In my own experience you'll also need to:
  * know how to build the latest version of Mahout:
 
  http://cwiki.apache.org/MAHOUT/buildingmahout.html
 
  although, depending on your project you may skip the Taste Web part if
  you're not working with Taste.
 
  * know how to run an example in Hadoop, at least in pseudo-distributed:
 
  http://hadoop.apache.org/core/docs/current/quickstart.html
 
   --- On Tue, 4/21/09, David Hall d...@cs.stanford.edu wrote:
  
    From: David Hall d...@cs.stanford.edu
    Subject: Re: [GSOC] Accepted Students
    To: mahout-dev@lucene.apache.org
    Date: Tuesday, 21 April 2009, 8:30 AM
   On Mon, Apr 20, 2009 at 11:18 PM,
   deneche abdelhakim a_dene...@yahoo.fr
   wrote:
   
Hi,
   
=D
   
I've been accepted. And I'll be working on Random
   Forests
   
=P
   
 Given it's my second participation, I have one piece of advice:
    don't be shy to ask about anything related to your project
    on this list (starting from now); it's the fastest way to
    learn about Mahout.
   
Who else has been accepted ?
  
   I'm here. I'll be working on Latent Dirichlet Allocation.
  
   As for questions, what am I supposed to be reading during
   this
   community building period? I see:
  
   * http://cwiki.apache.org/MAHOUT/howtocontribute.html
   * http://www.apache.org/foundation/how-it-works.html
  
   plus skimming javadocs.
  
   Other suggestions? Either general, or more specific to my
   project?
  
   -- David
  
   
-
abdelhakim
   
   
   
   
  
 
 
 
 



Re: [GSOC] Accepted Students

2009-04-21 Thread Robin Anil
Hi, seems like I am the last one to know :) Hoping for a great Summer of
Code ahead.

Robin

PS: Trying hard to survive a heatwave of 45C
http://www.iitkgp.ac.in/topfiles/wgraph.php

On Tue, Apr 21, 2009 at 1:51 PM, deneche abdelhakim a_dene...@yahoo.frwrote:


  Hi David, welcome to Mahout =)

  The How To Contribute Wiki page is a must read; it gives you a quick
  overview of all you'll need to do when contributing to Mahout.

 In my own experience you'll also need to:
 * know how to build the latest version of Mahout:

 http://cwiki.apache.org/MAHOUT/buildingmahout.html

 although, depending on your project you may skip the Taste Web part if
 you're not working with Taste.

 * know how to run an example in Hadoop, at least in pseudo-distributed:

 http://hadoop.apache.org/core/docs/current/quickstart.html

  --- On Tue, 4/21/09, David Hall d...@cs.stanford.edu wrote:

   From: David Hall d...@cs.stanford.edu
   Subject: Re: [GSOC] Accepted Students
   To: mahout-dev@lucene.apache.org
   Date: Tuesday, 21 April 2009, 8:30 AM
  On Mon, Apr 20, 2009 at 11:18 PM,
  deneche abdelhakim a_dene...@yahoo.fr
  wrote:
  
   Hi,
  
   =D
  
   I've been accepted. And I'll be working on Random
  Forests
  
   =P
  
    Given it's my second participation, I have one piece of advice:
   don't be shy to ask about anything related to your project
   on this list (starting from now); it's the fastest way to
   learn about Mahout.
  
   Who else has been accepted ?
 
  I'm here. I'll be working on Latent Dirichlet Allocation.
 
  As for questions, what am I supposed to be reading during
  this
  community building period? I see:
 
  * http://cwiki.apache.org/MAHOUT/howtocontribute.html
  * http://www.apache.org/foundation/how-it-works.html
 
  plus skimming javadocs.
 
  Other suggestions? Either general, or more specific to my
  project?
 
  -- David
 
  
   -
   abdelhakim
  
  
  
  
 






Re: [GSOC] Accepted Students

2009-04-21 Thread Isabel Drost
On Tuesday 21 April 2009 08:30:34 David Hall wrote:
 As for questions, what am I supposed to be reading during this
 community building period? I see:

 * http://cwiki.apache.org/MAHOUT/howtocontribute.html
 * http://www.apache.org/foundation/how-it-works.html

 plus skimming javadocs.

These are certainly of interest.

In addition you can check out and have a look at the code. Try to get a rough 
idea of where your contribution would fit best. Please share your ideas with 
the community to get feedback early on.

Isabel


Re: gsoc , EM or SVM?

2009-04-02 Thread Yifan Wang
Hi

I decided to go with the mixture model for EM.
I have modified my proposal and submitted it both on the GSoC website and the Apache wiki.

Best Regards
Yifan

2009/4/1 Yifan Wang heavens...@gmail.com:
 I will choose Mixture Model for the EM implementation.

 Yifan

 2009/4/1 Ted Dunning ted.dunn...@gmail.com:
 Yifan,

 EM is a highly non-specific term and covers a huge range of very different
 algorithms.  For example, pLSI, HMM's, and mixture models can all be
 estimated using EM.

 What exactly did you mean to address with an EM implementation?

 On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hi Yifan,

 I think both are good candidates, although AIUI, SVM is a bit harder to
 parallelize, so maybe it would make sense to focus on EM.  Of course, we
 don't have to be distributed, so you could propose a non-distributed SVM
 implementation as a first cut and then work on the distributed part as the
 project develops.

 ...


 For EM, it is a generalization of the k-means algorithm, and we already
 have
 k-means in the Mahout library.
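
The remark just above, that EM generalizes k-means, can be made concrete: in a
Gaussian-mixture EM loop the E-step computes soft responsibilities where k-means
would make a hard nearest-centroid assignment. A minimal 1-D sketch (illustrative
Python, not Mahout code; data and parameters are made up):

```python
import math
import random

random.seed(0)
# Two well-separated 1-D clusters (made-up data).
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(8.0, 1.0) for _ in range(200)])

mu = [min(data), max(data)]   # crude initial means
pi = [0.5, 0.5]               # mixing weights
SIGMA = 1.0                   # shared, fixed variance to keep the sketch short

def gauss_pdf(x, m, s):
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

for _ in range(20):
    # E-step: soft responsibilities; k-means would take an argmax here.
    resp = []
    for x in data:
        w = [pi[k] * gauss_pdf(x, mu[k], SIGMA) for k in range(2)]
        z = sum(w)
        resp.append([wk / z for wk in w])
    # M-step: re-estimate means and mixing weights from the responsibilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        pi[k] = nk / len(data)

print(sorted(mu))  # means end up near 0 and 8
```

Replacing the soft responsibilities with hard 0/1 assignments (and dropping the
mixing weights) recovers exactly Lloyd's k-means iteration.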






Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Ted Dunning
I would hope that your SVD implementation would not be limited to NetFlix
like problems, but would be applicable to any reasonably sparse matrix-like
data.

Likewise, I would expect a good SVD implementation to be useful for nearest
neighbor methods or direct prediction by smoothing the history vector.

On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni atulskulka...@gmail.comwrote:

 I have worked with the Netflix Prize problem and hence most of my suggested
 algorithms revolve around that problem. But I am open to other algorithms
 that might be out there. Is this a good thing to do?
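
A truncated SVD of the general kind Ted describes, applicable to any sparse
matrix-like data and usable for smoothing a history vector, can be sketched with
plain numpy (illustrative only; a Mahout implementation would be a distributed
Java job, and the shapes and density here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse matrix-like data: mostly zeros, as in ratings or word counts
# (made-up shapes and density; not tied to Netflix).
mask = rng.random((50, 30)) < 0.1
A = rng.random((50, 30)) * mask

# Full SVD, then keep the top-k singular triples: a truncated SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]

# The rank-k model "smooths" a single row's sparse history vector; the
# smoothed vector can feed nearest-neighbour search or direct prediction.
smoothed_row0 = (Uk[0] * sk) @ Vtk

print(smoothed_row0.shape)  # prints (30,)
```

On genuinely large data one would compute only the top-k factors with an
iterative method rather than a full dense SVD; the smoothing step is unchanged.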




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Atul Kulkarni
On Wed, Apr 1, 2009 at 1:30 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 I would hope that your SVD implementation would not be limited to NetFlix
 like problems, but would be applicable to any reasonably sparse matrix-like
 data.

Yes, of course. It would apply to any large sparse matrix implementation.


 Likewise, I would expect a good SVD implementation to be useful for nearest
 neighbor methods or direct prediction by smoothing the history vector.

I do not have knowledge about this as of now, will read up and comment.


 On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni atulskulka...@gmail.com
 wrote:

  I have worked with Netflix Prize problem and hence most of my suggested
  algorithms revolve around that problem. But I am open to other algorithms
  that might be out there. Is this a good thing to do?
 



 --
 Ted Dunning, CTO
 DeepDyve




-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.
www.d.umn.edu/~kulka053


Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Atul Kulkarni
Thanks David, that helped.


On Wed, Apr 1, 2009 at 1:47 AM, David Hall d...@cs.stanford.edu wrote:

 On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni atulskulka...@gmail.com
 wrote:
  questions in line.
 
  On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  Nobody is working on SVD yet, but one GSOC applicant has said that they
  would like to work on LDA which is a probabilistic relative of SVD.
 
  I do not understand the relation between LDA and SVD. In my limited
  understanding, LDA transforms data points into a coordinate system where
  they can be easily discriminated/classified. SVD, on the other hand, is
  used for dimension reduction. Can you help me bridge the gap by providing
  something to read on?

 LDA is an overloaded term. To the frequentist, it usually means Linear
 Discriminant Analysis, which is what you're talking about; to the
 bayesian machine learning people, it means Latent Dirichlet
 Allocation, which is a probabilistic dimensionality reduction
 technique for projecting documents in V-dimensional space to the
 K-simplex, with K \ll V.

 -- David

 
 
  The approach in your reference (3) is highly amenable to parallel
  implementation.
 
  Yes, I felt so too, but again did not want to comment on it until I had
  the MapReduce basics related to it.
 
 
 
  Large-scale SVD would be a very interesting application for Mahout.
 
 
 
  On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni 
 atulskulka...@gmail.com
  wrote:
 
   Is there anyone doing the SVD part or are their any SVD algorithm
   implementation on Hadoop? If there are then I would like to implement
 the
   methods described in [1],[2],[3] for matrix factorization, in
 specific.
  
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 
 
 
 
  --
  Regards,
  Atul Kulkarni
  Teaching Assistant,
  Department of Computer Science,
  University of Minnesota Duluth
  Duluth. 55805.
  www.d.umn.edu/~kulka053 http://www.d.umn.edu/%7Ekulka053
 




-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.
www.d.umn.edu/~kulka053


Re: [GSOC] Ranking Process

2009-04-01 Thread Richard Tomsett
I'm preparing an application, but haven't submitted yet as I was
waiting on confirmation of my student status. Now that I know I'm
eligible, I'll get my application in soon :)

2009/4/1 Ted Dunning ted.dunn...@gmail.com:
 I only see two applications for Mahout, one reasonably strong, one much less
 so.

 Are there students out there who still need to prepare an application?

 The deadline is coming up fast.

 2009/3/31 Grant Ingersoll gsing...@apache.org

 FYI: http://wiki.apache.org/general/RankingProcess

 -Grant




 --
 Ted Dunning, CTO
 DeepDyve



Re: [GSOC] Ranking Process

2009-04-01 Thread Grant Ingersoll
Hmm, I see several in there, but they aren't all labeled w/ Mahout, so  
that may be why.  I also expanded to see 100 at a time.


-Grant

On Mar 31, 2009, at 8:43 PM, Ted Dunning wrote:

I only see two applications for Mahout, one reasonably strong, one much less
so.

Are there students out there who still need to prepare an application?

The deadline is coming up fast.

2009/3/31 Grant Ingersoll gsing...@apache.org


FYI: http://wiki.apache.org/general/RankingProcess

-Grant





--
Ted Dunning, CTO
DeepDyve




Re: [GSOC] Ranking Process

2009-04-01 Thread Grant Ingersoll
The other thing to note, here, is that people should be aware that the  
ASF is only going to get a certain number of slots from Google (last  
year, it was somewhere in the 30-40 range, I think), which are  
distributed across all projects that have expressed an interest in  
mentoring.  While Mahout has 4 interested mentors, that does not mean  
Mahout will get 4 projects.


At any rate, best of luck to everyone.  If you don't get picked, we  
still welcome your contributions!  Remember, open source is an  
excellent resume builder.


Cheers,
Grant

On Mar 31, 2009, at 4:43 PM, Grant Ingersoll wrote:


FYI: http://wiki.apache.org/general/RankingProcess

-Grant





Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Ted Dunning
The machinery of SVD is almost always described in terms of least squares
matrix approximation without mentioning the probabilistic underpinnings of
why least-squares is a good idea.  The connection, however, goes all the way
back to Gauss' reduction of planetary position observations (this is *why*
the normal distribution is often called a Gaussian).  Gauss provided such a
compelling rationale for both the normal distribution (what I called a
Gaussian below) and the resulting least squared error formulation of the
estimation problem that everybody has just assumed that least-squared-error
estimation is the way to go.  Generally this is a pretty good
approximation.  Occasionally it is not at all good.  One place where it is a
really bad approximation is with very sparse count data.  Netflix data is a
great example; text represented as word counts per document is another.

To fill in more detail, here is a relatively jargon-filled explanation of
the connection.  I apologize for not being able to express this more
lucidly.

A more general view of both SVD and LDA are that they find probabilistic
mixture models to describe data.   SVD finds a single mixture of Gaussian
distributions that all have the same variance and uses maximum likelihood to
find this mixture.  LDA finds a multi-level mixture of multinomial models
and gives you a distribution of models that represents the distribution of
possible models given your data and explicit assumptions.

Gaussian distributions and multinomials look quite different, but for
relatively large observed counts their log-likelihood functions become very
similar.  For Gaussians, the log-likelihood is just the sum of squared
deviations from the mean.  For large counts, the log-likelihood for
multinomials approximates squared deviations from the mean.
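
Ted's last point can be checked numerically: for large counts, the multinomial
log-likelihood-ratio statistic is close to the squared-deviation statistic a
Gaussian model yields. A small self-contained sketch (not from the thread; the
probabilities and sample size are arbitrary):

```python
import math
import random

random.seed(1)

def sample_multinomial(n, probs):
    # Draw category counts for n independent trials.
    counts = [0] * len(probs)
    for _ in range(n):
        r, acc = random.random(), 0.0
        idx = len(probs) - 1            # fall back to the last category
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                idx = i
                break
        counts[idx] += 1
    return counts

probs = [0.5, 0.3, 0.2]
n = 100_000
counts = sample_multinomial(n, probs)

# Multinomial log-likelihood-ratio statistic (G) versus the sum of squared
# deviations a Gaussian model would use (Pearson's X^2). For large counts
# they nearly coincide, which is the point made above.
G = 2 * sum(c * math.log(c / (n * p)) for c, p in zip(counts, probs) if c > 0)
X2 = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

print(G, X2)  # the two statistics agree closely
```

Shrink `n` to a few dozen and the two statistics start to diverge, which is the
sparse-count regime where the Gaussian/least-squares view breaks down.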


On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni atulskulka...@gmail.comwrote:

 I do not understand the relation in LDA and SVD. In my limited
 understanding
 I understand LDA transforms data points in to a coordinate system  where
 they can be easily discriminated/classified. SVD on the other hand is used
 for dimension reduction, can you help me bridge the gap by providing
 something to read on?




-- 
Ted Dunning, CTO
DeepDyve


Re: [GSOC] Ranking Process

2009-04-01 Thread Ted Dunning
Let me second that.  When I am hiring a student without professional
experience, it is almost a perfect predictor: if they have done
significant work on a significant outside project, they will get an interview
with me; if not, they won't.

Moreover, if I have a candidate at any level who has made significant
contributions to a major open source project, I generally don't even drill
much more on code hygiene issues.  The standards in most open source
projects regarding testing and continuous integration are high enough that I
don't have to worry about whether the applicant understands how to code and
how to code with others.

On the other hand, the only use I make of the list of buzzwords generally
found under skills on a resume is that I start at the end of the list and
ask a question about that area's fundamentals to see if the student is
padding their list.  When interviewing with me, don't ever put anything on
your resume that you don't really know.

I don't know how widespread my attitude is, but I can't believe I am alone
in this.

On Wed, Apr 1, 2009 at 3:42 AM, Grant Ingersoll gsing...@apache.org wrote:

 Remember, open source is an excellent resume builder.




-- 
Ted Dunning, CTO
DeepDyve


Re: gsoc , EM or SVM?

2009-04-01 Thread Grant Ingersoll

Hi Yifan,

I think both are good candidates, although AIUI, SVM is a bit harder  
to parallelize, so maybe it would make sense to focus on EM.  Of  
course, we don't have to be distributed, so you could propose a
non-distributed SVM implementation as a first cut and then work on the
distributed part as the project develops.



-Grant

On Mar 31, 2009, at 2:48 AM, Yifan Wang wrote:

Hi, my name is Yifan. I submitted a proposal for GSoC this year. I am
interested in the classification and clustering algorithms, because I need
one such algorithm for the experimental project that I started myself for
text classification and clustering.

In my proposal, I planned to implement two machine learning algorithms: EM
and SVM. But it seems a bit much to implement two algorithms in GSoC, so now
I need to choose one of the two.

For EM, it is a generalization of the k-means algorithm, and we already have
k-means in the Mahout library. For SVM, it is quite an important algorithm
for classification, while an implementation of it can be hard.

So, any suggestions on which one would benefit the Mahout library most and
would make a good GSoC candidate?



Best Regards

Yifan





--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: gsoc , EM or SVM?

2009-04-01 Thread Ted Dunning
Yifan,

EM is a highly non-specific term and covers a huge range of very different
algorithms.  For example, pLSI, HMM's, and mixture models can all be
estimated using EM.

What exactly did you mean to address with an EM implementation?

On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hi Yifan,

 I think both are good candidates, although AIUI, SVM is a bit harder to
 parallelize, so maybe it would make sense to focus on EM.  Of course, we
 don't have to be distributed, so you could propose a non-distributed SVM
 implementation as a first cut and then work on the distributed part as the
 project develops.

 ...


 For EM, it is a generalization of the k-means algorithm, and we already
 have
 k-means in the Mahout library.




Re: gsoc , EM or SVM?

2009-04-01 Thread Yifan Wang
I will choose Mixture Model for the EM implementation.

Yifan

2009/4/1 Ted Dunning ted.dunn...@gmail.com:
 Yifan,

 EM is a highly non-specific term and covers a huge range of very different
 algorithms.  For example, pLSI, HMM's, and mixture models can all be
 estimated using EM.

 What exactly did you mean to address with an EM implementation?

 On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hi Yifan,

 I think both are good candidates, although AIUI, SVM is a bit harder to
 parallelize, so maybe it would make sense to focus on EM.  Of course, we
 don't have to be distributed, so you could propose a non-distributed SVM
 implementation as a first cut and then work on the distributed part as the
 project develops.

 ...


 For EM, it is a generalization of the k-means algorithm, and we already
 have
 k-means in the Mahout library.





Re: [GSoC] SimRank Algorithms on Mahout Proposal draft from Xuan Yang

2009-04-01 Thread Robert Burrell Donkin
On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang sailingw...@gmail.com wrote:
 Hello everyone,

    This is my proposal draft.

BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3

- robert


Re: [GSoC] SimRank Algorithms on Mahout Proposal draft from Xuan Yang

2009-04-01 Thread Xuan Yang
Thanks, I have submitted it there. :)

2009/4/2 Robert Burrell Donkin robertburrelldon...@gmail.com:
 On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang sailingw...@gmail.com wrote:
 Hello everyone,

    This is my proposal draft.

 BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3

 - robert




-- 
Xuan Yang


Re: [gsoc] random forests

2009-03-31 Thread deneche abdelhakim

Here is a draft of my proposal

**
Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests

Student: AbdelHakim Deneche
Student e-mail: ...

Student Major: Phd in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011

Organization: The Apache Software Foundation
Assigned Mentor:


Abstract:

My goal is to add the power of random/regression forests to Mahout. At the end 
of this summer one should be able to build random/regression forests for large, 
possibly distributed, datasets, store the forest and reuse it to classify new 
data. In addition, a demo on EC2 is planned.


Detailed Description:

This project is all about random/regression forests. The core component is the 
tree building algorithm from a random bootstrap from the whole dataset. I 
already wrote a detailed description on Mahout Wiki [RandomForests]. Given the 
size of the dataset, two distributed implementations are possible:

1. The most straightforward one deals with relatively small datasets. By small, 
I mean a dataset that can be replicated on every node of the cluster. 
Basically, each mapper has access to the whole dataset, so if the forest 
contains N trees and we have M mappers, each mapper runs the core building 
algorithm N/M times. This implementation is relatively easy because each 
mapper runs the basic building algorithm as it is. It is also of great 
interest if the user wants to try different parameters when building the 
forest. An out-of-core implementation is also possible to deal with datasets 
that cannot fit into the node memory.

2. The second implementation, which is the most difficult, is concerned with 
very large datasets that cannot fit in every machine of the cluster. In this 
case the mappers work differently, each mapper has access to a subset from the 
dataset, thus all the mappers collaborate to build each tree of the forest. The 
core building algorithm must thus be rewritten in a map-reduce form. This 
implementation can deal with datasets of any size, as long as they are on the 
cluster.

Although the first implementation is easier to implement, the CPU and IO 
overhead of the out-of-core implementation is still unknown. A reference, 
non-parallel, implementation should thus be built to better understand the 
effects of the out-of-core implementation, especially for large datasets. This 
reference implementation is also useful to assess the correctness of the 
distributed implementation.
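
The first (replicated-dataset) implementation described above can be sketched as
plain sequential code standing in for the map phase (illustrative Python with
made-up names; the actual plan is a Java map-reduce job, and a stand-in tuple
replaces real decision-tree induction):

```python
import random

def trees_per_mapper(n_trees, n_mappers):
    # Split N trees across M mappers as evenly as possible (the N/M split).
    base, extra = divmod(n_trees, n_mappers)
    return [base + (1 if i < extra else 0) for i in range(n_mappers)]

def run_mapper(mapper_id, n_local_trees, dataset, seed):
    # Each mapper sees the WHOLE replicated dataset and builds its share
    # of the forest from one bootstrap sample per tree. The tuple below
    # is a placeholder for the real tree-building algorithm.
    rng = random.Random(seed + mapper_id)
    part = []
    for t in range(n_local_trees):
        bootstrap = [dataset[rng.randrange(len(dataset))]
                     for _ in range(len(dataset))]
        part.append(("tree", mapper_id, t, len(bootstrap)))
    return part

dataset = list(range(1000))
plan = trees_per_mapper(100, 7)   # -> [15, 15, 14, 14, 14, 14, 14]
forest = []
for m, n_local in enumerate(plan):
    forest.extend(run_mapper(m, n_local, dataset, seed=42))

print(len(forest))  # 100 trees in total
```

The reducer's job in this scheme is then just concatenating the partial forests
and merging the out-of-bag error estimates.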


Working Plan and list of deliverables

Must-Have:
1. reference implementation of Random/Regression Forests Building Algorithm:
 . Build a forest of trees, the basic algorithm (described in the wiki) takes a 
subset from the dataset as a training set and builds a decision tree. This 
algorithm is repeated for each tree of the forest.
 . The forest is stored in a file, this way it can be re-used, at any time, to 
classify new cases
 . At this step, the necessary changes to Mahout's Classifier interface are 
made to extend its use to more than Text datasets.

2. Study the effects of large datasets on the reference implementation
 . This step should guide our choice of the proper parallel implementation

3. Parallel implementation, choose one of the following:
 3a. Parallel implementation A
  . When the dataset can be replicated to all computing nodes.
  . Each mapper has access to the whole dataset, if the forest contains N trees 
and we have M mappers, each mapper runs the basic building algorithm N/M times. 
The mapper is also responsible for computing the out-of-bag error estimation.
 . The reducer stores the trees in the RF file and merges the oob error 
estimations.
 3b. Parallel implementation B:
 . When the dataset is so big that it can no longer fit on every computing 
node, it must be distributed over the cluster.
 . Each mapper has access to a subset from the dataset, thus all the mappers 
collaborate to build each tree of the forest.
 . In this case, the basic algorithm must be rewritten to fit in the map-reduce 
paradigm.

Should-Have:
4. Run the Random Forest with a real dataset on EC2:
 . This step is important, because running the RF on a local dual core machine 
is different from running it on a real cluster with a real dataset.
 . This can make a good demo for Mahout
 . Amazon has put some interesting datasets to play with [PublicDatasets].
   The US Census dataset comes in various sizes ranging from 2 GB to 200 GB, and 
should make a very good example.
 . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix 
Reader).

Wanna-Have:
5. If there is still time, implement one or two other important features of RFs 
such as Variable importance and Proximity estimation


Additional Information:
I am a PhD student at the University Mentouri of Constantine. My primary 
research goal is a framework to help build Intelligent Adaptive Systems. For 
the purpose of my Master, I worked on 

Re: [GSOC] Ranking Process

2009-03-31 Thread Ted Dunning
I only see two applications for Mahout, one reasonably strong, one much less
so.

Are there students out there who still need to prepare an application?

The deadline is coming up fast.

2009/3/31 Grant Ingersoll gsing...@apache.org

 FYI: http://wiki.apache.org/general/RankingProcess

 -Grant




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] random forests

2009-03-31 Thread Ted Dunning
Deneche,

I don't see your application on the GSOC web site.  Nor on the apache wiki.

Time is running out and I would hate to not see you in the program.  Is it
just that I can't see the application yet?

On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim a_dene...@yahoo.frwrote:


 Here is a draft of my proposal

 **
 Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests

 Student: AbdelHakim Deneche
 Student e-mail: ...

 Student Major: Phd in Computer Science
 Student Degree: Master in Computer Science
 Student Graduation: Spring 2011

 Organization: The Apache Software Foundation
 Assigned Mentor:


 Abstract:

 My goal is to add the power of random/regression forests to Mahout. At the
 end of this summer one should be able to build random/regression forests for
  large, possibly distributed, datasets, store the forest and reuse it to
 classify new data. In addition, a demo on EC2 is planned.


 Detailed Description:

 This project is all about random/regression forests. The core component is
 the tree building algorithm, which grows a tree from a random bootstrap
 sample of the whole dataset. I already wrote a detailed description on the
 Mahout Wiki [RandomForests]. Given the size of the dataset, two distributed
 implementations are possible:

 1. The most straightforward one deals with relatively small datasets. By
 small, I mean a dataset that can be replicated on every node of the cluster.
 Basically, each mapper has access to the whole dataset, so if the forest
 contains N trees and we have M mappers, each mapper runs the core building
 algorithm N/M times. This implementation is relatively easy because each
 mapper runs the basic building algorithm as it is. It is also of great
 interest if the user wants to try different parameters when building the
 forest. An out-of-core implementation is also possible to deal with datasets
 that cannot fit into the node's memory.
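The N-trees-over-M-mappers scheme above can be sketched outside Hadoop in a few lines. This is only an illustration of the work split, not Mahout code: the names (`build_tree`, `mapper`, `reducer`) are hypothetical, and the real tree induction is replaced by a stub that just records a bootstrap sample.

```python
import random

def build_tree(dataset, rng):
    # Stand-in for real tree induction: a bootstrap sample (sampling
    # with replacement) defines the training set of one tree.
    bootstrap = [rng.choice(dataset) for _ in dataset]
    return {"n_training_rows": len(bootstrap)}

def mapper(mapper_id, num_mappers, num_trees, dataset):
    # Mapper i builds its share of the forest; the whole dataset is
    # assumed replicated on every node (e.g. via DistributedCache).
    rng = random.Random(mapper_id)  # per-mapper seed for reproducibility
    share = [t for t in range(num_trees) if t % num_mappers == mapper_id]
    return [(t, build_tree(dataset, rng)) for t in share]

def reducer(mapper_outputs):
    # The reducer just concatenates the trees into one forest model.
    forest = {}
    for output in mapper_outputs:
        forest.update(dict(output))
    return forest

dataset = list(range(100))                          # toy dataset
outputs = [mapper(i, 4, 10, dataset) for i in range(4)]
forest = reducer(outputs)
print(len(forest))                                  # 10 trees from 4 "mappers"
```

With the dataset replicated, the mappers never need to communicate, which is what makes this variant the easy one.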

 2. The second implementation, which is the most difficult, is concerned
 with very large datasets that cannot fit on every machine of the cluster. In
 this case the mappers work differently: each mapper has access to a subset
 of the dataset, so all the mappers collaborate to build each tree of the
 forest. The core building algorithm must thus be rewritten in a map-reduce
 form. This implementation can deal with datasets of any size, as long as
 they are on the cluster.

 Although the first implementation is easier, the CPU and IO overhead of
 the out-of-core variant is still unknown. A reference, non-parallel
 implementation should thus be built to better understand the effects of
 the out-of-core approach, especially for large datasets. This reference
 implementation is also useful to assess the correctness of the
 distributed implementation.


 Working Plan and list of deliverables

 Must-Have:
 1. reference implementation of Random/Regression Forests Building
 Algorithm:
  . Build a forest of trees: the basic algorithm (described in the wiki)
 takes a subset of the dataset as a training set and builds a decision
 tree. This algorithm is repeated for each tree of the forest.
  . The forest is stored in a file so that it can be re-used, at any time,
 to classify new cases.
  . At this step, the necessary changes to Mahout's Classifier interface are
 made to extend its use beyond Text datasets.

 2. Study the effects of large datasets on the reference implementation
  . This step should guide our choice of the proper parallel implementation

 3. Parallel implementation, choose one of the following:
  3a. Parallel implementation A
  . When the dataset can be replicated to all computing nodes.
  . Each mapper has access to the whole dataset; if the forest contains N
 trees and we have M mappers, each mapper runs the basic building algorithm
 N/M times. The mapper is also responsible for computing the out-of-bag
 error estimation.
  . The reducer stores the trees in the RF file and merges the oob error
 estimations.
  3b. Parallel implementation B:
  . When the dataset is so big that it can no longer fit on every computing
 node, it must be distributed over the cluster.
  . Each mapper has access to a subset from the dataset, thus all the
 mappers collaborate to build each tree of the forest.
  . In this case, the basic algorithm must be rewritten to fit in the
 map-reduce paradigm.
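The out-of-bag (oob) error estimation mentioned in step 3a can be illustrated with a toy sketch. The classifier here is a deliberately trivial stub (each "tree" just predicts the majority label of its bootstrap sample), so only the oob bookkeeping is real; `oob_error` is a hypothetical name, not a Mahout API.

```python
import random
from collections import Counter

def oob_error(dataset, labels, num_trees, rng):
    # For each tree, draw a bootstrap sample and remember which rows
    # were left out ("out-of-bag"). Indices only; multiplicity of the
    # bootstrap is ignored for this sketch.
    votes = {i: Counter() for i in range(len(dataset))}
    for _ in range(num_trees):
        in_bag = {rng.randrange(len(dataset)) for _ in dataset}
        # stub "tree": predict the majority label of its training sample
        majority = Counter(labels[i] for i in in_bag).most_common(1)[0][0]
        for i in range(len(dataset)):
            if i not in in_bag:
                votes[i][majority] += 1     # oob vote from this tree
    # oob error = fraction of rows whose oob majority vote is wrong
    wrong = sum(1 for i, v in votes.items()
                if v and v.most_common(1)[0][0] != labels[i])
    voted = sum(1 for v in votes.values() if v)
    return wrong / voted

rng = random.Random(0)
labels = [0] * 80 + [1] * 20                 # toy 80/20 labels
err = oob_error(list(range(100)), labels, 50, rng)
print(err)                                   # fraction of oob rows misclassified
```

The point of the merge step in the reducer is that these per-mapper vote tallies can simply be summed before taking the majority.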

 Should-Have:
 4. Run the Random Forest with a real dataset on EC2:
  . This step is important, because running the RF on a local dual core
 machine is different from running it on a real cluster with a real dataset.
  . This can make a good demo for Mahout
  . Amazon has put some interesting datasets to play with [PublicDatasets].
   The US Census dataset comes in various sizes ranging from 2 GB to 200 GB
 and should make a very good example.
  . At this stage it may be useful to implement [MAHOUT-71] (Dataset to
 Matrix Reader).

 Wanna-Have:
 5. If there is still time, 

Re: [gsoc] random forests

2009-03-30 Thread deneche abdelhakim

Thank you for your answer; it made me aware of many possible future problems 
hidden in my implementation.

 The first is that for any given application, the odds that
 the data will not fit in a single machine are small, especially if you 
 have an out-of-core tree builder.  Really, really big datasets are
 increasingly common, but are still a small minority of all datasets.

By out-of-core, do you mean that the builder can fetch the data directly from 
a file instead of working only in memory?

 One question I have about your plan is whether your step (1) involves
 building trees or forests only from data held in memory or whether it 
 can be adapted to stream through the data (possibly several
 times).  If a streaming implementation is viable, then it may well be 
 that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's 
DistributedCache. I think a streaming implementation is feasible: the basic 
tree building algorithm (described here 
http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream 
through the data (either in memory or from a file) for each node of the tree, 
computing the information gain (IG) of the selected variables during the pass. 
This algorithm could be improved to compute the IGs for a list of nodes at 
once, thus reducing the total number of passes through the data. When building 
the forest, the list of nodes comes from all the trees built by the mapper.
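A rough sketch of such a streaming pass: one scan over the data accumulates class counts for a whole batch of frontier nodes, and the information gains are derived afterwards. The `reaches`/`split` callbacks per node are hypothetical stand-ins for "does this row reach this node" and "which branch does it take"; this illustrates the batching idea only, not Mahout code.

```python
import math
from collections import defaultdict

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def streaming_gain(rows, frontier):
    # One pass over the data ("rows" could just as well be a file
    # iterator); "frontier" maps a node id to (reaches, split) callbacks.
    # We accumulate class counts per node and per split branch, then
    # derive the information gain of every frontier node afterwards.
    node_counts = defaultdict(lambda: defaultdict(int))
    branch_counts = defaultdict(
        lambda: defaultdict(lambda: defaultdict(int)))
    for row, label in rows:
        for node, (reaches, split) in frontier.items():
            if reaches(row):
                node_counts[node][label] += 1
                branch_counts[node][split(row)][label] += 1
    gains = {}
    for node, counts in node_counts.items():
        total = sum(counts.values())
        child = sum(sum(bc.values()) / total * entropy(bc)
                    for bc in branch_counts[node].values())
        gains[node] = entropy(counts) - child
    return gains

rows = [((x,), x > 5) for x in range(10)]             # toy: label is x > 5
frontier = {0: (lambda r: True, lambda r: r[0] > 5)}  # one root node
gains = streaming_gain(rows, frontier)
print(gains)   # node 0: gain ~ 0.971 (the split is perfect, child entropy 0)
```

The memory cost is exactly the count tables, which is why capping the number of frontier nodes per pass trades passes for memory, as discussed below.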

 Another way to put this is that the key question is how single node
 computation scales with input size.  If the scaling is relatively linear
 with data size, then your approach (3) will work no matter the data size.
 If scaling shows an evil memory size effect, then your approach (2) 
 would be required for large data sets.

I'll have to run some tests before answering this question, but I think that 
the memory usage of the improved algorithm (described above) will mainly be 
needed to store the IG computations (variable probabilities...). One way to 
limit the memory usage is to limit the number of tree-nodes computed at each 
data pass. Increasing this limit should reduce the number of data passes but 
increase the memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle: 
very large datasets that cannot fit in the node hard-drive, and thus must be 
distributed across the cluster.

abdelHakim
--- En date de : Lun 30.3.09, Ted Dunning ted.dunn...@gmail.com a écrit :

 De: Ted Dunning ted.dunn...@gmail.com
 Objet: Re: [gsoc] random forests
 À: mahout-dev@lucene.apache.org
 Date: Lundi 30 Mars 2009, 0h59
 I have two answers for you.
 
 The first is that for any given application, the odds that
 the data will not
 fit in a single machine are small, especially if you have
 an out-of-core
 tree builder.  Really, really big datasets are
 increasingly common, but are
 still a small minority of all datasets.
 
 The second answer is that the odds that SOME mahout
 application will be too
 large for a single node are quite high.
 
 These aren't contradictory.  They just describe the
 long-tail nature of
 problem sizes.
 
 One question I have about your plan is whether your step
 (1) involves
 building trees or forests only from data held in memory or
 whether it can be
 adapted to stream through the data (possibly several
 times).  If a streaming
 implementation is viable, then it may well be that
 performance is still
 quite good for small datasets due to buffering.
 
 If streaming works, then a single node will be able to
 handle very large
 datasets but will just be kind of slow.  As you point
 out, that can be
 remedied trivially.
 
 Another way to put this is that the key question is how
 single node
 computation scales with input size.  If the scaling is
 relatively linear
 with data size, then your approach (3) will work no matter
 the data size.
 If scaling shows an evil memory size effect, then your
 approach (2) would be
 required for large data sets.
 
 On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.frwrote:
 
  My question is : when Mahout.RF will be used in a real
 application, what
  are the odds that the dataset will be so large that it
 can't fit on every
  machine of the cluster ?
 
  the answer to this question should help me decide
 which implementation I'll
  choose.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 
 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 www.deepdyve.com
 408-773-0110 ext. 738
 858-414-0013 (m)
 408-773-0220 (fax)
 





Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
Indeed.  And those datasets exist.

It is also plausible that this full data scan approach will fail when you
want the forest building to take less time.

It is also plausible that a full data scan approach fails to improve enough
on a non-parallel implementation.  This would happen if a significantly
large fraction of the entire forest could be built on a single node.  That
would happen if the CPU requirements for forest building are overshadowed by
the I/O cost of scanning the data set.  This would imply that there is a
small limit to the amount of parallelism that would help.

You will know much more about this after you finish the non-parallel
implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.frwrote:

 There is still one case that this approach, even out-of-core, cannot
 handle: very large datasets that cannot fit in the node hard-drive, and thus
 must be distributed across the cluster.




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
I suggest that we all learn from the experience you are about to have on the
reference implementation.

And, yes, I did mean the reference implementation when I said
non-parallel.  Thanks for clarifying.

On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.frwrote:

 What do you suggest ?

 And just to make sure, by 'non-paralel implementation' you mean the
 reference implementation, right ?




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] random forests

2009-03-29 Thread Ted Dunning
I have two answers for you.

The first is that for any given application, the odds that the data will not
fit in a single machine are small, especially if you have an out-of-core
tree builder.  Really, really big datasets are increasingly common, but are
still a small minority of all datasets.

The second answer is that the odds that SOME mahout application will be too
large for a single node are quite high.

These aren't contradictory.  They just describe the long-tail nature of
problem sizes.

One question I have about your plan is whether your step (1) involves
building trees or forests only from data held in memory or whether it can be
adapted to stream through the data (possibly several times).  If a streaming
implementation is viable, then it may well be that performance is still
quite good for small datasets due to buffering.

If streaming works, then a single node will be able to handle very large
datasets but will just be kind of slow.  As you point out, that can be
remedied trivially.

Another way to put this is that the key question is how single node
computation scales with input size.  If the scaling is relatively linear
with data size, then your approach (3) will work no matter the data size.
If scaling shows an evil memory size effect, then your approach (2) would be
required for large data sets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.frwrote:

 My question is : when Mahout.RF will be used in a real application, what
 are the odds that the dataset will be so large that it can't fit on every
 machine of the cluster ?

 the answer to this question should help me decide which implementation I'll
 choose.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)


Re: [gsoc] random forests

2009-03-28 Thread deneche abdelhakim

A correction to step 2a below: the bullet should read

 . This implementation is, relatively, easy given...

--- En date de : Sam 28.3.09, deneche abdelhakim a_dene...@yahoo.fr a écrit :

 De: deneche abdelhakim a_dene...@yahoo.fr
 Objet: Re: [gsoc] random forests
 À: mahout-dev@lucene.apache.org
 Date: Samedi 28 Mars 2009, 16h14
 
 I'm actually writing my working plan, and it looks like
 this:
 
 *
 1. reference implementation of Random/Regression Forests
 Building Algorithm: 
  . Build a forest of trees, the basic algorithm (described
 in the wiki) takes a subset from the dataset as a training
 set and builds a decision tree. This basic algorithm is
 repeated for each tree of the forest. 
  . The forest is stored in a file, this way it can be used
 later to classify new cases
 
 2a. distributed Implementation A: 
  . When the dataset can be replicated to all computing
 nodes.
  . Each mapper has access to the whole dataset, if the
 forest contains N trees and we have M mappers, each mapper
 runs the basic building algorithm N/M times.
  . This implementation is, relatively, given that the
 reference implementation is available, because each mapper
 runs the basic building algorithm as it is.
 
 2b. Distributed Implementation B:
  . When the dataset is so big that it can no longer fit on
 every computing node, it must be distributed over the
 cluster. 
  . Each mapper has access to a subset from the dataset,
 thus all the mappers collaborate to build each tree of the
 forest.
  . In this case, the basic algorithm must be rewritten to
 fit in the map-reduce paradigm.
 
 3. Run the Random Forest with a real dataset on EC2:
  . This step is important, because running the RF on a
 local dual core machine is way different from running it on
 a real cluster with a real dataset.
  . This can make for a good demo for Mahout
 
 4. If there is still time, implement one or two other
 important features of RFs such as Variable importance and
 Proximity estimation
 *
 
 It is clear from the plan that I won't be able to do all
 those steps, and in some way I must choose only one
 implementation (2a or 2b) to do. The first implementation
 should take less time to implement than 2b and I'm quite
 sure I can go up to the 4th step, adding other features to
 the RF. BUT the second implementation is the only one
 capable of dealing with very large distributed datasets.
 
 My question is : when Mahout.RF will be used in a real
 application, what are the odds that the dataset will be so
 large that it can't fit on every machine of the cluster ? 
 
 the answer to this question should help me decide which
 implementation I'll choose.
 
 --- En date de : Dim 22.3.09, Ted Dunning ted.dunn...@gmail.com
 a écrit :
 
  De: Ted Dunning ted.dunn...@gmail.com
  Objet: Re: [gsoc] random forests
  À: mahout-dev@lucene.apache.org
  Date: Dimanche 22 Mars 2009, 0h36
  Great expression!
  
  You may be right about the nose-bleed tendency between
 the
  two methods.
  
  On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim
 a_dene...@yahoo.frwrote:
  
   I can't find a no-nose-bleeding algorithm
  
  
  
  
  -- 
  Ted Dunning, CTO
  DeepDyve
  
 
 
 
 





Re: [GSoC] SimRank algorithms on Mahout

2009-03-24 Thread Grant Ingersoll
Graph ranking strategies are something I am very much interested in  
and would love to see in Mahout.  Please do propose.


-Grant

On Mar 24, 2009, at 6:00 AM, Xuan Yang wrote:


Hello everyone,

I am a student from Fudan University, Shanghai, China.

These days I am doing some research work on SimRank, which is a model for
measuring the similarity of objects. SimRank is applicable in any domain
with object-to-object relationships, e.g., web pages with hyperlinks,
papers and authors, customers and commodities, etc. Based on the simple
assumption that two objects are similar if they are related to similar
objects, SimRank is calculated recursively on a directed graph. You can
find the algorithm here:

http://en.wikipedia.org/wiki/SimRank
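For reference, the recurrence from that page can be iterated naively in memory in a few lines; the MapReduce formulations discussed in this thread parallelize exactly this double sum. A plain O(n^2 * d^2) sketch with an illustrative toy graph:

```python
def simrank(in_links, C=0.8, iterations=5):
    # in_links: node -> list of in-neighbours.  s(a,a) = 1;
    # s(a,b) = C / (|In(a)||In(b)|) * sum over i in In(a), j in In(b)
    # of s(i,j); and 0 when either In set is empty.
    nodes = list(in_links)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_links[a] and in_links[b]:
                    s = sum(sim[(i, j)]
                            for i in in_links[a] for j in in_links[b])
                    new[(a, b)] = C * s / (len(in_links[a]) * len(in_links[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Toy graph: both 'cake' and 'pie' are linked to only by 'sugar'.
g = {"sugar": [], "cake": ["sugar"], "pie": ["sugar"]}
sim = simrank(g)
print(sim[("cake", "pie")])   # 0.8: they share their single in-neighbour
```

Keeping the `sim` matrix sparse (dropping near-zero entries) is the thresholding optimization mentioned below.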

I found that the calculation of SimRank suits the Hadoop framework well:
1, the directed graph could be saved in the form of an edge list in
HBase, and the result Sn(a,b) could also be saved in HBase as a matrix.
2, We can distribute all the n^2 pairs to the map nodes to calculate the
SimRank values of the next iteration.
3, There is an optimization method for SimRank's calculation: we can let
map nodes calculate the sum of Rk(Xi, V) = PSUMa(V), where Xi belongs to
the set In(a) and V is an arbitrary node, then hand it to a reduce node.
In the reduce node, if we want to calculate Rk+1(a, b), we only need to
calculate the sum of PSUMa(Yj) in which Yj belongs to In(b);
4, besides, other optimization methods, such as thresholding, could be
used in map nodes and reduce nodes.
Of course, there are some problems:
1, It is true that mapreduce could make the computation of each node
easier. Yet if the volume of data is very huge, the transport latency of
the data will become more and more serious. So, methods to reduce IO
would be very helpful.
2, SimRank computes the similarity between all pairs of nodes. If we map
a group of nodes {A, B, C} into one map node, and {D, E, F} into another
map node, the computation inside set {A, B, C} will be easy, and so will
that inside set {D, E, F}. But when we want to compute the SimRank
between A and D, it will not be very convenient.

I think it would be great to solve these problems and implement a
mapreduce-version of algorithm for SimRank.
I intend to implement this as my Summer of Code project. Would you be
interested in this? And can I get some advices from you?

Thanks a lot,

Xuan Yang




Re: [GSoC] SimRank algorithms on Mahout

2009-03-24 Thread Xuan Yang
OK~ I will do it asap~

By the way, is there any advice?

Thanks a lot~ :)

2009/3/24 Grant Ingersoll gsing...@apache.org

 Graph ranking strategies are something I am very much interested in and
 would love to see in Mahout.  Please do propose.

 -Grant


 On Mar 24, 2009, at 6:00 AM, Xuan Yang wrote:

  Hello everyone,

 I am a student from Fudan University, Shanghai, China.

 These days I am doing some research work on SimRank, which is an model
 measuring similarity of objects. SimRank is applicable in any domain
 with
 object-to-object relationships, e.g., web pages with hyperlinks, papers
 and
 authors, customers and commodities etc.  Based on the simple assumption
 that
 two objects are similar if they are related to similar objects, SimRank
 is
 calculated recursively on directed graph. You can find the algorithm here
 http://en.wikipedia.org/wiki/SimRank

 I found that the calculation of SimRank suits well for the framework of
 Hadoop.
 1, the directed graph could be saved in the form of edge list in hbase.
 And
 the Result Sn(a,b) could also be saved in hbase as matrix.
 2, We can distribute all the n^2 pairs into the map nodes to calculate
 SimRank value of the next iteration.
 3, There is an optimization method for SimRank's calculation: We can let
 map
 nodes calculate the sum of Rk(Xi, V) = PSUMa(V), Xi belongs to the set
 In(a), and V is an arbitrary node, then hand it to reduce node. In reduce
 node: If we want to calculate Rk+1(a, b), we only need to calculate Sum of
 PSUMa(Yj) in which Yj belongs to In(b);
 4, besides, there are other optimization methods such as threshold could
 be
 used in Map nodes and Reduce nodes.
 Of course, there are some problems:
 1, It is true that mapreduce could make the computation of each node more
 easier. Yet if the volume of data is very huge, the transport latency of
 data will become more and more serious. So, methods to reduce IO would be
 very helpful.
 2, SimRank is to compute the similarity between all the nodes. If we map a
 group of nodes {A, B, C} into one map node, and {D, E, F} into another map
 node. The computation inside set {A, B, C} will be easy, so will be set
 {D,
 E, F}. But when we want to compute SimRank between A and D, It will not be
 very convenient.

 I think it would be great to solve these problems and implement a
 mapreduce-version of algorithm for SimRank.
 I intend to implement this as my Summer of Code project. Would you be
 interested in this? And can I get some advices from you?

 Thanks a lot,

 Xuan Yang





Re: GSoC 2009-Discussion

2009-03-24 Thread deneche abdelhakim

Talking about Random Forests, I think there are two possible ways to actually 
implement them:

The first implementation is useful when the dataset is not that big (<= 2 GB, 
perhaps) and thus can be distributed via Hadoop's DistributedCache. In this 
case each mapper has access to the whole dataset and builds a subset of the 
forest.

The second one is related to large datasets, and by large I mean datasets that 
cannot fit on every computing node. In this case each mapper processes a subset 
of the dataset for all the trees.

I'm more interested in the second implementation, so maybe Samuel would be 
interested in the first... but of course only if the community actually needs 
them both :)

--- En date de : Mar 24.3.09, Ted Dunning ted.dunn...@gmail.com a écrit :

 De: Ted Dunning ted.dunn...@gmail.com
 Objet: Re: GSoC 2009-Discussion
 À: mahout-dev@lucene.apache.org
 Date: Mardi 24 Mars 2009, 0h07
 There are other algorithms of serious
 interest.  Bayesian Additive
 Regression Trees (BART) would make a very interesting
 complement to Random
 Forests.  I don't know how important it is to get a
 normal decision tree
 algorithm going because the cost to build these is often
 not that high.
 Boosted decision trees might be of interest, but probably
 not as much as
 BART.
 
 It might also be interesting to work with this student to
 implement some of
 the diagnostics associated with random forests.  There
 is plenty to do.
 
 
 - Original Message 
 
   From: Samuel Louvan samuel.lou...@gmail.com
 
  My questions:
   - I just notice in the mailing archive that other
 student also pretty
   serious to implement random forest algorithm.
 Should I select
     decision tree instead ? (for my
 future GSoC proposal)
   - Actually I found it would be interesting if I
 can combine Apache
   Nutch and Mahout so the idea is to implement web
 page segmentation +
   classifier inside
     a web crawler. By doing this, a
 crawler, for instance, can use the
   output of the classification to  only follow
 certain links that lie on
   informative content parts.
     Is this interesting  make
 sense for you guys?
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 





Re: [GSoC] SimRank algorithms on Mahout

2009-03-24 Thread Ted Dunning
Answering some of your email out of order,

On Mon, Mar 23, 2009 at 10:00 PM, Xuan Yang sailingw...@gmail.com wrote:

 These days I am doing some research work on SimRank, which is an model
 measuring similarity of objects.


Great.



 I think it would be great to solve these problems and implement a
 mapreduce-version of algorithm for SimRank.
 I intend to implement this as my Summer of Code project. Would you be
 interested in this?


This sounds like a fine project.


 And can I get some advices from you?


I am sure you can get lots of advice from this group, both on the algorithm
and on how to code it into a program.

Back to your detailed suggestion.  Here are some of my first thoughts:


 1, the directed graph could be saved in the form of edge list in hbase. And
 the Result Sn(a,b) could also be saved in hbase as matrix.


Hbase or flat files would be a fine way to store this and an edge list is an
excellent way to store the data.

The output matrix should probably be stored as triples containing row,
column and value.


 2, We can distribute all the n^2 pairs into the map nodes to calculate
 SimRank value of the next iteration.


Hopefully you can keep this sparse.  If you cannot, then the algorithm may
not be suitable for use on large data no matter how you parallelize it.

Skipping item 3 because I don't have time right now to analyze it in
detail...


 4, besides, there are other optimization methods such as threshold could be
 used in Map nodes and Reduce nodes.


Thresholding is likely to be a critical step in order to preserve sparsity.


 1, It is true that mapreduce could make the computation of each node more
 easier. Yet if the volume of data is very huge, the transport latency of
 data will become more and more serious.


I think you will find, with map-reduce in general and with Hadoop more
specifically, that as the problem gets larger, the discipline imposed by
the map-reduce formulation on your data transport patterns actually allows
better scaling than you would expect.  Of course, if your data size scales
with n^2, you are in trouble no matter how you parallelize.

A good example came a year or so ago from a machine translation group at a
university in Maryland.  They had a large program that attempted to do
cooccurrence counting on text corpora using a single multi-core machine.
They started to convert this to Hadoop using the simplest possible
representation for the cooccurrence matrix (index, value triples) and
expected that the redundancy of this representation would lead to very bad
results.  Since they expected bad results, they also expected to do lots of
optimization on the map-reduce version.  Also, since the original program
was largely memory based, they expected that the communication overhead of
hadoop would severely hurt performance.

The actual results were that an 18 hour program run on 70 machines took 20
minutes.  This is nearly perfect speedup over the sequential version.  The
moral is that highly sequential transport of large blocks of information can
be incredibly efficient.

So, methods to reduce IO would be
 very helpful.


My first recommendation on this is to wait.  Get an implementation first,
then optimize.  The problems you have will not be the problems you expect.


 2, SimRank is to compute the similarity between all the nodes. If we map a
 group of nodes {A, B, C} into one map node, and {D, E, F} into another map
 node. The computation inside set {A, B, C} will be easy, so will be set {D,
 E, F}. But when we want to compute SimRank between A and D, It will not be
 very convenient.


Map nodes should never communicate to each other.  That is the purpose of
the reduce layer.

I think that what you should do is organize your recursive step so that the
sum happens in the reduce.  Then each mapper would output records where the
key is the index pair for the summation (a and b in the notation used on
Wikipedia) and the reduce does this summation.  This implies that you
change your input format slightly to be variable-length records containing
a node index and the In set for that node.  This transformation is a very
simple, one-time map-reduce step.

More specifically, you would have original input which initially has zero
values for R:

   links: (Node from, Node to, double R)

and a transform MR step that does this to produce an auxiliary file
inputSets: (Node to, List<Node> inputs):

map: (Node from, Node to) -> (to, from)
reduce: (Node to, List<Node> inputs) -> (to, inputs)
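That one-time transform step can be simulated in memory to check the record shapes (a sketch only, not Hadoop code; the sort stands in for the shuffle phase, and `transform` is a hypothetical name):

```python
from collections import defaultdict

def transform(links):
    # map phase: invert each (from, to) edge record to (to, from)
    mapped = [(to, frm) for frm, to in links]
    # shuffle/sort: group inverted records by their "to" key
    grouped = defaultdict(list)
    for key, value in sorted(mapped):
        grouped[key].append(value)
    # reduce output: (to, list of in-neighbours) records
    return dict(grouped)

links = [("a", "c"), ("b", "c"), ("a", "d")]
print(transform(links))   # {'c': ['a', 'b'], 'd': ['a']}
```

Each output record is exactly one node's In set, which is what the later join steps consume.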

Now you need to join the original input to the auxiliary file on both the
from and to indexes.  This join would require two map-reduces, one to join
on the from index and one to join on the to index.  The reduce in the final
step should emit the cross product of the input sets.  Then you need to join
that against the original data, which requires a single map-reduce.
Finally, you need to group on the to index and sum up all of
the distances 

Re: GSoC 2009-Discussion

2009-03-23 Thread Dawid Weiss


 [snip]

  a web crawler. By doing this, a crawler, for instance, can use the
output of the classification to  only follow certain links that lie on
informative content parts.
  Is this interesting  make sense for you guys?


Hi Samuel. This would be of great interest for the Nutch folks, I think. And 
obviously for Mahout, since it would be a practical application of an ML algorithm.


Dawid


Re: GSoC 2009-Discussion

2009-03-23 Thread Otis Gospodnetic

Mmmm :)  This would definitely be very useful to anyone dealing with web 
page parsing and indexing.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Samuel Louvan samuel.lou...@gmail.com
 To: mahout-dev@lucene.apache.org
 Sent: Sunday, March 22, 2009 7:17:11 PM
 Subject: GSoC 2009-Discussion
 
 Hi,
 I just browsed through the idea list in GSoC 2009 and I'm interested
 in working on Apache Mahout.
 Currently, I'm doing my master's project at my university, related to
 machine learning + information retrieval. More specifically,
 it's about how to discover informative content in a web page using a
 machine learning approach.
 
 Overall, there are two stages in this task, namely web page
 segmentation and locating the informative content.
 The web page segmentation process takes a DOM tree representation of an
 HTML document and then groups the DOM nodes
 at a certain granularity. Next, a classification task is performed on
 the DOM nodes to assign a binary class: informative
 content or non-informative content. The features used
 for the classification are, for example, inner HTML length,
 inner text length, stop word ratio, offsetHeight, coordinates of the
 HTML element in the browser, etc.
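A few of the listed per-node features can be computed with straightforward string handling. A sketch under stated assumptions: the stop-word list is purely illustrative, `node_features` is a hypothetical name, and the layout features (offsetHeight, on-screen coordinates) are omitted because they need a rendering engine.

```python
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative only

def node_features(inner_html, inner_text):
    # Features per DOM node: inner HTML length, inner text length,
    # and the fraction of words that are stop words.
    words = inner_text.lower().split()
    stop_ratio = (sum(1 for w in words if w in STOP_WORDS) / len(words)
                  if words else 0.0)
    return {
        "html_len": len(inner_html),
        "text_len": len(inner_text),
        "stop_word_ratio": stop_ratio,
    }

f = node_features("<p>The best of pies</p>", "The best of pies")
print(f["stop_word_ratio"])   # 0.5: 'the' and 'of' out of 4 words
```

Feature vectors like this, paired with the user-supplied labels, are what the decision tree / random forest classifier would train on.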
 
 The dataset is generated by a labeling program that I made (for
 supervised learning). Basically, a user can
 select & annotate a particular segment of the web page and then mark
 the class label as informative or non-informative content.
 
 I did some small experiments with this last semester: I played with
 WEKA and tried some algorithms, namely random forests,
 decision trees, SVMs, and neural networks. In these experiments, random
 forests and decision trees yielded the most satisfying results.
 
 Currently, I'm working on my master's project and will implement a
 machine learning algorithm, either decision tree or random forest,
 for the classifier. For this reason, I'm very interested in working on
 Apache Mahout in this year's GSoC to implement one of those
 algorithms.
 
 
 My questions:
 - I just noticed in the mailing archive that another student is also pretty
 serious about implementing the random forest algorithm. Should I select
   decision tree instead? (for my future GSoC proposal)
 - Actually, I found it would be interesting if I could combine Apache
 Nutch and Mahout, so the idea is to implement web page segmentation +
 classifier inside
   a web crawler. By doing this, a crawler, for instance, can use the
 output of the classification to only follow certain links that lie on
 informative content parts.
   Is this interesting & does it make sense for you guys?
 
 Maybe for more details, you can download my presentation slides and
 master project desription at
 http://rapidshare.com/files/212352116/Slide_Doc.zip
 
 A little bit of background about me: I'm a 2nd year Master's student at TU
 Eindhoven, Netherlands.
 Last year I also participated in GSoC with OpenNMS
 (http://code.google.com/soc/2008/opennms/appinfo.html?csaid=EDA725BD4D34D481)
 
 
 Looking forward for your feedback and input.
 
 
 
 Regards,
 Samuel L.



Re: GSOC Mentor

2009-03-20 Thread Grady Laksmono
Hi guys,
I'm actually interested in your project. I haven't started my proposal
yet because I'm still working on my finals now; I'll be writing it soon and
will let you guys know of any updates. But I'm generally interested in this
idea:

http://wiki.apache.org/general/SummerOfCode2008#lucene

I took a Machine Learning class but haven't had the chance to implement an
algorithm. I used Lucene previously, and I have a strong interest in
Machine Learning, so I thought it would be nice if I could spend my summer
implementing a Machine Learning algorithm.

Regards,
Grady

On Fri, Mar 20, 2009 at 4:27 AM, Grant Ingersoll gsing...@apache.orgwrote:

 Hey Gang,

 The ASF has been accepted to participate in GSOC.  If you want to be a
 mentor, you can now sign up to be one.  Just choose to be a part of the ASF.
  http://socghop.appspot.com/program/home/google/gsoc2009

 You should also subscribe to code-awa...@a.o for ASF specific info.

 Note, you have to be a committer to be a mentor.

 -Grant




-- 
Grady Laksmono
gradyfau...@laksmono.com
www.laksmono.com

"I know the plans I have for you," declares the Lord, "plans to prosper you
and not to harm you, plans to give you hope and a future." ~ Jeremiah 29:11 ~


Re: GSoC 09 project ideas...

2009-03-18 Thread Jason Rutherglen
Hi Z.S.,

I'll update LUCENE-1313 after LUCENE-1516 is committed.  I can post the
basic new patch I have for LUCENE-1313 (heavily simplified compared to the
previous patches); however, it will assume LUCENE-1516.  The other area that
will need to be addressed is standard benchmarking of the different realtime
search approaches, as we don't yet know which will be best.

What areas in regard to realtime search are you working on?

-J

On Wed, Mar 18, 2009 at 9:04 AM, Zaid Md. Abdul Wahab Sheikh 
sheikh.z...@gmail.com wrote:

 Hi lucene,
 In this link, http://wiki.apache.org/general/SummerOfCode2009 , there are
 no project ideas for Lucene proper (only ideas for Mahout are listed). Please
 put up some ideas for Lucene there, or mention some popular open
 issues that might be suitable as a GSoC project.
 I would very much like to work on Lucene during Summer of Code 09. I am
 currently researching/doing a project on realtime search.
 It seems a contrib module exists for realtime search in Lucene:
 http://issues.apache.org/jira/browse/LUCENE-1313. Can anyone give me an
 update on its status? Is it sufficient/complete, or should I start
 investigating possibilities of integrating realtime search into Lucene?
 Please comment.

 Z.S.



Re: GSoC 09 project ideas...

2009-03-18 Thread Michael McCandless


I think creating a better Highlighter for Lucene, which is actively
being discussed:

https://issues.apache.org/jira/browse/LUCENE-1522

would make a good GSoC project, but I don't think I have time to mentor.

Realtime search is already in progress, being tracked and iterated on
here:

https://issues.apache.org/jira/browse/LUCENE-1516

The original Ocean patch (LUCENE-1313) that you found was a more ambitious
approach, which after discussions here eventually led to the simpler
approach in LUCENE-1516.

Mike

Abdul Wahab Sheikh wrote:


Hi lucene,
In this link http://wiki.apache.org/general/SummerOfCode2009 , there are
no project ideas for Lucene proper. (Only ideas for Mahout listed).
Please put up some ideas for Lucene there or please mention some popular
open issues that might be suitable as a GSoC project.
I would very much like to work on Lucene during Summer of Code 09. I am
currently researching/doing a project on Realtime search.
It seems, a contrib exists for realtime search in Lucene.
http://issues.apache.org/jira/browse/LUCENE-1313. Can anyone give me an
update on its status? Is that sufficient/complete, or should I start
investigating possibilities of integrating 'realtime' search in Lucene.
Please comment.


Z.S.



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


