Re: GSoC 2014 mentor request
Thanks all, just subscribed to the mentors list.

Regards,
Tommaso

2014-03-21 10:23 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
> ACK from Lucene PMC. I'm also CC'ing ment...@community.apache.org
> (Tommaso, you should subscribe if you haven't already). Thanks Tommaso!
> Sad to have too many students/proposals and too few mentors ...
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Fri, Mar 21, 2014 at 3:43 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>> Dear Lucene PMC, please acknowledge my request to become a mentor for
>> Google Summer of Code 2014 projects for Apache Lucene. My Melange
>> username is tommaso. Thanks and regards, Tommaso
Re: GSoC 2014 mentor request
You should also subscribe to code-awards@a.o. See
http://community.apache.org/gsoc.html for details ...

Thanks for being a mentor! We have far too few mentors in Lucene/Solr
unfortunately.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Mar 21, 2014 at 6:23 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> Thanks all, just subscribed to the mentors list. Regards, Tommaso

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: GSoC 2014 mentor request
2014-03-21 11:35 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:
> You should also subscribe to code-awards@a.o.

Strangely, this resulted in the qmail-send program replying:

  code-awards-subscr...@apache.org:
  This mailing list has moved to mentors at community.apache.org.

so I guess mentors@ is enough.

> See http://community.apache.org/gsoc.html for details ... Thanks for
> being a mentor! We have far too few mentors in Lucene/Solr unfortunately.

Right, if I read Jira correctly we have more than 20 proposals!

Thanks,
Tommaso
Re: GSoC 2014 mentor request
Ahh... the list must have moved. Good to know :)

Mike McCandless
http://blog.mikemccandless.com

On Fri, Mar 21, 2014 at 7:04 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> Strangely, this resulted in the qmail-send program replying:
> code-awards-subscr...@apache.org: This mailing list has moved to
> mentors at community.apache.org. So I guess mentors@ is enough.
Re: GSoC
Hi Ivan,

It's best to just add a comment onto LUCENE-466 with your
ideas/questions specific to that issue; other, more general questions
should be sent to this dev list.

Since the big part of that issue (supporting minShouldMatch in
BooleanQuery) was already done, I think fixing the query parsers to
handle it is important but isn't an entire GSoC project? Or, perhaps
it is (we have quite a few query parsers now...). But I think doing
another improvement in addition would be the right amount...

The mentor assignment is somewhat ad hoc, sort of like dating ;) You
should add comments to the issue, adding ideas, asking for
suggestions, asking if anyone will mentor, and then see if any
possible mentors respond. I'm not sure why the issue is assigned to
Yonik; I don't think he's actually working on it.

You could try looking at past GSoC proposals at Apache Lucene to get an idea?

Mike McCandless
http://blog.mikemccandless.com

On Wed, Mar 12, 2014 at 10:40 AM, Ivan Biggs <ivan.c.bi...@vanderbilt.edu> wrote:
> Hello, my name is Ivan Biggs and I'm very interested in working with
> Lucene for my Google Summer of Code project. I've read a lot of the
> relevant documentation and currently have my eye on the issue found here:
> https://issues.apache.org/jira/browse/LUCENE-466?filter=12326260jql=labels%20%3D%20gsoc2014%20AND%20status%20%3D%20Open
> My only concern is that I want to be sure that this issue would be
> considered adequate work for a project in and of itself, or if I
> should plan on tackling perhaps two of these types of issues.
> Furthermore, if anyone could point me in the direction of a possible
> future mentor, it'd be much appreciated, as I'm not quite sure why
> this particular issue has an assignee listed. Also, since Apache
> doesn't have any sort of template or similar guidelines for proposal
> submissions available, any general help or advice as to what sort of
> standards I should be adhering to would be great too!
> Thanks, Ivan
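For readers unfamiliar with the minShouldMatch feature discussed in this thread: a document matches an m-of-n boolean query when at least m of its n SHOULD clauses match, which is what `BooleanQuery`'s minimum-should-match setting enforces at search time. A minimal, Lucene-free sketch of that semantics (the class and method names below are illustrative, not Lucene APIs):

```java
import java.util.List;
import java.util.Set;

public class MinShouldMatchDemo {
    // Returns true if at least minShouldMatch of the SHOULD terms
    // appear in the document's term set -- the m-of-n semantics
    // that the query parsers in this thread need to expose.
    static boolean matches(Set<String> docTerms, List<String> shouldTerms,
                           int minShouldMatch) {
        int hits = 0;
        for (String t : shouldTerms) {
            if (docTerms.contains(t)) {
                hits++;
            }
        }
        return hits >= minShouldMatch;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("lucene", "query", "parser");
        List<String> should = List.of("lucene", "solr", "parser");
        // 2 of the 3 SHOULD terms occur in the document:
        System.out.println(matches(doc, should, 2)); // true
        System.out.println(matches(doc, should, 3)); // false
    }
}
```

The remaining GSoC work discussed here would be surfacing this knob in each query parser's syntax, not implementing the matching itself.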
Re: GSoC
First, thanks so much for getting me pointed in the right direction! I
assume you mean straight on Jira? Also, do you have any clue where one
would be able to find past proposals for Lucene?

Thanks,
Ivan

On Wed, Mar 12, 2014 at 12:08 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> It's best to just add a comment onto LUCENE-466 with your
> ideas/questions specific to that issue; other, more general questions
> should be sent to this dev list.
Re: GSoC
Sorry, yes, please add comments/ideas straight on the Jira issue, i.e.
https://issues.apache.org/jira/browse/LUCENE-466 in this case.

Hmm, I'm not sure how to find past proposals. The links to those
proposals, e.g. from my past blog post and from past Jira issues, seem
to be broken now.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Mar 12, 2014 at 1:25 PM, Ivan Biggs <ivan.c.bi...@vanderbilt.edu> wrote:
> First, thanks so much for getting me pointed in the right direction!
> I assume you mean straight on Jira? Also, do you have any clue where
> one would be able to find past proposals for Lucene?
Re: GSoC 2014 on LUCENE-466: Need QueryParser support for BooleanQuery.minNrShouldMatch
I think a good place to start is on the issue itself. E.g., add a
comment expressing that you're interested in this issue, and maybe
summarize roughly what's entailed.

E.g., that issue is quite old, and the first part of it (supporting
minShouldMatch in BooleanQuery) has already been done, so all that
remains is fixing the QueryParsers to accept it, if they don't
already? I'm not sure, but just this part may be too little for a
whole summer?

Mike McCandless
http://blog.mikemccandless.com

On Thu, Feb 27, 2014 at 10:16 PM, Tao Lin <taolin.bn...@gmail.com> wrote:
> Hello, my name is Tao Lin, a Chinese student from Beijing Normal
> University, Zhuhai Campus. It's great to see that Han Jiang (also a
> Chinese student) has already contributed to Lucene in GSoC 2012 and
> 2013. Likewise, I'd like to participate in GSoC 2014, on the project
> LUCENE-466 [1] (Need QueryParser support for
> BooleanQuery.minNrShouldMatch).
>
> Is this Lucene dev mailing list the place to discuss GSoC projects?
> Who will be the mentor(s) for this project? I see the Assignee of
> LUCENE-466 is Yonik Seeley. How can I get in touch with him? Is
> LUCENE-466 still available as a GSoC 2014 student project?
>
> For a brief self-introduction, I've successfully completed 2 open
> source GSoC projects:
> - In GSoC 2011, I worked for LanguageTool [2] to develop a
>   Lucene-based indexing tool that makes it possible to run
>   proof-reading rules against a large amount of text.
> - In GSoC 2012, I added RDFa metadata support to Apache ODF Toolkit [3].
>
> Yours, Tao Lin
> [1] https://issues.apache.org/jira/browse/LUCENE-466
> [2] http://www.languagetool.org/gsoc2011/
> [3] https://issues.apache.org/jira/browse/ODFTOOLKIT-50
Re: GSOC 2013
Thanks Adrien!

Mike McCandless
http://blog.mikemccandless.com

On Fri, Mar 29, 2013 at 1:49 PM, Adrien Grand <jpou...@gmail.com> wrote:
> Hi, although I probably won't be able to mentor students next summer,
> I think it would be great to have students this year too. I modified
> open JIRA issues from last year's GSOC to add the gsoc2013 label so
> that students can find our project ideas.
> https://issues.apache.org/jira/issues/?jql=(project%20%3D%20%22Lucene%20-%20Core%22%20OR%20project%20%3D%20Solr)%20AND%20labels%20%3D%20gsoc2013
> -- Adrien
Re: GSoC 2013
Hello Raimon,

Depending on the focus of your master's thesis, Lucene / Solr may or
may not be the right project. Basically, if your sentiment analysis
topic is tied to information retrieval (a very simple example: making
a search engine that scores documents by boosting positive ones), then
it could be OK; in this case you could leverage some classification
capabilities Lucene has [1]. Otherwise, if your task is more focused
on the extraction of such sentiments, then other projects may fit
better; see for example OpenNLP, Mahout, or UIMA.

My 2 cents,
Tommaso

[1] http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr

2013/3/19 Raimon Bosch <raimon.bo...@gmail.com>
> Anyone interested?
>
> 2013/3/18 Raimon Bosch <raimon.bo...@gmail.com>
>> Hi all, I would be interested in doing a Google Summer of Code this
>> year with Lucene or Solr. My master's thesis topic is sentiment
>> analysis; is there any research in this direction inside Solr and
>> Lucene? If there is any other interesting topic I would be open to
>> discussing it. Thanks in advance, Raimon Bosch.
Re: GSoC 2013
Hi Tommaso,

Yes, I agree. To use Lucene in this kind of project we would need to
focus on creating a sentiment ranking or improving the text
classification capabilities of Lucene. Integration with other projects
might be interesting, too.

Thanks,
Raimon Bosch.

2013/3/20 Tommaso Teofili <tommaso.teof...@gmail.com>
> Depending on the focus of your master's thesis, Lucene / Solr may or
> may not be the right project. Basically, if your sentiment analysis
> topic is tied to information retrieval, then it could be OK; in this
> case you could leverage some classification capabilities Lucene has.
> Otherwise, if your task is more focused on the extraction of such
> sentiments, then other projects may fit better; see for example
> OpenNLP, Mahout, or UIMA.
Re: GSoC 2013
Anyone interested?

2013/3/18 Raimon Bosch <raimon.bo...@gmail.com>
> Hi all, I would be interested in doing a Google Summer of Code this
> year with Lucene or Solr. My master's thesis topic is sentiment
> analysis; is there any research in this direction inside Solr and
> Lucene? If there is any other interesting topic I would be open to
> discussing it. Thanks in advance, Raimon Bosch.
Re: [GSoC] codec not registered?
Since your test uses PerFieldPostingsFormat, it's going to write the
name of your format ("PForDelta") into the index and expects to be
able to load it via the SPI mechanism. So I think you should register
your PForDeltaPostingsFormat in
lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
so that the SPI mechanism is able to look it up by name.

On Mon, Apr 30, 2012 at 2:39 PM, Han Jiang <jiangha...@gmail.com> wrote:
> Hi, I just imitated MockFixedIntBlock and wrote a simple postings
> format, but when I tried to use "ant test", it told me that:
>
>   A SPI class of type org.apache.lucene.codecs.PostingsFormat with
>   name 'PForDelta' does not exist.
>
> Details are here: http://pastebin.com/EQDLwrn2
> To reproduce the error, you can use the patch and run "mytest-min"
> under trunk/lucene. It is strange that the error happens when calling
> writer.close(), and no error occurs if I change to an existing
> postings format. What did I miss?
> Billy
> --
> Han Jiang
> EECS, Peking University, China

--
lucidimagination.com
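The registration Robert describes is Java's standard ServiceLoader provider-configuration mechanism: the services file lists fully-qualified implementation class names, one per line (lines starting with `#` are comments). A sketch of what the entry might look like, assuming the student's class lives in a hypothetical `pfordelta` package (the package name is illustrative, not from the patch):

```
# File: lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
org.apache.lucene.codecs.pfordelta.PForDeltaPostingsFormat
```

With this entry on the classpath, the SPI lookup can resolve the name 'PForDelta' (as returned by the format's constructor) back to the implementing class at read time.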
Re: [GSoC] codec not registered?
Ah, I see. Thank you Robert!

On Tue, May 1, 2012 at 2:46 AM, Robert Muir <rcm...@gmail.com> wrote:
> Since your test uses PerFieldPostingsFormat, it's going to write the
> name of your format into the index and expects to be able to load it
> via the SPI mechanism. So I think you should register your
> PForDeltaPostingsFormat in
> lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
> so that the SPI mechanism is able to look it up by name.

--
Han Jiang
EECS, Peking University, China
Re: GSoC 2012 - Refactoring IndexWriter (LUCENE-2026)
Hi, here's my first suggestion for the refactoring steps:

The IndexWriter class is very big by now, and I would try to reduce
the code by delegating specialized functions to new components
(pattern: single-responsibility principle, SRP). IndexWriter would
keep most of its APIs and only delegate. I would try to extract the
internals of the following methods into new components; for example,
it could look like this:

- addDocument: component SegmentWriter
- addIndexes: component IndexbasedWriter

What do you think? Other ideas / suggestions / tips? Should I send
this mail to the Lucene mailing list?

Thx for the feedback,
Tim
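The delegation Tim proposes can be sketched in plain Java. Note that `SegmentWriter` and `IndexbasedWriter` are Tim's proposed, hypothetical components, not actual Lucene classes; the point is only that the facade keeps its public API while single-responsibility helpers do the work:

```java
// Hedged sketch of the proposed refactoring: the facade keeps its
// public API but forwards the work to single-purpose components.
public class IndexWriterSketch {
    // Proposed component responsible for writing documents into a segment.
    static class SegmentWriter {
        int docsWritten = 0;
        void addDocument(String doc) {
            // ... real indexing work would happen here ...
            docsWritten++;
        }
    }

    // Proposed component responsible for merging in external indexes.
    static class IndexbasedWriter {
        int indexesAdded = 0;
        void addIndexes(String... dirs) {
            indexesAdded += dirs.length;
        }
    }

    private final SegmentWriter segmentWriter = new SegmentWriter();
    private final IndexbasedWriter indexbasedWriter = new IndexbasedWriter();

    // The facade's API is unchanged; these methods only delegate.
    public void addDocument(String doc) { segmentWriter.addDocument(doc); }
    public void addIndexes(String... dirs) { indexbasedWriter.addIndexes(dirs); }

    public int docsWritten() { return segmentWriter.docsWritten; }
    public int indexesAdded() { return indexbasedWriter.indexesAdded; }
}
```

Callers see no API change; only the internals move, which is what makes this kind of refactoring testable against the existing IndexWriter test suite.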
Re: GSoC - Refactoring IndexWriter
Hey Simon,

Thx for your fast response!

> To begin with, make sure you read this:
> http://wiki.apache.org/lucene-java/SummerOfCode2012
> http://wiki.apache.org/lucene-java/HowToContribute

Okay, I read the documentation.

> Yeah, we have multiple tests for IndexWriter (IW in short); they are
> all basically in /lucene/core/src/test/org/apache/lucene/index.
> There is a bunch of them, but those are only the tests that test the
> IW directly. Lots of other tests are involved. Whatever you do, you
> should run all core tests. The ones with NRT and Threads in the name
> are the most evil :)

What does NRT mean? Okay, I will now check out the trunk and run all
the unit tests.

> I use Eclipse, but you can use the tool you like / know

Ok, thx.
Re: GSoC - Refactoring IndexWriter
Hey Tim, great to have you!

To begin with, make sure you read this:
http://wiki.apache.org/lucene-java/SummerOfCode2012

On Wed, Apr 4, 2012 at 12:20 AM, Achmetow (Google) <achmeto...@googlemail.com> wrote:
> Hi, I am a student from Germany and would like to contribute to the
> ASF Lucene project.

Great!

> I am excited! In the ideas list I have found the following
> interesting project: Refactoring IndexWriter
> (https://issues.apache.org/jira/browse/LUCENE-2026)
> Now I have some questions about this project:
> 1. Do unit tests exist for this code (IndexWriter.java)?

Yeah, we have multiple tests for IndexWriter (IW in short); they are
all basically in /lucene/core/src/test/org/apache/lucene/index. There
is a bunch of them, but those are only the tests that test the IW
directly. Lots of other tests are involved. Whatever you do, you
should run all core tests. The ones with NRT and Threads in the name
are the most evil :)

simonw$ find . -name TestIndexWriter*
./core/src/test/org/apache/lucene/index/TestIndexWriter.java
./core/src/test/org/apache/lucene/index/TestIndexWriterCommit.java
./core/src/test/org/apache/lucene/index/TestIndexWriterConfig.java
./core/src/test/org/apache/lucene/index/TestIndexWriterDelete.java
./core/src/test/org/apache/lucene/index/TestIndexWriterExceptions.java
./core/src/test/org/apache/lucene/index/TestIndexWriterForceMerge.java
./core/src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
./core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
./core/src/test/org/apache/lucene/index/TestIndexWriterMerging.java
./core/src/test/org/apache/lucene/index/TestIndexWriterNRTIsCurrent.java
./core/src/test/org/apache/lucene/index/TestIndexWriterOnDiskFull.java
./core/src/test/org/apache/lucene/index/TestIndexWriterOnJRECrash.java
./core/src/test/org/apache/lucene/index/TestIndexWriterReader.java
./core/src/test/org/apache/lucene/index/TestIndexWriterUnicode.java
./core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java

> 2. Where can I find the code/software for the component? (svn, git, etc.)

Here is a good guideline for getting started:
http://wiki.apache.org/lucene-java/HowToContribute

> 3. Which IDE can I use for this project? Your suggestions (Eclipse)?

I use Eclipse, but you can use the tool you like / know.

> 4. What about coding style guides in the ASF?

We have a code style in Lucene which basically follows the Sun
guidelines. I think there are templates for Eclipse and IntelliJ on
the contribution wiki.

Hope that gets you started!
simon

> Thanks and greetings,
> Tim
Re: [GSoC] About how flexible indexing works in lucene 4.0
On Mon, Mar 26, 2012 at 6:59 PM, Han Jiang <jiangha...@gmail.com> wrote:
> Hi all, I was trying to figure out the control flow of IndexWriter
> and IndexSearcher, in order to get a better understanding of the idea
> behind the Codec implementation. However, there seem to be some
> questions related to the code, which I just find inconvenient to
> discuss here. Maybe it is better to explain how much I understand,
> and ask for your comments? Here is what I understand:
>
> Index time:
> - First of all, IndexWriter should get a Codec configuration from an
>   IndexWriterConfig.
> - When IndexWriter.addDocument is called, an instance of
>   DocumentsWriterPerThread will be created.
> - It then passes the codec information through the indexing chain,
>   and makes an instance of FreqProxTermsWriterPerField to call
>   flush().
> - Then, based on the codec information, we create an instance of
>   TermsConsumer; after this, we iterate over each termID, get the
>   corresponding PostingsConsumer, and save information about each
>   document.
> - Here, by inheriting TermsConsumer and PostingsConsumer, we get
>   IndexWriter to create an index with new postings formats.

That sounds about right! But it's best to think of
FreqProxTermsWriter/PerField as having its own private in-memory
postings format, and then, on flush, it re-parses its in-memory
postings and feeds them to the codec (Fields/Terms/PostingsConsumer)
for writing to the index.

> Query time:
> - Now, let's take phrase search as an example.
> - When IndexSearcher.search(phraseQuery, topN) is called, an instance
>   of PhraseWeight will be created to wrap the query terms.
> - Then, IndexSearcher will create tasks to call the method
>   PhraseWeight.scorer(), inside which two instances, Terms and
>   TermsEnum, will be fetched from the corresponding AtomicReader.
> - With the help of TermsEnum, for every phrase word, related docs and
>   positions will be fetched through a DocsAndPositionsEnum, and the
>   result is thus generated.
> - Here, by inheriting TermsEnum and related *Enum classes, we get
>   IndexSearcher (or IndexReader) to understand our postings formats.

Sounds right!

> And here I have some questions:
> 1. Will multiple AtomicReaders be created if I run a search on an
>    index with several segments? If not, when will there be multiple
>    AtomicReaders? And to further the question, what is the idea
>    behind introducing AtomicReader and CompositeReader into Lucene 4?

Right, it's one atomic reader (SegmentReader) per segment. We split
composite/atomic readers in 4.0 so they'd be strongly typed (they
have different methods, and before the split they'd throw
UnsupportedOperationExceptions from a number of methods, which was
messy).

> 2. I must have missed something during query time, since the subtype
>    of PostingsReaderBase is just absent from what I explained. Is it
>    created when an instance of AtomicReader is fetched from the
>    context? Where can I find the related code?

PostingsWriter/ReaderBase is what our default terms dictionaries
(Block/TreeTermsWriter/Reader) interact with. So, e.g., the
Lucene40PostingsWriter/Reader subclass PostingsWriter/ReaderBase.

> 3. The wiki page here says we should provide an arbitrary skipDocs
>    bit set during enumeration. Then, does the posting list itself
>    remain unchanged, even if I call deleteDocuments()? Will deleted
>    documents still remain in the postings file, even after segments
>    get merged?

Deleted docs are simply marked in a bit set (the liveDocs bits), and
the postings files themselves are unchanged. So when the postings
reader enumerates the postings, it must check the provided live docs
(if not null) to confirm the doc is not deleted.

Mike
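Mike's description of the liveDocs bits can be illustrated without Lucene: the postings stay immutable on disk, and the enumerator simply skips any doc whose bit is cleared. This is a hedged, Lucene-free sketch; the real API uses `Bits` and `DocsEnum`, not arrays and `BitSet`:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocsDemo {
    // Enumerate a term's postings (sorted doc IDs), skipping documents
    // whose bit is cleared in liveDocs. The postings list itself is
    // never rewritten when documents are deleted; only the bit set
    // changes.
    static List<Integer> enumerate(int[] postings, BitSet liveDocs) {
        List<Integer> out = new ArrayList<>();
        for (int docId : postings) {
            // A null liveDocs means "no deletions in this segment".
            if (liveDocs == null || liveDocs.get(docId)) {
                out.add(docId);
            }
        }
        return out;
    }
}
```

This is why deletes are cheap in Lucene: marking a bit is O(1), and the deleted postings are only physically dropped later, when segments are merged.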
Re: [GSoC] Question about LUCENE-3892
Hello,

One quick question up front: are you subscribed to the dev list? If
not, you may have missed my response to your last email with GSoC
questions:
http://lucene.markmail.org/thread/lqv6lyql2nlagv7f#query:+page:1+mid:ubjsvvfviuaexqlo+state:results

Answers below:

On Fri, Mar 23, 2012 at 2:09 PM, Han Jiang <jiangha...@gmail.com> wrote:
> I scanned through some discussions and code around PForDelta, like
> LUCENE-1410, LUCENE-2903, and ConversationBetweenMichaelAndLiLi. It
> is great to see so much information, and PForDelta seems to be a
> promising target. But as I look into the code in branch-bulkpostings,
> it seems that most of the algorithms have already been implemented.
> Then, what is required for LUCENE-3892? Is the main target
> performance improvement, integration with the trunk version, or
> another implementation from the bottom up?

We can work out the scope... but I think success would be a useful
codec committed to 4.0? Ideally, and I think likely, it shows faster
performance than our current default codec, in which case we may want
to change our default, depending on other factors...

I.e., you'd need to bring forward those old patches/branches to the
current codec APIs, do performance testing to understand where they do
well / poorly, whether more disk space is used, etc. Perhaps iterate
on their implementations to improve performance... If the project
succeeds in building a committable PForDelta codec, that would be
awesome! If that somehow winds up being too little, you can explore
other int-block codecs as well...

> And another question about development. I am quite curious that some
> classes such as StandardAnalyzer were not found in the trunk or
> branch-bulkpostings, but were replaced with Mock ones. Then how can I
> test my old code, if I want to integrate these classes with the trunk
> library?

We've moved all real analyzers to the module/analysis... what's in
trunk are test analyzers, which you should use for new tests since
they have more thorough checks.
Mike McCandless
http://blog.mikemccandless.com
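For context on the PForDelta work discussed above: codecs in the PFor/FOR family start from delta-encoding sorted doc IDs, so that only small gaps need to be bit-packed. A minimal delta-gap sketch (this is the common starting point for such codecs, not Lucene's actual implementation):

```java
public class DeltaGapsDemo {
    // Sorted doc IDs are stored as gaps (deltas); small gaps compress
    // well under bit-packing schemes like FOR / PForDelta, where
    // occasional large gaps are handled as "exceptions".
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }

    // Decoding accumulates the gaps back into absolute doc IDs.
    static int[] fromGaps(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int acc = 0;
        for (int i = 0; i < gaps.length; i++) {
            acc += gaps[i];
            docIds[i] = acc;
        }
        return docIds;
    }
}
```

The performance questions Mike raises (speed vs. disk space) come from the next step, choosing how many bits per gap to pack and how to store the exceptions, which is where PForDelta differs from plain FOR.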
Re: [GSoC]About some general information
Hello! Answers below...:

On Wed, Mar 21, 2012 at 11:03 AM, Han Jiang <jiangha...@gmail.com> wrote:
> Hi all, I'm Billy, a senior undergraduate student at Peking
> University. I'm working in the area of Information Retrieval and Web
> Mining. When going through the idea list, I felt quite interested in
> LUCENE-3892 and LUCENE-3069. I am very proficient in Java, and have
> been using Lucene for about one year. I am looking forward to making
> a contribution to this project.

Awesome.

> Here, I have a few questions about Lucene: First of all, which
> version of Lucene shall we use as a starting point? The trunk or 3.5?

Both of these issues will be trunk only, I think: they both are far
easier to do with the Codec API in 4.0.

> Is there any demo code to show the idea of Codecs?

Maybe the simplest demo would be to look at the SimpleText codec? It
roughly tries to have simple source code as well as a simple (text
only, human readable) on-disk format.

> How many postings formats are supposed to be implemented for
> LUCENE-3892?

This can be worked out when scoping the project... but I think getting
one postings format working well would be awesome :) If somehow that's
too easy, then add more!

> Is there any further documentation for LUCENE-3069?

Not that I know of... but I suspect the approach can be very similar
to the MemoryPostingsFormat we already have, just that it'd only be
the terms data stored in the FST, while the postings
(docs/freqs/positions/offsets) are written to a file. Ideally, it
would just act like a different terms dictionary implementation, i.e.
so that we can then plug in any PostingsBaseFormat (even the one from
LUCENE-3892!).

> Thank you!

You're welcome, and welcome to Lucene/Solr!

Mike McCandless
http://blog.mikemccandless.com
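The terms-dictionary idea Mike sketches for LUCENE-3069 (terms in an FST, postings in a separate file) can be approximated with a plain sorted map from term to a file offset. This is a hedged illustration only: a real FST stores the same mapping far more compactly by sharing term prefixes and suffixes, which a TreeMap does not:

```java
import java.util.TreeMap;

public class TermsDictDemo {
    // Maps each term to the offset of its postings in a hypothetical
    // postings file. An FST-based terms dictionary would hold the same
    // term -> offset mapping, just in a prefix/suffix-sharing structure.
    private final TreeMap<String, Long> termToPostingsOffset = new TreeMap<>();

    void addTerm(String term, long postingsOffset) {
        termToPostingsOffset.put(term, postingsOffset);
    }

    // Look up where a term's postings start, or -1 if the term is absent.
    long seek(String term) {
        Long off = termToPostingsOffset.get(term);
        return off == null ? -1L : off;
    }
}
```

This separation is what lets the terms dictionary act as a pluggable component, as Mike notes: any postings format can sit behind the same term-to-offset lookup.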
Re: GSOC 2012?
Mark, can you open an issue for this and label it as: gsoc2012 lucene-gsoc-12 mentor just like this one https://issues.apache.org/jira/browse/LUCENE-2562 thanks, simon On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote: Does anyone have any ideas? A framework for match metadata? Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary attributes, Scorers could provide MatchAttributes to provide arbitrary metadata about the stream of matches they produce. The same model is used - callers decide in advance which attribute decorations they want to consume, and Scorers modify a singleton object which can be cloned if multiple attributes need to be retained by the caller. This helps support highlighting and explain, and enables communication of added information between query objects in the tree. LUCENE-1999 was an example of a horrible work-around where additional match information that was required was smuggled through by bit-twiddling the score - this is because score is the only bit of match context we currently pass in Lucene APIs. Cheers Mark From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Friday, 2 March 2012, 10:30 Subject: GSOC 2012? Hello, I was asked by a student if we are participating in GSOC this year. I hope the answer is yes? If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas? One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework. -- lucidimagination.com
Re: GSOC 2012?
On Fri, Mar 2, 2012 at 11:30 AM, Robert Muir rcm...@gmail.com wrote: Hello, I was asked by a student if we are participating in GSOC this year. I hope the answer is yes? If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas? One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework. +1 I'd love to help somebody to get PositionIterators in!!! simon -- lucidimagination.com
Re: GSOC 2012?
Does anyone have any ideas? A framework for match metadata? Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary attributes, Scorers could provide MatchAttributes to provide arbitrary metadata about the stream of matches they produce. The same model is used - callers decide in advance which attribute decorations they want to consume, and Scorers modify a singleton object which can be cloned if multiple attributes need to be retained by the caller. This helps support highlighting and explain, and enables communication of added information between query objects in the tree. LUCENE-1999 was an example of a horrible work-around where additional match information that was required was smuggled through by bit-twiddling the score - this is because score is the only bit of match context we currently pass in Lucene APIs. Cheers Mark From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Friday, 2 March 2012, 10:30 Subject: GSOC 2012? Hello, I was asked by a student if we are participating in GSOC this year. I hope the answer is yes? If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas? One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework. -- lucidimagination.com
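The MatchAttribute proposal above mirrors the TokenStream attribute model: the consumer declares up front which attributes it wants, the producer mutates a single shared instance per match, and the consumer clones it only when it must retain state across matches. A minimal toy sketch, assuming hypothetical names (MatchAttribute, AttributeSource, ScoreExplanationAttribute are illustrations, not a proposed API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model of the proposed MatchAttribute pattern (all names hypothetical):
// consumers register interest in attributes up front; the producer (Scorer)
// mutates one shared instance per match; consumers clone to retain values.
public class MatchAttributeSketch {
    interface MatchAttribute extends Cloneable {}

    static class ScoreExplanationAttribute implements MatchAttribute {
        String explanation = "";
        @Override public ScoreExplanationAttribute clone() {
            ScoreExplanationAttribute copy = new ScoreExplanationAttribute();
            copy.explanation = explanation;
            return copy;
        }
    }

    static class AttributeSource {
        private final Map<Class<?>, MatchAttribute> attributes = new HashMap<>();
        // Returns the existing attribute instance, or creates one via the
        // factory; repeated calls always hand back the same shared object.
        @SuppressWarnings("unchecked")
        <T extends MatchAttribute> T addAttribute(Class<T> clazz, Supplier<T> factory) {
            return (T) attributes.computeIfAbsent(clazz, c -> factory.get());
        }
    }

    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        // Consumer declares interest before iterating matches.
        ScoreExplanationAttribute attr =
            source.addAttribute(ScoreExplanationAttribute.class, ScoreExplanationAttribute::new);
        // Producer mutates the same instance on each match.
        attr.explanation = "tf=2, idf=1.3";
        // Consumer clones if it must retain the value beyond this match.
        ScoreExplanationAttribute kept = attr.clone();
        attr.explanation = "tf=1, idf=0.7";
        System.out.println(kept.explanation); // tf=2, idf=1.3
    }
}
```

This is the same trade-off the analysis chain makes: one mutable object per attribute avoids per-match allocation, at the cost of requiring an explicit clone when the consumer needs history.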
Re: GSOC 2012?
I created an initial GSOC 2012 page here: http://wiki.apache.org/lucene-java/SummerOfCode2012 simon On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote: Does anyone have any ideas? A framework for match metadata? Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary attributes, Scorers could provide MatchAttributes to provide arbitrary metadata about the stream of matches they produce. The same model is used - callers decide in advance which attribute decorations they want to consume, and Scorers modify a singleton object which can be cloned if multiple attributes need to be retained by the caller. This helps support highlighting and explain, and enables communication of added information between query objects in the tree. LUCENE-1999 was an example of a horrible work-around where additional match information that was required was smuggled through by bit-twiddling the score - this is because score is the only bit of match context we currently pass in Lucene APIs. Cheers Mark From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Friday, 2 March 2012, 10:30 Subject: GSOC 2012? Hello, I was asked by a student if we are participating in GSOC this year. I hope the answer is yes? If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas? One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework. -- lucidimagination.com
Re: GSOC 2012?
Thanks for helping to get this started Simon and Mark! On Fri, Mar 2, 2012 at 7:10 AM, Simon Willnauer simon.willna...@googlemail.com wrote: I created an initial GSOC 2012 page here: http://wiki.apache.org/lucene-java/SummerOfCode2012 simon On Fri, Mar 2, 2012 at 12:26 PM, mark harwood markharw...@yahoo.co.uk wrote: Does anyone have any ideas? A framework for match metadata? Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary attributes, Scorers could provide MatchAttributes to provide arbitrary metadata about the stream of matches they produce. The same model is used - callers decide in advance which attribute decorations they want to consume, and Scorers modify a singleton object which can be cloned if multiple attributes need to be retained by the caller. This helps support highlighting and explain, and enables communication of added information between query objects in the tree. LUCENE-1999 was an example of a horrible work-around where additional match information that was required was smuggled through by bit-twiddling the score - this is because score is the only bit of match context we currently pass in Lucene APIs. Cheers Mark From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Friday, 2 March 2012, 10:30 Subject: GSOC 2012? Hello, I was asked by a student if we are participating in GSOC this year. I hope the answer is yes? If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas? One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework. 
-- lucidimagination.com
Re: GSoC: LUCENE-2308: Separately specify a field's type
2011/5/12 Michael McCandless luc...@mikemccandless.com 2011/5/9 Nikola Tanković nikola.tanko...@gmail.com: Introduction of a FieldType class that will hold all the extra properties now stored inside a Field instance, other than the field value itself. Seems like this is an easy first baby step -- leave current Field class, but break out the type details into a separate class that can be shared across Field instances. Yes, I agree, this could be a good first step. Mike submitted a patch on issue #2308. I think it's a solid base for this. Make that Chris. Ouch, sorry! New FieldTypeAttribute interface will be added to handle extension with new field properties, inspired by IndexWriterConfig. How would this work? What's an example compelling usage? An app could use this for extensibility, and then make a matching codec that picks up this attr? EG, say, maybe for marking that a field is a primary key field and then the codec could optimize accordingly...? Well, that could be a very interesting scenario. It didn't ring a bell for me as a possible codec usage, but it seems very reasonable. Attributes otherwise don't make much sense, unless properly used in custom codecs. How will we ensure attribute and codec compatibility? I'm just thinking we should have concrete reasons in mind for cutting over to attributes here... I'd rather see a fixed, well thought out concrete FieldType hierarchy first... Yes, I couldn't agree more, and I also think Chris has some great ideas in this area, given his work on Spatial indexing, which tends to make use of these additional attributes. Refactoring and dividing of settings for term frequency and positioning can also be done (LUCENE-2048) Ahh great! So we can omit-positions-but-not-TF. Discuss possible effects of completion of LUCENE-2310 on this project This one is badly needed... but we should keep your project focused. We'll tackle this one afterwards. Good. 
Adequate Factory class for easier configuration of new Field instances together with manually added new FieldTypeAttributes FieldType, once instantiated, is read-only. Only the field's value can be changed. OK. Simple hierarchy of Field classes with core properties logically predefaulted. E.g.: NumberField, Can't this just be our existing NumericField? Yes, this is the classic NumericField with the changes proposed in LUCENE-2310. Tim Smith mentioned that the Fieldable class should be kept for custom implementations to reduce the number of setters (for defaults). Chris Male suggested a new CoreFieldTypeAttribute interface, so maybe it should be implemented instead of Fieldable for custom implementations, so that both Fieldable and AbstractField are not needed anymore. In my opinion Field should become abstract, extended by the others. Another proposal: how about keeping only Field (with no hierarchy) and moving the hierarchy to FieldType, such as NumericFieldType, StringFieldType, since this hierarchy concerns type information only? I think a hierarchy of both the types and the value containers that hold the corresponding values could make sense? Hmm, I think we should get more opinions on this one also. e.g. Usage: FieldType number = new NumericFieldType(); Field price = new Field(); price.setType(number); // but this is much cleaner... Field price = new NumericField(); so maybe we should have parallel XYZField with XYZFieldType... Am I overcomplicating this? StringField, This would be like NOT_ANALYZED? Yes, strings are often one word only. Or maybe we can name it NameField, NonAnalyzedField or something. StringField sounds good actually... TextField, This would be ANALYZED? Yes. OK. What is the best way to break this into small baby steps? Hopefully this becomes clearer as we iterate. Well, we know the first step: moving type details into a FieldType class. Yes! Somehow tying into this as well is a stronger decoupling of the indexer from analysis/document. 
I.e., what the indexer needs of a document is very minimal -- just an iterable over indexed and stored values. Separately we can still provide a full-featured Document class w/ add, get, remove, etc., but that's outside of the indexer. I'll get back to this one after additional research. Maybe we should do a couple more iterations, then I'll summarize the conclusions. Mike http://blog.mikemccandless.com Nikola
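The first step discussed above — moving the type details out of Field into a shared, read-only FieldType, with only the field's value remaining mutable — can be sketched in a few lines. This is a hypothetical illustration of the LUCENE-2308 proposal, not the API that was actually committed; the property names are assumptions:

```java
// Toy sketch of the LUCENE-2308 idea: an immutable FieldType shared
// across Field instances; only the field's value can change.
public class FieldTypeSketch {
    static final class FieldType {
        final boolean indexed;
        final boolean stored;
        final boolean tokenized;
        FieldType(boolean indexed, boolean stored, boolean tokenized) {
            this.indexed = indexed;
            this.stored = stored;
            this.tokenized = tokenized;
        }
    }

    // Shared read-only types, roughly StringField (NOT_ANALYZED)
    // vs TextField (ANALYZED) from the proposed hierarchy.
    static final FieldType STRING_TYPE = new FieldType(true, true, false);
    static final FieldType TEXT_TYPE   = new FieldType(true, true, true);

    static final class Field {
        final String name;
        final FieldType type; // fixed at construction
        String value;         // the only mutable part
        Field(String name, String value, FieldType type) {
            this.name = name;
            this.value = value;
            this.type = type;
        }
    }

    public static void main(String[] args) {
        Field id = new Field("id", "doc-42", STRING_TYPE);
        Field title = new Field("title", "Lucene in Action", TEXT_TYPE);
        title.value = "Lucene in Action, 2nd ed."; // value may change...
        System.out.println(id.type.tokenized);      // ...the type may not: false
    }
}
```

The payoff is that thousands of Field instances in a big index can share one FieldType object, and the indexer can trust that a field's type never changes underneath it mid-document.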
Re: GSoC: LUCENE-2308: Separately specify a field's type
2011/4/13 Nikola Tanković nikola.tanko...@gmail.com: Hi all, if everything goes well I'll be delighted to be part of this project this summer together with my assigned mentor Mike. My task will be to introduce new classes to Lucene core which will make it possible to separate a Field's Lucene properties from its value (https://issues.apache.org/jira/browse/LUCENE-2308). Welcome Nikola! Changes will include: Introduction of a FieldType class that will hold all the extra properties now stored inside a Field instance, other than the field value itself. Seems like this is an easy first baby step -- leave current Field class, but break out the type details into a separate class that can be shared across Field instances. New FieldTypeAttribute interface will be added to handle extension with new field properties, inspired by IndexWriterConfig. How would this work? What's an example compelling usage? An app could use this for extensibility, and then make a matching codec that picks up this attr? EG, say, maybe for marking that a field is a primary key field and then the codec could optimize accordingly...? Refactoring and dividing of settings for term frequency and positioning can also be done (LUCENE-2048) Ahh great! So we can omit-positions-but-not-TF. Discuss possible effects of completion of LUCENE-2310 on this project This one is badly needed... but we should keep your project focused. Adequate Factory class for easier configuration of new Field instances together with manually added new FieldTypeAttributes FieldType, once instantiated, is read-only. Only the field's value can be changed. OK. Simple hierarchy of Field classes with core properties logically predefaulted. E.g.: NumberField, Can't this just be our existing NumericField? StringField, This would be like NOT_ANALYZED? TextField, This would be ANALYZED? NonIndexedField, This would be only stored? My questions and issues: Backward compatibility? Will this go to Lucene 3.0? 
Maybe focus on 4.0 for starters and then if there's a nice backport we can do that...? What is the best way to break this into small baby steps? Hopefully this becomes clearer as we iterate. Mike
Re: GSoC Lucene proposals
Done! --- On Wed, 4/6/11, Adriano Crestani adrianocrest...@apache.org wrote: From: Adriano Crestani adrianocrest...@apache.org Subject: GSoC Lucene proposals To: dev@lucene.apache.org Date: Wednesday, 6 April 2011, 22:43 Hi students, We are receiving very good proposals this year, I am sure the mentors are very happy :) I have one suggestion to make our (mentors') lives easier. Please add the JIRA identifier to your proposal's title, for example: LUCENE-2883: Consolidate Solr Lucene FunctionQuery into modules. This will let mentors quickly search for Lucene and Solr proposals, as all Apache proposals are mixed together and there is no way to sort by project. Thanks! --Adriano Crestani
Re: GSoC 2011
Hi Phillipe, You could start by taking a look at these projects: LUCENE-2979 (https://issues.apache.org/jira/browse/LUCENE-2979), LUCENE-2309 (https://issues.apache.org/jira/browse/LUCENE-2309), LUCENE-2450 (https://issues.apache.org/jira/browse/LUCENE-2450), and LUCENE-1768 (https://issues.apache.org/jira/browse/LUCENE-1768). These ones are either related to analyzers/attributes or the query parser. I hope this helps you to decide ;) On Thu, Mar 24, 2011 at 1:09 AM, Phillipe Ramalho phillipe.rama...@gmail.com wrote: Hello, I am planning to submit a project proposal to GSoC 2011, and Lucene seems to have a lot of GSoC projects this year. Last year I did a GSoC project using Lucene for the PhotArk project. This year, instead of just using Lucene, I am planning to contribute code to it. My experience with Lucene is just as a regular user; the only code I have changed/extended so far was token streams/analyzers and the query parser, so I have more knowledge of this part of the code. Based on that, I'm planning to focus on query parser and analyzer/token stream projects. Does that sound reasonable? I will be studying the code and planning the proposal(s), so you should start seeing more posts from me in the next few days. -- Phillipe Ramalho
Re: [GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]
Hey Simon and all, May we get an update on this? I understand that Google has published the list of accepted organizations, which -- not surprisingly -- includes the ASF. Is there any information on how many slots Apache got, and which issues will be selected? The student application period opens on the 28th, so I'm just wondering if I should go ahead and apply or wait for the decision. Thanks, David On 2011 March 11, Friday 17:23:58 Simon Willnauer wrote: Hey folks, Google Summer of Code 2011 is very close and the Project Applications Period has started recently. Now it's time to get some excited students on board for this year's GSoC. I encourage students to submit an application to the Google Summer of Code web-application. Lucene and Solr are amazing projects and GSoC is an incredible opportunity to join the community and push the project forward. If you are a student and you are interested in spending some time on a great open source project while getting paid for it, you should submit your application from March 28 - April 8, 2011. There are only 3 weeks until this process starts! Quote from the GSoC website: We hear almost universally from our mentoring organizations that the best applications they receive are from students who took the time to interact and discuss their ideas before submitting an application, so make sure to check out each organization's Ideas list to get to know a particular open source organization better. So if you have any ideas what Lucene and Solr should have, or if you find any of the GSoC pre-selected projects [1] interesting, please join us on dev@lucene.apache.org [2]. Since you as a student must apply for a certain project via the GSoC website [3], it's a good idea to work on it ahead of time and include the community and possible mentors as soon as possible. Open source development here at the Apache Software Foundation happens almost exclusively in public and I encourage you to follow this. 
Don't mail folks privately; please use the mailing list to get the best possible visibility and attract interested community members and push your idea forward. As always, it's the idea that counts, not the person! That said, please do not underestimate the complexity of even small GSoC projects. Don't try to rewrite Lucene or Solr! A project usually gains more from a smaller, well discussed and carefully crafted, tested feature than from a half-baked monster change that's too large to work with. Once your proposal has been accepted and you begin work, you should give the community the opportunity to iterate with you. We prefer progress over perfection, so don't hesitate to describe your overall vision, but when the rubber meets the road let's take it in small steps. A code patch of 20 KB is likely to be reviewed very quickly so you get fast feedback, while a patch even 60 KB in size can take very long. So try to break up your vision and the community will work with you to get things done! On behalf of the Lucene and Solr community, Go! join the mailing list and apply for GSoC 2011, Simon [1] https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11 [2] http://lucene.apache.org/java/docs/mailinglists.html [3] http://www.google-melange.com
Re: [GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]
On Wed, Mar 23, 2011 at 9:37 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hey Simon and all, May we get an update on this? I understand that Google has published the list of accepted organizations, which -- not surprisingly -- includes the ASF. Is there any information on how many slots Apache got, and which issues will be selected? The student application period opens on the 28th, so I'm just wondering if I should go ahead and apply or wait for the decision. David, you should go ahead and apply via the GSoC website and reference the issue there; this is how I understand it works. We will later rate the proposals from the GSoC website and decide which we choose. This is also when slots get assigned. simon Thanks, David On 2011 March 11, Friday 17:23:58 Simon Willnauer wrote: Hey folks, Google Summer of Code 2011 is very close and the Project Applications Period has started recently. Now it's time to get some excited students on board for this year's GSoC. I encourage students to submit an application to the Google Summer of Code web-application. Lucene and Solr are amazing projects and GSoC is an incredible opportunity to join the community and push the project forward. If you are a student and you are interested in spending some time on a great open source project while getting paid for it, you should submit your application from March 28 - April 8, 2011. There are only 3 weeks until this process starts! Quote from the GSoC website: We hear almost universally from our mentoring organizations that the best applications they receive are from students who took the time to interact and discuss their ideas before submitting an application, so make sure to check out each organization's Ideas list to get to know a particular open source organization better. So if you have any ideas what Lucene and Solr should have, or if you find any of the GSoC pre-selected projects [1] interesting, please join us on dev@lucene.apache.org [2]. 
Since you as a student must apply for a certain project via the GSoC website [3], it's a good idea to work on it ahead of time and include the community and possible mentors as soon as possible. Open source development here at the Apache Software Foundation happens almost exclusively in public and I encourage you to follow this. Don't mail folks privately; please use the mailing list to get the best possible visibility and attract interested community members and push your idea forward. As always, it's the idea that counts, not the person! That said, please do not underestimate the complexity of even small GSoC projects. Don't try to rewrite Lucene or Solr! A project usually gains more from a smaller, well discussed and carefully crafted, tested feature than from a half-baked monster change that's too large to work with. Once your proposal has been accepted and you begin work, you should give the community the opportunity to iterate with you. We prefer progress over perfection, so don't hesitate to describe your overall vision, but when the rubber meets the road let's take it in small steps. A code patch of 20 KB is likely to be reviewed very quickly so you get fast feedback, while a patch even 60 KB in size can take very long. So try to break up your vision and the community will work with you to get things done! On behalf of the Lucene and Solr community, Go! join the mailing list and apply for GSoC 2011, Simon [1] https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11 [2] http://lucene.apache.org/java/docs/mailinglists.html [3] http://www.google-melange.com
Re: GSoC
Ok, I have created a new issue, LUCENE-2959, for this project. I have uploaded the PDFs and added the gsoc2011 and lucene-gsoc-2011 labels as well. David On 2011 March 09, Wednesday 21:58:53 Simon Willnauer wrote: On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote: I think we, Lucene committers, need to identify who is willing to mentor. In my experience, it is less than 5 hours a week. Most of the work is done as part of the community. Sometimes you have to be tough and fail someone (I did last year) but most of the time, if you take the time to interview the candidates up front, it is a good experience for everyone. count me in I'd add it would be useful to have everyone put the lucene-gsoc-11 label on their issues too, that way we can quickly find the Lucene ones. done on at least one ;) simon Also, feel free to label existing bugs. On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote: Hey David and all others who want to contribute to GSoC, the ASF has applied for GSoC 2011 as a mentoring organization. As an ASF project we don't need to apply directly, though, but we do need to register our ideas now. This works like almost anything in the ASF: through JIRA. All ideas should be recorded as JIRA tickets labeled with gsoc2011. Once this is done it will show up here: http://s.apache.org/gsoc2011tasks Everybody who is interested in GSoC as a mentor or student should now read this too: http://community.apache.org/gsoc.html Thanks, Simon On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Please find the implementation plan attached. The word "soon" gets a new meaning when power outages are taken into account. :) As before, comments are welcome. David On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote: I think that is good for now. I should get started on codeawards and wrap up our proposals. I hope I can do that this week. 
simon On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hey, I have written the proposal. Please let me know if you want more / less of certain parts. Should I upload it somewhere? Implementation plan soon to follow. Sorry for the late reply; I have been rather busy these past few weeks. David On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote: Hey David, I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that. On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously). Same here! According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's still some time until the application period. So let me use this week to finish my PhD research plan and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc., so I guess we will have a lot to talk about. :) So from a 1ft view it works like this: 1. Write up a short proposal of what your idea is about 2. Make it public! and publish an implementation plan - how you would want to realize your proposal. If you don't follow that 100% in the actual impl., don't worry. It's just meant to give us an idea that you know what you are doing and where you want to go. Something like a one-page (A4) rough design doc. 3. Give other people the chance to apply for the same suggestion (this is how it works though) 4. Let the ASF / us assign one or more possible mentors to it 5. Let us apply for a slot in GSoC (those are limited for organizations) 6. Get accepted 7. Rock it! (Actually, should we move this discussion private?) No - we usually do everything in public, except discussions within the PMC that are meant to be private for legal reasons or similar things. 
Let's stick to the mailing list for all communication, except when you have something that should clearly not be public. This also gives other contributors a chance to help and get interested in your work!! simon David Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... but we really need all the help we can get, especially from people who have a really good understanding of the various models. In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc): http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible
Re: GSoC
awesome thanks! simon On Thu, Mar 10, 2011 at 11:54 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Ok, I have created a new issue, LUCENE-2959, for this project. I have uploaded the PDFs and added the gsoc2011 and lucene-gsoc-2011 labels as well. David On 2011 March 09, Wednesday 21:58:53 Simon Willnauer wrote: On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote: I think we, Lucene committers, need to identify who is willing to mentor. In my experience, it is less than 5 hours a week. Most of the work is done as part of the community. Sometimes you have to be tough and fail someone (I did last year) but most of the time, if you take the time to interview the candidates up front, it is a good experience for everyone. count me in I'd add it would be useful to have everyone put the lucene-gsoc-11 label on their issues too, that way we can quickly find the Lucene ones. done on at least one ;) simon Also, feel free to label existing bugs. On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote: Hey David and all others who want to contribute to GSoC, the ASF has applied for GSoC 2011 as a mentoring organization. As an ASF project we don't need to apply directly, though, but we do need to register our ideas now. This works like almost anything in the ASF: through JIRA. All ideas should be recorded as JIRA tickets labeled with gsoc2011. Once this is done it will show up here: http://s.apache.org/gsoc2011tasks Everybody who is interested in GSoC as a mentor or student should now read this too: http://community.apache.org/gsoc.html Thanks, Simon On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Please find the implementation plan attached. The word "soon" gets a new meaning when power outages are taken into account. :) As before, comments are welcome. David On Tuesday, February 22, 2011 15:22:57 Simon Willnauer wrote: I think that is good for now. I should get started on codeawards and wrap up our proposals. 
I hope I can do that this week. simon On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hey, I have written the proposal. Please let me know if you want more / less of certain parts. Should I upload it somewhere? Implementation plan soon to follow. Sorry for the late reply; I have been rather busy these past few weeks. David On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote: Hey David, I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that. On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously). Same here! According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's still some time until the application period. So let me use this week to finish my PhD research plan and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc., so I guess we will have a lot to talk about. :) So from a 1ft view it works like this: 1. Write up a short proposal of what your idea is about 2. Make it public! and publish an implementation plan - how you would want to realize your proposal. If you don't follow that 100% in the actual impl., don't worry. It's just meant to give us an idea that you know what you are doing and where you want to go. Something like a one-page (A4) rough design doc. 3. Give other people the chance to apply for the same suggestion (this is how it works though) 4. Let the ASF / us assign one or more possible mentors to it 5. Let us apply for a slot in GSoC (those are limited for organizations) 6. Get accepted 7. Rock it! (Actually, should we move this discussion private?) No - we usually do everything in public, except discussions within the PMC that are meant to be private for legal reasons or similar things. 
Lets stick to the mailing list for all communication except you have something that should clearly not be public. This also give other contributors a chance to help and get interested in your work!! simon David Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... but we really need all the help we can get, especially from people who have a really good understanding of the various models. In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc): http://www.lucidimagination.com/search/document/72787e0e54f798e4/
Re: GSoC
On Wed, Mar 9, 2011 at 3:58 PM, Simon Willnauer simon.willna...@googlemail.com wrote: On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote: I think we, Lucene committers, need to identify who is willing to mentor. In my experience, it is less than 5 hours a week. Most of the work is done as part of the community. Sometimes you have to be tough and fail someone (I did last year), but most of the time, if you take the time to interview the candidates up front, it is a good experience for everyone.

count me in

I'll also be a GSOC mentor! -- Mike http://blog.mikemccandless.com

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: GSoC
I think we, Lucene committers, need to identify who is willing to mentor. In my experience, it is less than 5 hours a week. Most of the work is done as part of the community. Sometimes you have to be tough and fail someone (I did last year), but most of the time, if you take the time to interview the candidates up front, it is a good experience for everyone.

I'd add it would be useful to have everyone put the lucene-gsoc-11 label on their issues too; that way we can quickly find the Lucene ones. Also, feel free to label existing bugs.

On Mar 9, 2011, at 2:11 AM, Simon Willnauer wrote: Hey David and all others who want to contribute to GSoC, the ASF has applied for GSoC 2011 as a mentoring organization. As an ASF project we don't need to apply directly, but we do need to register our ideas now. This works, like almost everything in the ASF, through JIRA. All ideas should be recorded as JIRA tickets labeled with gsoc2011. Once this is done they will show up here: http://s.apache.org/gsoc2011tasks Everybody who is interested in GSoC as a mentor or student should read this too: http://community.apache.org/gsoc.html Thanks, Simon
Re: GSoC
On Wed, Mar 9, 2011 at 5:48 PM, Grant Ingersoll gsing...@apache.org wrote: I think we, Lucene committers, need to identify who is willing to mentor. In my experience, it is less than 5 hours a week. Most of the work is done as part of the community. Sometimes you have to be tough and fail someone (I did last year), but most of the time, if you take the time to interview the candidates up front, it is a good experience for everyone.

count me in

I'd add it would be useful to have everyone put the lucene-gsoc-11 label on their issues too, that way we can quickly find the Lucene ones.

done on at least one ;) simon

Also, feel free to label existing bugs.
Re: GSoC
Hey David and all others who want to contribute to GSoC, the ASF has applied for GSoC 2011 as a mentoring organization. As an ASF project we don't need to apply directly, but we do need to register our ideas now. This works, like almost everything in the ASF, through JIRA. All ideas should be recorded as JIRA tickets labeled with gsoc2011. Once this is done they will show up here: http://s.apache.org/gsoc2011tasks Everybody who is interested in GSoC as a mentor or student should read this too: http://community.apache.org/gsoc.html Thanks, Simon

On Thu, Feb 24, 2011 at 12:14 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Please find the implementation plan attached. The word "soon" gets a new meaning when power outages are taken into account. :) As before, comments are welcome. David
Re: GSoC
I think that is good for now. I should get started on codeawards and wrap up our proposals. I hope I can do that this week. simon

On Tue, Feb 22, 2011 at 3:16 PM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hey, I have written the proposal. Please let me know if you want more / less of certain parts. Should I upload it somewhere? Implementation plan soon to follow. Sorry for the late reply; I have been rather busy these past few weeks. David

On Wednesday, February 02, 2011 10:35:55 Simon Willnauer wrote: Hey David, I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously).

Same here!

According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline there's still some time until the application period. So let me use this week to finish my PhD research plan and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc., so I guess we will have a lot to talk about. :)

So from a high-level view it works like this: 1. Write up a short proposal of what your idea is about. 2. Make it public, and publish an implementation plan - how you want to realize your proposal. If you don't follow it 100% in the actual implementation, don't worry; it's just meant to give us an idea that you know what you are doing and where you want to go - something like a rough one-page (A4) design doc. 3. Give other people the chance to apply for the same suggestion (this is how it works, though). 4. Let the ASF / us assign one or more possible mentors to it. 5. Let us apply for a slot in GSoC (those are limited per organization). 6. Get accepted. 7. Rock it!

(Actually, should we move this discussion private?)

No - we usually do everything in public, except for discussions within the PMC that are meant to be private for legal or similar reasons. Let's stick to the mailing list for all communication unless you have something that clearly should not be public. This also gives other contributors a chance to help and get interested in your work!! simon

David

Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... but we really need all the help we can get, especially from people who have a really good understanding of the various models. In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc.): http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible https://issues.apache.org/jira/browse/LUCENE-2392

On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi all, I have already sent this mail to Simon Willnauer, and he suggested me to post it here for discussion. I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest, Hungary. I am doing IR-related research, and we have considered using Lucene as our search engine. We were quite satisfied with the speed and ease of use. However, we would like to experiment with different ranking algorithms, and this is where problems arise. Lucene only supports the VSM, and unfortunately the ranking architecture seems to be tailored specifically to its needs. I would be very much interested in revamping the ranking component as a GSoC project. The following modifications should be doable in the allocated time frame:

- a new ranking class hierarchy, which is generic enough to allow easy implementation of new weighting schemes (at least bag-of-words ones),
- addition of state-of-the-art ranking methods, such as Okapi BM25, proximity and DFR models,
- configuration for ranking selection, with the old method as default.

I believe all users of Lucene would profit from such a project. It would provide the scientific community with an even more useful research aid, while regular users could benefit from superior ranking results. Please let me know your opinion about this proposal.
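The first and third bullets of the proposal — a generic ranking class hierarchy plus configurable selection with the old method as default — can be sketched as a small strategy hierarchy. This is an illustrative sketch in plain Python with hypothetical names (`Similarity`, `ClassicTfIdf`, `Searcher`), not the actual Lucene API:

```python
import math
from abc import ABC, abstractmethod

# Hypothetical names for illustration only -- not the real Lucene classes.
class Similarity(ABC):
    """Base of a generic ranking class hierarchy: each weighting scheme
    only has to define how one (term, document) pair is scored."""
    @abstractmethod
    def score(self, tf: int, df: int, num_docs: int) -> float:
        ...

class ClassicTfIdf(Similarity):
    """Stands in for the 'old method as default': a classic tf-idf weight."""
    def score(self, tf, df, num_docs):
        return math.sqrt(tf) * (1.0 + math.log(num_docs / (df + 1)))

class Searcher:
    """Ranking selection via configuration; falls back to the old default."""
    def __init__(self, similarity=None):
        self.similarity = similarity or ClassicTfIdf()
```

A new weighting scheme would then be just another `Similarity` subclass handed to the `Searcher`, without touching the rest of the search path.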
Re: GSoC
This also gives other contributors a chance to help and get interested in your work!!

I really would love to contribute to this project! Regards, Fernando.

From: Simon Willnauer simon.willna...@googlemail.com
To: dev@lucene.apache.org
CC: David Nemeskey nemeskey.da...@sztaki.hu
Sent: Tuesday, 22 February 2011 11:22:57
Subject: Re: GSoC

I think that is good for now. I should get started on codeawards and wrap up our proposals. I hope I can do that this week. simon
Re: GSoC
Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously). According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline there's still some time until the application period. So let me use this week to finish my PhD research plan and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc., so I guess we will have a lot to talk about. :)

(Actually, should we move this discussion private?)

David

Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... but we really need all the help we can get, especially from people who have a really good understanding of the various models. In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc.): http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible https://issues.apache.org/jira/browse/LUCENE-2392
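The Okapi BM25 weighting mentioned in this thread as one of the ranking methods to add can be written down directly. The sketch below uses one common parameterization (k1 = 1.2, b = 0.75) and is illustrative only — a standalone formula, not Lucene's eventual implementation:

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single query term to a document's score.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: this document's length vs. the collection average
    num_docs: total number of documents in the index
    k1, b: the usual BM25 free parameters (common default values shown)
    """
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Saturating, length-normalized term-frequency component.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

The key behaviors a ranking test suite would check: the score grows (but saturates) with term frequency, rarer terms score higher than common ones, and longer-than-average documents are penalized via the `b`-controlled length normalization.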
Re: GSoC
Hey David, I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously).

Same here!

According to http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline there's still some time until the application period. So let me use this week to finish my PhD research plan and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc., so I guess we will have a lot to talk about. :)

So from a high-level view it works like this: 1. Write up a short proposal of what your idea is about. 2. Make it public, and publish an implementation plan - how you want to realize your proposal. If you don't follow it 100% in the actual implementation, don't worry; it's just meant to give us an idea that you know what you are doing and where you want to go - something like a rough one-page (A4) design doc. 3. Give other people the chance to apply for the same suggestion (this is how it works, though). 4. Let the ASF / us assign one or more possible mentors to it. 5. Let us apply for a slot in GSoC (those are limited per organization). 6. Get accepted. 7. Rock it!

(Actually, should we move this discussion private?)

No - we usually do everything in public, except for discussions within the PMC that are meant to be private for legal or similar reasons. Let's stick to the mailing list for all communication unless you have something that clearly should not be public. This also gives other contributors a chance to help and get interested in your work!! simon

David
Re: GSoC
On Feb 2, 2011, at 4:10 AM, David Nemeskey wrote: Hi guys, Mark, Robert, Simon: thanks for the support! I really hope we can work together this summer (and before that, obviously). Sounds like a great idea. Looking forward to the proposal. According to http://www.google- melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's still some time until the application period. So let me use this week to finish my PhD research plan, and get back to you next week. I am not really familiar with how the program works, i.e. how detailed the application description should be, when mentorship is decided, etc. so I guess we will have a lot to talk about. :) It's pretty competitive, especially since you are not only competing against others for Lucene slots, but you are competing against other ASF projects. I highly recommend you, as well as interested mentors, look through Mahout's past GSOC projects: http://www.lucidimagination.com/search/?q=GSOC#/p:mahout and http://www.lucidimagination.com/search/document/2acd6fd380feec3/thoughts_on_gsoc and https://cwiki.apache.org/confluence/display/MAHOUT/GSOC (Actually, should we move this discussion private?) No, you shouldn't and it would be to your detriment come the ranking process since people won't have a track record of what you've done as it relates to your proposal. The goal of GSOC is to learn how Open Source works. Even though you have a mentor, that person is there to help you navigate the community, not to be a private tutor on technical details. I routinely tell all my students that I will help them w/ personal issues (vacation, emergencies, etc.) but that all technical stuff must be done on list (JIRA, IRC, dev@, patches, etc.) David Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... 
but we really need all the help we can get, especially from people who have a really good understanding of the various models. In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc): http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible https://issues.apache.org/jira/browse/LUCENE-2392 On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi all, I have already sent this mail to Simon Willnauer, and he suggested that I post it here for discussion. I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest, Hungary. I am doing IR-related research, and we have considered using Lucene as our search engine. We were quite satisfied with the speed and ease of use. However, we would like to experiment with different ranking algorithms, and this is where problems arise. Lucene only supports the VSM, and unfortunately the ranking architecture seems to be tailored specifically to its needs. I would be very much interested in revamping the ranking component as a GSoC project. The following modifications should be doable in the allocated time frame: - a new ranking class hierarchy, which is generic enough to allow easy implementation of new weighting schemes (at least bag-of-words ones), - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity and DFR models, - configuration for ranking selection, with the old method as default. I believe all users of Lucene would profit from such a project. It would provide the scientific community with an even more useful research aid, while regular users could benefit from superior ranking results. Please let me know your opinion about this proposal.
-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search
Re: GSoC
+1 the proposal. We already have a committer digging into this area - he would make a perfect GSoC mentor! And would likely love the help. His response likely to follow... - Mark On Jan 28, 2011, at 11:32 AM, David Nemeskey wrote: Hi all, I have already sent this mail to Simon Willnauer, and he suggested that I post it here for discussion. I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest, Hungary. I am doing IR-related research, and we have considered using Lucene as our search engine. We were quite satisfied with the speed and ease of use. However, we would like to experiment with different ranking algorithms, and this is where problems arise. Lucene only supports the VSM, and unfortunately the ranking architecture seems to be tailored specifically to its needs. I would be very much interested in revamping the ranking component as a GSoC project. The following modifications should be doable in the allocated time frame: - a new ranking class hierarchy, which is generic enough to allow easy implementation of new weighting schemes (at least bag-of-words ones), - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity and DFR models, - configuration for ranking selection, with the old method as default. I believe all users of Lucene would profit from such a project. It would provide the scientific community with an even more useful research aid, while regular users could benefit from superior ranking results. Please let me know your opinion about this proposal. Thank you very much, David Nemeskey - Mark Miller lucidimagination.com
Re: GSoC
On Fri, Jan 28, 2011 at 5:42 PM, Mark Miller markrmil...@gmail.com wrote: +1 the proposal. We already have a committer digging into this area - he would make a perfect GSoC mentor! And would likely love the help. same here +1 - if there is mentoring needed I will be there too. Robert I recommend you already when David contacted me in the first place :) it's all yours :) simon His response likely to follow... - Mark On Jan 28, 2011, at 11:32 AM, David Nemeskey wrote: Hi all, I have already sent this mail to Simon Willnauer, and he suggested me to post it here for discussion. I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest, Hungary. I am doing an IR-related research, and we have considered using Lucene as our search engine. We were quite satisfied with the speed and ease of use. However, we would like to experiment with different ranking algorithms, and this is where problems arise. Lucene only supports the VSM, and unfortunately the ranking architecture seems to be tailored specifically to its needs. I would be very much interested in revamping the ranking component as a GSoC project. The following modifications should be doable in the allocated time frame: - a new ranking class hierarchy, which is generic enough to allow easy implementation of new weighting schemes (at least bag-of-words ones), - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity and DFR models, - configuration for ranking selection, with the old method as default. I believe all users of Lucene would profit from such a project. It would provide the scientific community with an even more useful research aid, while regular users could benefit from superior ranking results. Please let me know your opinion about this proposal. 
Thank you very much, David Nemeskey - Mark Miller lucidimagination.com
Re: GSoC
On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi all, I have already sent this mail to Simon Willnauer, and he suggested me to post it here for discussion. I am David Nemeskey, a PhD student at the Eotvos Lorand University, Budapest, Hungary. I am doing an IR-related research, and we have considered using Lucene as our search engine. We were quite satisfied with the speed and ease of use. However, we would like to experiment with different ranking algorithms, and this is where problems arise. Lucene only supports the VSM, and unfortunately the ranking architecture seems to be tailored specifically to its needs. I would be very much interested in revamping the ranking component as a GSoC project. The following modifications should be doable in the allocated time frame: - a new ranking class hierarchy, which is generic enough to allow easy implementation of new weighting schemes (at least bag-of-words ones), - addition of state-of-the-art ranking methods, such as Okapi BM25, proximity and DFR models, - configuration for ranking selection, with the old method as default. I believe all users of Lucene would profit from such a project. It would provide the scientific community with an even more useful research aid, while regular users could benefit from superior ranking results. Please let me know your opinion about this proposal. Hi David, honestly this sounds fantastic. It would be great to have someone to work with us on this issue! To date, progress is pretty slow-going (minor improvements, cleanups, additional stats here and there)... but we really need all the help we can get, especially from people who have a really good understanding of the various models. 
In case you are interested, here are some references to discussions about adding more flexibility (with some prototypes etc): http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps_towards_making_lucene_s_scoring_more_flexible https://issues.apache.org/jira/browse/LUCENE-2392
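The pluggable scoring David proposes can be illustrated with a toy sketch. All class and parameter names below are hypothetical (this is not Lucene's actual Similarity API): two weighting schemes, VSM-style tf-idf and Okapi BM25, computed from the same per-term statistics, so the model can be swapped without touching the rest of the engine.

```python
import math

# Toy sketch of a pluggable ranking hierarchy (hypothetical names, NOT
# Lucene's real API): each model scores one term occurrence from the
# same raw statistics, so models are interchangeable at query time.

class Similarity:
    def score(self, tf, df, doc_len, num_docs, avg_doc_len):
        raise NotImplementedError

class ClassicTfIdf(Similarity):
    """VSM-style tf-idf weighting."""
    def score(self, tf, df, doc_len, num_docs, avg_doc_len):
        idf = math.log(num_docs / (1 + df)) + 1.0
        return math.sqrt(tf) * idf

class BM25(Similarity):
    """Okapi BM25 with the usual k1 and b free parameters."""
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
    def score(self, tf, df, doc_len, num_docs, avg_doc_len):
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        norm = self.k1 * (1 - self.b + self.b * doc_len / avg_doc_len)
        return idf * tf * (self.k1 + 1) / (tf + norm)

# Same statistics, two interchangeable models -- the point of the proposal.
stats = dict(tf=3, df=10, doc_len=120, num_docs=1000, avg_doc_len=100.0)
for sim in (ClassicTfIdf(), BM25()):
    print(type(sim).__name__, round(sim.score(**stats), 3))
```

The "configuration for ranking selection" item in the proposal then amounts to choosing which `Similarity` instance the query evaluator holds.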
Re: [GSOC] Congrats to all students
Thanks guys! So happy to get it, and really excited that Mahout got 5 slots. @Robin: I'm totally up for a shared blog, was planning on blogging about it anyway. Robin Anil wrote: Congrats everyone. And a special thanks to Benson for helping us get the slots to 5 this year :) For students who do not get accepted into Google Summer of Code but are still ready to work on your proposal: the ASF has a formalized process by which you can work on it if you get a willing mentor from the community. It will be a great learning experience and you will get a certification on successful completion from Apache. Do take a look. Also, it's open to everyone, not just students. http://community.apache.org/mentoringprogramme.html @FamousFive (The selected students :P) Would you guys be interested in keeping track of your experiences via a shared blog? I am thinking of setting up one for Mahout along with the website change. Robin Congrats again.
Re: [GSOC] Congrats to all students
Thanks everyone! I am so excited to be accepted and I will do my best to finish my proposal in time. A shared blog sounds great to me. GSoC feels like a training program; we are supposed to share the experience with everyone interested in the Mahout project. Cheers, Zhendong On Tue, Apr 27, 2010 at 3:22 PM, Robin Anil robin.a...@gmail.com wrote: Congrats everyone. And a special thanks to Benson for helping us get the slots to 5 this year :) For students that do not get accepted into Google Summer of Code and still ready to work on your proposal. ASF has a formalized process by which you can work on it if you get a willing mentor from the community. It will be a great learning experience and you will get a certification on successful completion from Apache. Do take a look. Also its open for everyone not just for students. http://community.apache.org/mentoringprogramme.html @FamousFive(The selected students :P) Would you guys be interested in keeping track of your experiences via a shared blog. I am thinking of setting up one for Mahout along with the website change. Robin Congrats again. -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore
Re: [GSOC] Congrats to all students
+1 for shared blog!
Re: [GSOC] Congrats to all students
Thanks. It's great to finally have the chance to be a part of Apache Mahout. Congratulations to everyone who got selected! +1 for the shared blog idea! On Tue, Apr 27, 2010 at 12:52 PM, Robin Anil robin.a...@gmail.com wrote: Congrats everyone.And a special thanks to Benson for helping us get the slots to 5 this year :) For students that do not get accepted into Google Summer of Code and still ready to work on your proposal. ASF has a formalized process by which you can work on it if you get a willing mentor from the community. It will be a great learning experience and you will get a certification on successful completion from Apache. Do take a look. Also its open for everyone not just for students. http://community.apache.org/mentoringprogramme.html @FamousFive(The selected students :P) Would you guys be interested in keeping track of your experiences via a shared blog. I am thinking of setting up one for Mahout along with the website change. Robin Congrats again. -- Zaid Md. Abdul Wahab Sheikh Senior Undergraduate B.Tech Computer Science and Engineering NIT Allahabad (MNNIT)
Re: [GSOC] Congrats to all students
Thanks everyone! This is a fantastic opportunity, and I'll try to make the best of this for myself, as well as Mahout. Hopefully, we'll have a great compilation of deep learning networks within the next few releases. BTW, congrats to everyone on Mahout becoming a TLP! On Tue, Apr 27, 2010 at 1:13 AM, Grant Ingersoll gsing...@apache.org wrote: Looks like student GSOC announcements are up ( http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010). Mahout got quite a few projects (5) accepted this year, which is a true credit to the ASF, Mahout, the mentors, and most of all the students! We had a good number of very high quality student proposals for Mahout this year and it was very difficult to choose. Of the ones selected, I think they all bode well for the future of Mahout and the students. For those who didn't make the cut, I know it's small consolation, but I would encourage you all to stay involved in open source, if not Mahout specifically. We'd certainly love to see you contributing here as many of you had very good ideas. At any rate, for everyone, keep an eye out on the Mahout project, as you should be seeing lots of exciting features coming to Mahout soon in the form of scalable Neural Networks, Restricted Boltzmann Machines (recommenders), SVD-based recommenders, EigenCuts Spectral Clustering and Support Vector Machines (SVM)! Should be an exciting summer! -Grant -- SK
Re: [GSOC] 2010 Timelines
Timeline including Apache internal deadlines: http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC Mentors, please also follow the link to the ranking explanation [1] for more information on how to rank student proposals. Isabel [1] http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process
Re: [GSOC] Wiki Page Added
Hi Grant, Could you please give us the link of this page? Cheers, Zhendong On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.org wrote: I created a Wiki page on GSOC. I hope everyone considering GSOC reads it. Mentors, please add as you see fit. Would be good to get a Mahout FAQ going, too. Perhaps Robin, Deneche and David would consider adding their past year proposals up there as examples, too. Cheers, Grant -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore
Re: [GSOC] Wiki Page Added
D'oh! My bad: http://cwiki.apache.org/MAHOUT/gsoc.html. It's linked from the front wiki page under community. -Grant On Mar 31, 2010, at 9:11 AM, zhao zhendong wrote: Hi Grant, Could you please give us the link of this page? Cheers, Zhendong On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.orgwrote: I created a Wiki page on GSOC. I hope everyone considering GSOC reads it. Mentors, please add as you see fit. Would be good to get a Mahout FAQ going to. Perhaps, Robin, Deneche and David would consider adding their past year proposals up there as examples, too. Cheers, Grant -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore
Re: [GSOC] Wiki Page Added
Ha, thanks. On Wed, Mar 31, 2010 at 9:29 PM, Grant Ingersoll gsing...@apache.orgwrote: D'oh! My bad: http://cwiki.apache.org/MAHOUT/gsoc.html. It's linked from the front wiki page under community. -Grant On Mar 31, 2010, at 9:11 AM, zhao zhendong wrote: Hi Grant, Could you please give us the link of this page? Cheers, Zhendong On Wed, Mar 31, 2010 at 8:53 PM, Grant Ingersoll gsing...@apache.org wrote: I created a Wiki page on GSOC. I hope everyone considering GSOC reads it. Mentors, please add as you see fit. Would be good to get a Mahout FAQ going to. Perhaps, Robin, Deneche and David would consider adding their past year proposals up there as examples, too. Cheers, Grant -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore
Re: GSOC 2010
Hi Tanya, MAHOUT-328 is just a general stub. There is no detailed project description other than what is given there. The idea is we let you propose to implement a clustering algorithm in Mahout. Start here http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the Wiki. Look at what Mahout has at the moment http://cwiki.apache.org/MAHOUT/algorithms.html. There are a couple of algorithms missing from Mahout, like min-hash or hierarchical clustering or even a generic EM framework. I would suggest you read carefully through the discussions on the mailing list using the archives, then zero in on the algorithm you would want to implement and propose to implement it. Robin On Wed, Mar 31, 2010 at 10:27 PM, Tanya Gupta gtany...@gmail.com wrote: Hi, I would like a detailed project description for MAHOUT-328. Thanking You Tanya Gupta
Re: GSOC 2010 is here
On Mon Robin Anil robin.a...@gmail.com wrote: 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are taking in GSOC students) I guess one could easily split this one in two: a) Using UIMA (whole pipeline or just the analysers if that is possible) for data pre-processing before Mahout algorithms are run. b) Making it easy to integrate Mahout algorithms (classification models etc.) as UIMA annotators. Isabel
Re: GSOC 2010 is here
On Wed, Robin Anil robin.a...@gmail.com wrote: Greetings! Fellow GSOC alums, administrators and dear mentors, the next edition is right here. Details are given in the link below. https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Some additional notes to committers: First of all, mentoring a GSoC student is a great experience, so if you do have some cycles left, I would highly recommend participating in GSoC as a mentor (thanks Grant for convincing me last year...). We had several successful students here at Mahout in past GSoC years. Each year there were strong proposals for projects within Mahout. As a result, projects usually turn out to be interesting for both mentor and student. One final note: If there is anyone on this list who might be interested in helping with general ASF GSoC logistics and administration tasks, please have a look at the newly founded community development project (d...@community.apache.org) Maybe we could identify key areas in Mahout which we need to develop apart from the ML implementations and list them for students to see before they start trickling in. And motivate students to come up with their own ideas and discuss them on-list before submitting their proposals. Some ideas: Benchmarking Framework with EC2 wrappers +1 I would love to see that. Commandline Console+Launcher like HBase and Hadoop +1 Online Tool/Query UI for Algorithms in Mahout (like CF) Possible ideas (I have no idea what I am talking about here, but there are nice problems to solve): Improvements in Math? How to tackle management of datasets? Error recovery if a job fails? How to tackle management of learned classification models? Better tooling for Mahout integration? (Lucene for tokenization and analysers?, data import and export?) Isabel
Re: GSOC 2010 is here
Some more Wild and Wacky Ideas. Might be out of scope for GSOC, but are nice to have features for mahout. I would like to encourage all of you to put down your ideas here. 1. Data Visualization tool backed with HDFS/Hbase for inspecting clusters, Topic model etc etc - It could have many map/reduce jobs which transform the clustering output, aggregates things and produce interesting stats or visualization of data 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are taking in GSOC students) Robin On Mon, Feb 1, 2010 at 6:17 PM, Isabel Drost isa...@apache.org wrote: On Wed Robin Anil robin.a...@gmail.com wrote: Greetings! Fellow GSOC alums, administrators and dear mentors, the next edition is right here. Details are given in the link below. https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Some additional notes to committers: First of all mentoring a GSoC student is a great experience, so if you do have some cycles left, I would highly recommend participating in GSoC as a mentor (thanks Grant for convincing myself last year...). We had several successful students here at Mahout in past GSoC years. Each year there were strong proposals for projects within Mahout. As a results projects usually turn out to be interesting for both, mentor and student. One final note: If there is anyone on this list who might be interested in helping with general ASF GSoC logistics and administration tasks, please have a look at the newly founded community development project (d...@community.apache.org) Maybe we could identify key areas in Mahout which we need to develop apart from the ML implementations and list it down for students to see before they start trickling in. And motivate students to come up with their own ideas and discuss them on-list before submitting their submission. Some ideas: Benchmarking Framework with EC2 wrappers +1 I would love to see that. 
Commandline Console+Launcher like HBase and Hadoop +1 Online Tool/Query UI for Algorithms in Mahout (like CF) Possible ideas (I have no idea what I am talking about here, but there are nice problems to solve): Improvements in Math? How to tackle management of datasets? Error recovery if a job fails? How to tackle management of learned classification models? Better tooling for Mahout integration? (Lucene for tokenization and analysers?, data import and export?) Isabel
Re: [GSOC] Code Submissions
done. --- On Tue 8.9.09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: [GSOC] Code Submissions To: Mahout Dev List mahout-dev@lucene.apache.org Date: Tuesday 8 September 2009, 13:09 Hi Robin, David and Deneche, You will need to submit code samples. Please see http://groups.google.com/group/google-summer-of-code-announce/web/how-to-provide-google-with-sample-code -Grant
Re: [GSOC] July 6 is mid-term evaluations
I filled out one for Deneche. On Tue, Jul 7, 2009 at 9:32 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: The students' mid-term survey is available online. I'm posting this because I almost forgot it =P --- On Wed 17.6.09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: [GSOC] July 6 is mid-term evaluations To: mahout-dev@lucene.apache.org Date: Wednesday 17 June 2009, 15:54 Just a reminder to GSOC students that July 6 is mid-term evaluation. http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 http://www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)
Re: [GSOC] July 6 is mid-term evaluations
On Tuesday 07 July 2009 20:34:09 Ted Dunning wrote: I filled out one for Deneche. I submitted the one for Robin yesterday evening. Isabel -- Web: http://www.isabel-drost.de
Re: [GSOC] Thoughts about Random forests map-reduce implementation
Very similar, but I was talking about building trees on each split of the data (a la map-reduce split). That would give many small splits and would thus give very different results from bagging, because the splits would be small and contiguous rather than large and random. On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: build multiple trees for different portions of the data What's the difference with the basic bagging algorithm, which builds 'each tree' using a different portion (about 2/3) of the data ?
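The distinction Ted draws can be made concrete with a small sketch (illustrative only, not Mahout's forest-building code; the function names are made up). Bagging draws each tree's training set at random from the whole data set, as Deneche describes, while a map-reduce-style partitioning hands each tree one small, contiguous, disjoint slice.

```python
import random

# Contrast of the two sampling schemes discussed above (hypothetical
# helper names). bagging_samples: large, overlapping, random subsets of
# the full data. contiguous_splits: small, disjoint, contiguous slices,
# as a map-reduce input split would produce.

def bagging_samples(data, num_trees, fraction=2/3, seed=0):
    rng = random.Random(seed)
    n = int(len(data) * fraction)
    return [rng.sample(data, n) for _ in range(num_trees)]

def contiguous_splits(data, num_trees):
    size = len(data) // num_trees
    return [data[i * size:(i + 1) * size] for i in range(num_trees)]

data = list(range(90))
bags = bagging_samples(data, 3)      # each bag sees ~2/3 of everything
splits = contiguous_splits(data, 3)  # each split sees only its own third
print([len(b) for b in bags], [len(s) for s in splits])
```

A tree grown on a contiguous slice never sees the rest of the data at all, which is why Ted expects results quite different from bagging's large random samples.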
Re: [GSOC] GSOC Start time nearing
On Tuesday 12 May 2009 19:50:21 Grant Ingersoll wrote: http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline May 23. Hope all of our students and mentors are ready to go. I certainly am*. Isabel * Might be a bit distracted on that exact day though: It's my birthday ;)
Re: [GSOC] Accepted Students
It's also helpful to get yourself a Wiki account and a JIRA account if you don't already have them. Small patches to the existing docs/code can also help you figure out the process On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote: On Tuesday 21 April 2009 08:30:34 David Hall wrote: As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. These are certainly of interest. In addition you can checkout and have a look at the code. Try to get a rough idea of where your contribution would fit best. Please share your ideas with the community to get feedback early on.
Re: [GSOC] Accepted Students
Thanks everyone! -- David On Thu, Apr 23, 2009 at 12:53 PM, Grant Ingersoll gsing...@apache.org wrote: It's also helpful to get yourself a Wiki account and a JIRA account if you don't already have them. Small patches to the existing docs/code can also help you figure out the process On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote: On Tuesday 21 April 2009 08:30:34 David Hall wrote: As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. These are certainly of interest. In addition you can checkout and have a look at the code. Try to get a rough idea of where your contribution would fit best. Please share your ideas with the community to get feedback early on.
Re: [GSOC] Accepted Students
Hi David, Welcome to Mahout =) The How To Contribute Wiki page is a must-read; it gives you a quick overview of all you'll need to know when contributing to Mahout. In my own experience you'll also need to: * know how to build the latest version of Mahout: http://cwiki.apache.org/MAHOUT/buildingmahout.html although, depending on your project, you may skip the Taste Web part if you're not working with Taste. * know how to run an example in Hadoop, at least in pseudo-distributed mode: http://hadoop.apache.org/core/docs/current/quickstart.html --- On Tue 21.4.09, David Hall d...@cs.stanford.edu wrote: From: David Hall d...@cs.stanford.edu Subject: Re: [GSOC] Accepted Students To: mahout-dev@lucene.apache.org Date: Tuesday 21 April 2009, 8:30 On Mon, Apr 20, 2009 at 11:18 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Hi, =D I've been accepted. And I'll be working on Random Forests =P Given it's my second participation, I have one piece of advice: don't be shy to ask about anything related to your project on this list (starting from now); it's the fastest way to learn about Mahout. Who else has been accepted? I'm here. I'll be working on Latent Dirichlet Allocation. As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. Other suggestions? Either general, or more specific to my project? -- David - abdelhakim
Re: [GSOC] Accepted Students
Deneche / David / Robin, Congrats on getting selected for Mahout project. Have fun coding... Best regards, Joe. On Tue, Apr 21, 2009 at 7:53 AM, Robin Anil robin.a...@gmail.com wrote: Hi, Seems Like I am the last one to know :) Hoping for a great Summer of Code ahead. Robin PS: Trying hard to survive a heatwave of 45C http://www.iitkgp.ac.in/topfiles/wgraph.php On Tue, Apr 21, 2009 at 1:51 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Hi David, Welcome into Mahout =) The How To Contribute Wiki page is a must read, it gives you a quick overview about all you'll need to when contributing to Mahout. In my own experience you'll also need to: * know how to build the latest version of Mahout: http://cwiki.apache.org/MAHOUT/buildingmahout.html although, depending on your project you may skip the Taste Web part if you're not working with Taste. * know how to run an example in Hadoop, at least in pseudo-distributed: http://hadoop.apache.org/core/docs/current/quickstart.html --- En date de : Mar 21.4.09, David Hall d...@cs.stanford.edu a écrit : De: David Hall d...@cs.stanford.edu Objet: Re: [GSOC] Accepted Students À: mahout-dev@lucene.apache.org Date: Mardi 21 Avril 2009, 8h30 On Mon, Apr 20, 2009 at 11:18 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Hi, =D I've been accepted. And I'll be working on Random Forests =P Given it's my second participation, I have one advise : don't be shy to ask about anything related to your project on this list (starting from now), its the fastest way to learn about Mahout. Who else has been accepted ? I'm here. I'll be working on Latent Dirichlet Allocation. As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. Other suggestions? Either general, or more specific to my project? -- David - abdelhakim
Re: [GSOC] Accepted Students
Hi, Seems Like I am the last one to know :) Hoping for a great Summer of Code ahead. Robin PS: Trying hard to survive a heatwave of 45C http://www.iitkgp.ac.in/topfiles/wgraph.php On Tue, Apr 21, 2009 at 1:51 PM, deneche abdelhakim a_dene...@yahoo.frwrote: Hi David, Welcome into Mahout =) The How To Contribute Wiki page is a must read, it gives you a quick overview about all you'll need to when contributing to Mahout. In my own experience you'll also need to: * know how to build the latest version of Mahout: http://cwiki.apache.org/MAHOUT/buildingmahout.html although, depending on your project you may skip the Taste Web part if you're not working with Taste. * know how to run an example in Hadoop, at least in pseudo-distributed: http://hadoop.apache.org/core/docs/current/quickstart.html --- En date de : Mar 21.4.09, David Hall d...@cs.stanford.edu a écrit : De: David Hall d...@cs.stanford.edu Objet: Re: [GSOC] Accepted Students À: mahout-dev@lucene.apache.org Date: Mardi 21 Avril 2009, 8h30 On Mon, Apr 20, 2009 at 11:18 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Hi, =D I've been accepted. And I'll be working on Random Forests =P Given it's my second participation, I have one advise : don't be shy to ask about anything related to your project on this list (starting from now), its the fastest way to learn about Mahout. Who else has been accepted ? I'm here. I'll be working on Latent Dirichlet Allocation. As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. Other suggestions? Either general, or more specific to my project? -- David - abdelhakim
Re: [GSOC] Accepted Students
On Tuesday 21 April 2009 08:30:34 David Hall wrote: As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. These are certainly of interest. In addition you can checkout and have a look at the code. Try to get a rough idea of where your contribution would fit best. Please share your ideas with the community to get feedback early on. Isabel
Re: gsoc , EM or SVM?
Hi, I decided to go with the mixture model for EM. I have modified my proposal and submitted it both on the GSoC website and the Apache wiki. Best Regards Yifan 2009/4/1 Yifan Wang heavens...@gmail.com: I will choose Mixture Model for the EM implementation. Yifan 2009/4/1 Ted Dunning ted.dunn...@gmail.com: Yifan, EM is a highly non-specific term and covers a huge range of very different algorithms. For example, pLSI, HMM's, and mixture models can all be estimated using EM. What exactly did you mean to address with an EM implementation? On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote: Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distributed part as the project develops. ... For EM, it is a generalization of the k-means algorithm, and we already have k-means in the Mahout library.
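The k-means connection Grant mentions can be illustrated with a minimal EM loop for a 1-D mixture of two unit-variance Gaussians with equal priors (a sketch under those simplifying assumptions, not Mahout code; the function name is made up). The E-step computes soft responsibilities where k-means would make hard assignments, and the M-step re-estimates each mean as a responsibility-weighted average.

```python
import math, random

# Minimal EM for a two-component 1-D Gaussian mixture (unit variance,
# equal priors). With hard 0/1 responsibilities this loop would be
# exactly Lloyd's k-means with k = 2.

def em_two_means(xs, init, iters=50):
    mu1, mu2 = init
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in xs:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: weighted means (k-means would use hard assignments).
        w = sum(r)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / w
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - w)
    return mu1, mu2

random.seed(0)
xs = ([random.gauss(0, 1) for _ in range(200)]
      + [random.gauss(5, 1) for _ in range(200)])
print(em_two_means(xs, (1.0, 4.0)))  # recovers means near 0 and 5
```

A full mixture-model implementation would also re-estimate the mixing weights and variances in the M-step; they are held fixed here to keep the k-means analogy visible.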
Re: [gsoc] Collaborative filtering algorithms
I would hope that your SVD implementation would not be limited to Netflix-like problems, but would be applicable to any reasonably sparse matrix-like data. Likewise, I would expect a good SVD implementation to be useful for nearest neighbor methods or direct prediction by smoothing the history vector. On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni atulskulka...@gmail.com wrote: I have worked with the Netflix Prize problem and hence most of my suggested algorithms revolve around that problem. But I am open to other algorithms that might be out there. Is this a good thing to do? -- Ted Dunning, CTO DeepDyve
Re: [gsoc] Collaborative filtering algorithms
On Wed, Apr 1, 2009 at 1:30 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would hope that your SVD implementation would not be limited to Netflix-like problems, but would be applicable to any reasonably sparse matrix-like data. Yes, of course, it would apply to any large sparse matrix implementation. Likewise, I would expect a good SVD implementation to be useful for nearest neighbor methods or direct prediction by smoothing the history vector. I do not have knowledge about this as of now; I will read up and comment. On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni atulskulka...@gmail.com wrote: I have worked with the Netflix Prize problem and hence most of my suggested algorithms revolve around that problem. But I am open to other algorithms that might be out there. Is this a good thing to do? -- Ted Dunning, CTO DeepDyve -- Regards, Atul Kulkarni Teaching Assistant, Department of Computer Science, University of Minnesota Duluth Duluth. 55805. www.d.umn.edu/~kulka053
Re: [gsoc] Collaborative filtering algorithms
Thanks David, that helped. On Wed, Apr 1, 2009 at 1:47 AM, David Hall d...@cs.stanford.edu wrote: On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni atulskulka...@gmail.com wrote: questions inline. On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning ted.dunn...@gmail.com wrote: Nobody is working on SVD yet, but one GSOC applicant has said that they would like to work on LDA, which is a probabilistic relative of SVD. I do not understand the relation between LDA and SVD. In my limited understanding, LDA transforms data points into a coordinate system where they can be easily discriminated/classified. SVD, on the other hand, is used for dimension reduction; can you help me bridge the gap by providing something to read? LDA is an overloaded term. To the frequentist, it usually means Linear Discriminant Analysis, which is what you're talking about; to the bayesian machine learning people, it means Latent Dirichlet Allocation, which is a probabilistic dimensionality reduction technique for projecting documents in V-dimensional space to the K-simplex, with K \ll V. -- David The approach in your reference (3) is highly amenable to parallel implementation. Yes, I felt so too, but again did not want to comment on it until I had the MapReduce basics related to it. Large-scale SVD would be a very interesting application for Mahout. On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni atulskulka...@gmail.com wrote: Is there anyone doing the SVD part, or are there any SVD algorithm implementations on Hadoop? If there are, then I would like to implement the methods described in [1],[2],[3] for matrix factorization, specifically. -- Ted Dunning, CTO DeepDyve -- Regards, Atul Kulkarni Teaching Assistant, Department of Computer Science, University of Minnesota Duluth Duluth. 55805. www.d.umn.edu/~kulka053 -- Regards, Atul Kulkarni Teaching Assistant, Department of Computer Science, University of Minnesota Duluth Duluth. 55805. www.d.umn.edu/~kulka053
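The dimension-reduction role of SVD that Atul asks about is easy to illustrate outside of Mahout. This is a minimal numpy sketch (not the proposed Hadoop implementation, and the toy matrix is invented for illustration) of a rank-k least-squares approximation of a sparse, ratings-style matrix:

```python
import numpy as np

# Toy "users x items" matrix: mostly zeros, like Netflix-style rating data.
A = np.zeros((6, 5))
A[0, 0], A[0, 1], A[1, 0] = 5, 3, 4
A[2, 3], A[3, 3], A[3, 4] = 4, 5, 2
A[4, 1], A[5, 2] = 1, 3

# Keep only the k largest singular values: the least-squares-optimal rank-k
# approximation, i.e. SVD used as dimension reduction.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error of the rank-k approximation is strictly smaller
# than the norm of A itself (A has rank > k here).
err = np.linalg.norm(A - A_k)
print(err < np.linalg.norm(A))
```

The same truncation is what a distributed SVD would compute at scale; only the factorization step changes, not the way the factors are used.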
Re: [GSOC] Ranking Process
I'm preparing an application, but haven't submitted yet as I was waiting on confirmation of my student status... as I now know that I'm going to be eligible I'll get my application in soon :) 2009/4/1 Ted Dunning ted.dunn...@gmail.com: I only see two applications for Mahout, one reasonably strong, one much less so. Are there students out there who still need to prepare an application? The deadline is coming up fast. 2009/3/31 Grant Ingersoll gsing...@apache.org FYI: http://wiki.apache.org/general/RankingProcess -Grant -- Ted Dunning, CTO DeepDyve
Re: [GSOC] Ranking Process
Hmm, I see several in there, but they aren't all labeled w/ Mahout, so that may be why. I also expanded to see 100 at a time. -Grant On Mar 31, 2009, at 8:43 PM, Ted Dunning wrote: I only see two applications for Mahout, one reasonably strong, one much less so. Are there students out there who still need to prepare an application? The deadline is coming up fast. 2009/3/31 Grant Ingersoll gsing...@apache.org FYI: http://wiki.apache.org/general/RankingProcess -Grant -- Ted Dunning, CTO DeepDyve
Re: [GSOC] Ranking Process
The other thing to note, here, is that people should be aware that the ASF is only going to get a certain number of slots from Google (last year, it was somewhere in the 30-40 range, I think), which are distributed across all projects that have expressed an interest in mentoring. While Mahout has 4 interested mentors, that does not mean Mahout will get 4 projects. At any rate, best of luck to everyone. If you don't get picked, we still welcome your contributions! Remember, open source is an excellent resume builder. Cheers, Grant On Mar 31, 2009, at 4:43 PM, Grant Ingersoll wrote: FYI: http://wiki.apache.org/general/RankingProcess -Grant
Re: [gsoc] Collaborative filtering algorithms
The machinery of SVD is almost always described in terms of least squares matrix approximation without mentioning the probabilistic underpinnings of why least-squares is a good idea. The connection, however, goes all the way back to Gauss' reduction of planetary position observations (this is *why* the normal distribution is often called a Gaussian). Gauss provided such a compelling rationale for both the normal distribution (what I called a Gaussian below) and the resulting least squared error formulation of the estimation problem that everybody has just assumed that least-squared-error estimation is the way to go. Generally this is a pretty good approximation. Occasionally it is not at all good. One place where it is a really bad approximation is with very sparse count data. Netflix data is a great example, text represented as word counts per document is another. To fill in more detail, here is a relatively jargon-filled explanation of the connection. I apologize for not being able to express this more lucidly. A more general view of both SVD and LDA are that they find probabilistic mixture models to describe data. SVD finds a single mixture of Gaussian distributions that all have the same variance and uses maximum likelihood to find this mixture. LDA finds a multi-level mixture of multinomial models and gives you a distribution of models that represents the distribution of possible models given your data and explicit assumptions. Gaussian distributions and multinomials look quite different, but for relatively large observed counts their log-likelihood functions become very similar. For Gaussians, the log-likelihood is just the sum of squared deviations from the mean. For large counts, the log-likelihood for multinomials approximates squared deviations from the mean. On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni atulskulka...@gmail.comwrote: I do not understand the relation in LDA and SVD. 
In my limited understanding, LDA transforms data points into a coordinate system where they can be easily discriminated/classified. SVD, on the other hand, is used for dimension reduction; can you help me bridge the gap by providing something to read? -- Ted Dunning, CTO DeepDyve
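Ted's point that Gaussian maximum likelihood and least-squares estimation coincide can be checked numerically. A small sketch of the standard identity (illustrative only, not from the thread's references):

```python
import numpy as np

# For fixed sigma, the Gaussian negative log-likelihood of a sample under a
# candidate mean mu is, up to an additive constant, proportional to the sum
# of squared deviations -- so the ML mean equals the least-squares mean.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=1000)

def neg_log_lik(mu, sigma=1.0):
    return 0.5 * np.sum((x - mu) ** 2) / sigma ** 2

# Brute-force minimize over a fine grid; the winner matches the sample mean.
grid = np.linspace(0.0, 6.0, 601)
mu_ml = grid[np.argmin([neg_log_lik(m) for m in grid])]
print(abs(mu_ml - x.mean()) < 0.01)
```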
Re: [GSOC] Ranking Process
Let me second that. When I am hiring a student without professional experience, it is almost a perfect predictor that if they have done significant work on a significant outside project they will get an interview with me and if not, they won't. Moreover, if I have a candidate at any level who has made significant contributions to a major open source project, I generally don't even drill much more on code hygiene issues. The standards in most open source projects regarding testing and continuous integration are high enough that I don't have to worry about whether the applicant understands how to code and how to code with others. On the other hand, the only use I make of the list of buzzwords generally found under skills on a resume is that I start at the end of the list and ask a question about that area's fundamentals to see if the student is padding their list. When interviewing with me don't ever put anything on your resume that you don't really know. I don't know how widespread my attitude is, but I can't believe I am alone in this. On Wed, Apr 1, 2009 at 3:42 AM, Grant Ingersoll gsing...@apache.org wrote: Remember, open source is an excellent resume builder. -- Ted Dunning, CTO DeepDyve
Re: gsoc , EM or SVM?
Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distributed part as the project develops. -Grant On Mar 31, 2009, at 2:48 AM, Yifan Wang wrote: Hi, My name is Yifan. I submitted a proposal for GSoC this year. I am interested in classification and clustering algorithms, because I need one such algorithm for an experimental project I started myself for text classification and clustering. In my proposal, I planned to implement two machine learning algorithms: EM and SVM. But it seems a bit much to implement two algorithms in GSoC, so now I need to choose one of the two. For EM, it is a generalization of the k-means algorithm, and we already have k-means in the Mahout library. For SVM, it is quite an important algorithm for classification, while implementing it can be hard. So any suggestions on which one would be of most benefit to the Mahout library and may be a good candidate for GSoC? Best Regards Yifan -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: gsoc , EM or SVM?
Yifan, EM is a highly non-specific term and covers a huge range of very different algorithms. For example, pLSI, HMM's, and mixture models can all be estimated using EM. What exactly did you mean to address with an EM implementation? On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote: Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distributed part as the project develops. ... For EM, it is a generalization of the k-means algorithm, and we already have k-means in the Mahout library.
Re: gsoc , EM or SVM?
I will choose Mixture Model for the EM implementation. Yifan 2009/4/1 Ted Dunning ted.dunn...@gmail.com: Yifan, EM is a highly non-specific term and covers a huge range of very different algorithms. For example, pLSI, HMM's, and mixture models can all be estimated using EM. What exactly did you mean to address with an EM implementation? On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote: Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distributed part as the project develops. ... For EM, it is a generalization of the k-means algorithm, and we already have k-means in the Mahout library.
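The "EM generalizes k-means" remark can be made concrete. Below is a minimal single-machine sketch of EM for a 1-D two-component Gaussian mixture (illustrative only, not the proposed Mahout code); replacing the soft responsibilities in the E-step with hard 0/1 assignments recovers k-means:

```python
import numpy as np

# Synthetic data: two well-separated Gaussian clusters, unit variance.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])   # initial means
pi = np.array([0.5, 0.5])    # mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (soft assignment; k-means would take an argmax here instead).
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate means and weights from the soft assignments.
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    pi = r.mean(axis=0)

print(np.sort(mu))  # means converge near the true cluster centers -2 and 3
```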
Re: [GSoC] SimRank Algorithms on Mahout Proposal draft from Xuan Yang
On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang sailingw...@gmail.com wrote: Hello everyone, This is my proposal draft. BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3 - robert
Re: [GSoC] SimRank Algorithms on Mahout Proposal draft from Xuan Yang
Thanks, I have submitted it there. :) 2009/4/2 Robert Burrell Donkin robertburrelldon...@gmail.com: On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang sailingw...@gmail.com wrote: Hello everyone, This is my proposal draft. BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3 - robert -- Xuan Yang
Re: [gsoc] random forests
Here is a draft of my proposal ** Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests Student: AbdelHakim Deneche Student e-mail: ... Student Major: PhD in Computer Science Student Degree: Master in Computer Science Student Graduation: Spring 2011 Organization: The Apache Software Foundation Assigned Mentor: Abstract: My goal is to add the power of random/regression forests to Mahout. At the end of this summer one should be able to build random/regression forests for large, possibly distributed, datasets, store the forest, and reuse it to classify new data. In addition, a demo on EC2 is planned. Detailed Description: This project is all about random/regression forests. The core component is the tree building algorithm from a random bootstrap from the whole dataset. I already wrote a detailed description on the Mahout Wiki [RandomForests]. Given the size of the dataset, two distributed implementations are possible: 1. The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Basically, each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is relatively easy because each mapper runs the basic building algorithm as it is. It is also of great interest if the user wants to try different parameters when building the forest. An out-of-core implementation is also possible to deal with datasets that cannot fit into the node memory. 2. The second implementation, which is the most difficult, is concerned with very large datasets that cannot fit on every machine of the cluster. In this case the mappers work differently: each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest. The core building algorithm must thus be rewritten in a map-reduce form.
This implementation can deal with datasets of any size, as long as they are on the cluster. Although the first implementation is easier to implement, the CPU and IO overhead of the out-of-core implementation are still unknown. A reference, non-parallel, implementation should thus be built to better understand the effects of the out-of-core implementation, especially for large datasets. This reference implementation is also useful to assess the correctness of the distributed implementation. Working Plan and list of deliverables Must-Have: 1. reference implementation of Random/Regression Forests Building Algorithm: . Build a forest of trees, the basic algorithm (described in the wiki) takes a subset from the dataset as a training set and builds a decision tree. This algorithm is repeated for each tree of the forest. . The forest is stored in a file, this way it can be re-used, at any time, to classify new cases . At this step, the necessary changes to Mahout's Classifier interface are made to extend its use to more than Text datasets. 2. Study the effects of large datasets on the reference implementation . This step should guide our choice of the proper parallel implementation 3. Parallel implementation, choose one of the following: 3a. Parallel implementation A . When the dataset can be replicated to all computing nodes. . Each mapper has access to the whole dataset, if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper is also responsible for computing the out-of-bag error estimation. . The reducer stores the trees in the RF file and merges the oob error estimations. 3b. Parallel implementation B: . When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster. . Each mapper has access to a subset from the dataset, thus all the mappers collaborate to build each tree of the forest. .
In this case, the basic algorithm must be rewritten to fit in the map-reduce paradigm. Should-Have: 4. Run the Random Forest with a real dataset on EC2: . This step is important, because running the RF on a local dual core machine is different from running it on a real cluster with a real dataset. . This can make a good demo for Mahout . Amazon has put some interesting datasets to play with [PublicDatasets]. The US Census dataset comes in various sizes ranging from 2GB to 200GB, and should make a very good example. . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix Reader). Wanna-Have: 5. If there is still time, implement one or two other important features of RFs such as Variable importance and Proximity estimation Additional Information: I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. For the purpose of my Master, I worked on
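The N/M split in parallel implementation A is just an even partition of the tree count across mappers. A toy sketch of that bookkeeping (the function name is illustrative, not part of the proposal or of Mahout):

```python
# Split N trees across M mappers as evenly as possible: each mapper builds
# either floor(N/M) or floor(N/M)+1 trees, and the counts sum back to N.
def trees_per_mapper(n_trees, n_mappers):
    base, extra = divmod(n_trees, n_mappers)
    return [base + (1 if i < extra else 0) for i in range(n_mappers)]

# 100 trees over 7 mappers: the first two mappers get one extra tree.
print(trees_per_mapper(100, 7))  # → [15, 15, 14, 14, 14, 14, 14]
```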
Re: [GSOC] Ranking Process
I only see two applications for Mahout, one reasonably strong, one much less so. Are there students out there who still need to prepare an application? The deadline is coming up fast. 2009/3/31 Grant Ingersoll gsing...@apache.org FYI: http://wiki.apache.org/general/RankingProcess -Grant -- Ted Dunning, CTO DeepDyve
Re: [gsoc] random forests
Deneche, I don't see your application on the GSOC web site. Nor on the apache wiki. Time is running out and I would hate to not see you in the program. Is it just that I can't see the application yet? On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim a_dene...@yahoo.fr wrote: Here is a draft of my proposal ** Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests Student: AbdelHakim Deneche Student e-mail: ... Student Major: PhD in Computer Science Student Degree: Master in Computer Science Student Graduation: Spring 2011 Organization: The Apache Software Foundation Assigned Mentor: Abstract: My goal is to add the power of random/regression forests to Mahout. At the end of this summer one should be able to build random/regression forests for large, possibly distributed, datasets, store the forest, and reuse it to classify new data. In addition, a demo on EC2 is planned. Detailed Description: This project is all about random/regression forests. The core component is the tree building algorithm from a random bootstrap from the whole dataset. I already wrote a detailed description on the Mahout Wiki [RandomForests]. Given the size of the dataset, two distributed implementations are possible: 1. The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Basically, each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is relatively easy because each mapper runs the basic building algorithm as it is. It is also of great interest if the user wants to try different parameters when building the forest. An out-of-core implementation is also possible to deal with datasets that cannot fit into the node memory. 2.
The second implementation, which is the most difficult, is concerned with very large datasets that cannot fit on every machine of the cluster. In this case the mappers work differently: each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest. The core building algorithm must thus be rewritten in a map-reduce form. This implementation can deal with datasets of any size, as long as they are on the cluster. Although the first implementation is easier to implement, the CPU and IO overhead of the out-of-core implementation are still unknown. A reference, non-parallel, implementation should thus be built to better understand the effects of the out-of-core implementation, especially for large datasets. This reference implementation is also useful to assess the correctness of the distributed implementation. Working Plan and list of deliverables Must-Have: 1. reference implementation of Random/Regression Forests Building Algorithm: . Build a forest of trees, the basic algorithm (described in the wiki) takes a subset from the dataset as a training set and builds a decision tree. This algorithm is repeated for each tree of the forest. . The forest is stored in a file, this way it can be re-used, at any time, to classify new cases . At this step, the necessary changes to Mahout's Classifier interface are made to extend its use to more than Text datasets. 2. Study the effects of large datasets on the reference implementation . This step should guide our choice of the proper parallel implementation 3. Parallel implementation, choose one of the following: 3a. Parallel implementation A . When the dataset can be replicated to all computing nodes. . Each mapper has access to the whole dataset, if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper is also responsible for computing the out-of-bag error estimation. .
The reducer stores the trees in the RF file and merges the oob error estimations. 3b. Parallel implementation B: . When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster. . Each mapper has access to a subset from the dataset, thus all the mappers collaborate to build each tree of the forest. . In this case, the basic algorithm must be rewritten to fit in the map-reduce paradigm. Should-Have: 4. Run the Random Forest with a real dataset on EC2: . This step is important, because running the RF on a local dual core machine is different from running it on a real cluster with a real dataset. . This can make a good demo for Mahout . Amazon has put some interesting datasets to play with [PublicDatasets]. The US Census dataset comes in various sizes ranging from 2GB to 200GB, and should make a very good example. . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix Reader). Wanna-Have: 5. If there is still time,
Re: [gsoc] random forests
Thank you for your answer, it just made me aware of many hidden-possible-future problems with my implementation. The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets. by out-of-core you mean the builder can fetch the data directly from a file instead of working from in-memory only (?) One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering. I was planning to distribute the dataset files to all workers using Hadoop's DistributedCache. I think that a streaming implementation is feasible, the basic tree building algorithm (described here http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream through the data (either in-memory or from a file) for each node of the tree. During this pass, it computes the information gain (IG) for the selected variables. This algorithm could be improved to compute the IG's for a list of nodes, thus reducing the total number of passes through the data. When building the forest, the list of nodes comes from all the trees built by the mapper. Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets. 
I'll have to run some tests before answering this question, but I think that the memory usage of the improved algorithm (described above) will mainly be needed to store the IG computations (variable probabilities...). One way to limit the memory usage is to limit the number of tree-nodes computed at each data pass. Increasing this limit should reduce the data passes but increase the memory usage, and vice versa. There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on the node's hard drive, and thus must be distributed across the cluster. abdelHakim --- On Mon 30.3.09, Ted Dunning ted.dunn...@gmail.com wrote: From: Ted Dunning ted.dunn...@gmail.com Subject: Re: [gsoc] random forests To: mahout-dev@lucene.apache.org Date: Monday 30 March 2009, 0:59 I have two answers for you. The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets. The second answer is that the odds that SOME mahout application will be too large for a single node are quite high. These aren't contradictory. They just describe the long-tail nature of problem sizes. One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially. Another way to put this is that the key question is how single node computation scales with input size.
If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets. On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: My question is: when Mahout.RF is used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose. -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 408-773-0110 ext. 738 858-414-0013 (m) 408-773-0220 (fax)
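The streaming pass deneche describes (one scan over the data accumulating the statistics needed for information gain at a node) can be sketched on a single machine. The record format and names below are hypothetical, not Mahout's:

```python
from collections import Counter
from math import log2

# Shannon entropy of a label-count table.
def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values() if c)

# One streaming pass over (attribute_value, label) records accumulates the
# overall and per-value label counts; information gain falls out at the end,
# so the records never need to sit in memory all at once.
def info_gain(records):
    total, by_value = Counter(), {}
    for attr, label in records:          # single pass over the data
        total[label] += 1
        by_value.setdefault(attr, Counter())[label] += 1
    n = sum(total.values())
    cond = sum(sum(c.values()) / n * entropy(c) for c in by_value.values())
    return entropy(total) - cond

# A perfectly separating attribute yields a gain of exactly 1 bit.
data = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
print(info_gain(data))  # → 1.0
```

Computing the gains for many candidate tree-nodes in the same pass, as deneche proposes, just means keeping one such counter table per node.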
Re: [gsoc] random forests
Indeed. And those datasets exist. It is also plausible that this full data scan approach will fail when you want the forest building to take less time. It is also plausible that a full data scan approach fails to improve enough on a non-parallel implementation. This would happen if a significantly large fraction of the entire forest could be built on a single node. That would happen if the CPU requirements for forest building are overshadowed by the I/O cost of scanning the data set. This would imply that there is a small limit to the amount of parallelism that would help. You will know much more about this after you finish the non-parallel implementation than either of us knows now. On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.frwrote: There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit in the node hard-drive, and thus must be distributed across the cluster. -- Ted Dunning, CTO DeepDyve
Re: [gsoc] random forests
I suggest that we all learn from the experience you are about to have on the reference implementation. And, yes, I did mean the reference implementation when I said non-parallel. Thanks for clarifying. On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.frwrote: What do you suggest ? And just to make sure, by 'non-paralel implementation' you mean the reference implementation, right ? -- Ted Dunning, CTO DeepDyve
Re: [gsoc] random forests
I have two answers for you. The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets. The second answer is that the odds that SOME mahout application will be too large for a single node are quite high. These aren't contradictory. They just describe the long-tail nature of problem sizes. One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially. Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets. On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.frwrote: My question is : when Mahout.RF will be used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster ? the answer to this question should help me decide which implementation I'll choose. -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 408-773-0110 ext. 738 858-414-0013 (m) 408-773-0220 (fax)
Re: [gsoc] random forests
In 2a, you should read: "This implementation is, relatively, easy given..." --- On Sat 28.3.09, deneche abdelhakim a_dene...@yahoo.fr wrote: From: deneche abdelhakim a_dene...@yahoo.fr Subject: Re: [gsoc] random forests To: mahout-dev@lucene.apache.org Date: Saturday 28 March 2009, 16:14 I'm actually writing my working plan, and it looks like this: * 1. reference implementation of Random/Regression Forests Building Algorithm: . Build a forest of trees, the basic algorithm (described in the wiki) takes a subset from the dataset as a training set and builds a decision tree. This basic algorithm is repeated for each tree of the forest. . The forest is stored in a file, this way it can be used later to classify new cases 2a. distributed Implementation A: . When the dataset can be replicated to all computing nodes. . Each mapper has access to the whole dataset, if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. . This implementation is, relatively, given that the reference implementation is available, because each mapper runs the basic building algorithm as it is. 2b. Distributed Implementation B: . When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster. . Each mapper has access to a subset from the dataset, thus all the mappers collaborate to build each tree of the forest. . In this case, the basic algorithm must be rewritten to fit in the map-reduce paradigm. 3. Run the Random Forest with a real dataset on EC2: . This step is important, because running the RF on a local dual core machine is way different from running it on a real cluster with a real dataset. . This can make for a good demo for Mahout 4.
If there is still time, implement one or two other important features of RFs such as variable importance and proximity estimation * It is clear from the plan that I won't be able to do all those steps, and in some way I must choose only one implementation (2a or 2b) to do. The first implementation should take less time to implement than 2b and I'm quite sure I can go up to the 4th step, adding other features to the RF. BUT the second implementation is the only one capable of dealing with very large distributed datasets. My question is: when Mahout's RF is used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose. --- On Sun 22.3.09, Ted Dunning ted.dunn...@gmail.com wrote: From: Ted Dunning ted.dunn...@gmail.com Subject: Re: [gsoc] random forests To: mahout-dev@lucene.apache.org Date: Sunday, March 22, 2009, 0:36 Great expression! You may be right about the nose-bleed tendency between the two methods. On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: I can't find a no-nose-bleeding algorithm -- Ted Dunning, CTO DeepDyve
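[Editor's note: implementation 2a above splits the N trees across M mappers, each holding the full replicated dataset. A toy sketch of that split, in illustrative Python rather than Mahout's actual Java/Hadoop code; `build_tree` here is a stand-in bootstrap majority-label learner, not a real decision tree builder:]

```python
# Sketch of "implementation 2a": the dataset is replicated to every mapper,
# and a forest of N trees is divided across M mappers, ~N/M trees each.
import random
from collections import Counter

def build_tree(dataset, rng):
    # Placeholder learner standing in for the reference tree builder:
    # bootstrap-sample the (features, label) rows, keep the majority label.
    sample = [rng.choice(dataset) for _ in dataset]
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

def mapper(mapper_id, dataset, n_trees, n_mappers, seed=0):
    # Each mapper sees the whole (replicated) dataset and builds its
    # interleaved share of the N trees: tree ids mapper_id, mapper_id+M, ...
    rng = random.Random(seed + mapper_id)
    return [build_tree(dataset, rng)
            for _ in range(mapper_id, n_trees, n_mappers)]

def build_forest(dataset, n_trees=9, n_mappers=3):
    forest = []
    for m in range(n_mappers):  # on Hadoop these M mappers run in parallel
        forest.extend(mapper(m, dataset, n_trees, n_mappers))
    return forest

data = [((0,), "a"), ((1,), "a"), ((2,), "b")]
forest = build_forest(data)                     # 9 trees, 3 per mapper
votes = Counter(tree((1,)) for tree in forest)  # the forest classifies by voting
```

Because each mapper runs the basic building algorithm unmodified, this is exactly why 2a is "relatively easy" once the reference implementation exists.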
Re: [GSoC] SimRank algorithms on Mahout
Graph ranking strategies are something I am very much interested in and would love to see in Mahout. Please do propose. -Grant On Mar 24, 2009, at 6:00 AM, Xuan Yang wrote: Hello everyone, I am a student from Fudan University, Shanghai, China. These days I am doing some research work on SimRank, which is a model measuring the similarity of objects. SimRank is applicable in any domain with object-to-object relationships, e.g., web pages with hyperlinks, papers and authors, customers and commodities, etc. Based on the simple assumption that two objects are similar if they are related to similar objects, SimRank is calculated recursively on a directed graph. You can find the algorithm here http://en.wikipedia.org/wiki/SimRank I found that the calculation of SimRank is well suited to the Hadoop framework. 1, the directed graph could be saved in the form of an edge list in HBase. And the result Sn(a,b) could also be saved in HBase as a matrix. 2, We can distribute all the n^2 pairs into the map nodes to calculate the SimRank value of the next iteration. 3, There is an optimization method for SimRank's calculation: We can let map nodes calculate the sum of Rk(Xi, V) = PSUMa(V), where Xi belongs to the set In(a) and V is an arbitrary node, then hand it to a reduce node. In the reduce node: if we want to calculate Rk+1(a, b), we only need to calculate the sum of PSUMa(Yj) in which Yj belongs to In(b); 4, besides, there are other optimization methods, such as thresholds, that could be used in map nodes and reduce nodes. Of course, there are some problems: 1, It is true that mapreduce could make the computation of each node easier. Yet if the volume of data is very large, the transport latency of the data will become more and more serious. So, methods to reduce IO would be very helpful. 2, SimRank computes the similarity between all the nodes. If we map a group of nodes {A, B, C} into one map node, and {D, E, F} into another map node.
The computation inside set {A, B, C} will be easy, and so will set {D, E, F}. But when we want to compute SimRank between A and D, it will not be very convenient. I think it would be great to solve these problems and implement a map-reduce version of the SimRank algorithm. I intend to implement this as my Summer of Code project. Would you be interested in this? And can I get some advice from you? Thanks a lot, Xuan Yang
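[Editor's note: the recursion described above — two objects are similar if their in-neighbors are similar — can be sketched as a naive single-machine iteration. The decay constant C = 0.8 and the iteration count are illustrative; the full n^2 pair table below is exactly what step 2 proposes distributing across map nodes:]

```python
# Naive in-memory SimRank (the recursion from the Wikipedia article cited):
# s(a,a) = 1; s(a,b) = C/(|In(a)||In(b)|) * sum over s(i,j) for in-neighbors;
# 0 when either node has no in-neighbors.
def simrank(edges, nodes, C=0.8, iters=5):
    in_nbrs = {n: [] for n in nodes}
    for u, v in edges:                 # edge u -> v
        in_nbrs[v].append(u)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(s[(i, j)]
                                for i in in_nbrs[a] for j in in_nbrs[b])
                    nxt[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    nxt[(a, b)] = 0.0
        s = nxt
    return s

# two papers cited by the same two authors become similar to each other
edges = [("x", "a"), ("x", "b"), ("y", "a"), ("y", "b")]
scores = simrank(edges, ["x", "y", "a", "b"])
```

The per-pair sum in the inner loop is the part that a map-reduce version would distribute (and where the PSUM factoring in step 3 saves work).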
Re: [GSoC] SimRank algorithms on Mahout
ok~ I will do it asap~ btw, is there any advice? thanks a lot~ :) 2009/3/24 Grant Ingersoll gsing...@apache.org Graph ranking strategies are something I am very much interested in and would love to see in Mahout. Please do propose. -Grant
Re: GSoC 2009-Discussion
talking about Random Forests, I think there are two possible ways to actually implement them: The first implementation is useful when the dataset is not that big (<= 2 GB perhaps) and thus can be distributed via Hadoop's DistributedCache. In this case each mapper has access to the whole dataset and builds a subset of the forest. The second one is for large datasets, and by large I mean datasets that cannot fit on every computing node. In this case each mapper processes a subset of the dataset for all the trees. I'm more interested in the second implementation, so maybe Samuel would be interested in the first... but of course the community may actually need them both :) --- On Tue 24.3.09, Ted Dunning ted.dunn...@gmail.com wrote: From: Ted Dunning ted.dunn...@gmail.com Subject: Re: GSoC 2009-Discussion To: mahout-dev@lucene.apache.org Date: Tuesday, March 24, 2009, 0:07 There are other algorithms of serious interest. Bayesian Additive Regression Trees (BART) would make a very interesting complement to Random Forests. I don't know how important it is to get a normal decision tree algorithm going because the cost to build these is often not that high. Boosted decision trees might be of interest, but probably not as much as BART. It might also be interesting to work with this student to implement some of the diagnostics associated with random forests. There is plenty to do. - Original Message From: Samuel Louvan samuel.lou...@gmail.com My questions: - I just noticed in the mailing archive that another student is also pretty serious about implementing the random forest algorithm. Should I select decision trees instead? (for my future GSoC proposal) - Actually I thought it would be interesting if I could combine Apache Nutch and Mahout; the idea is to implement web page segmentation + a classifier inside a web crawler. By doing this, a crawler, for instance, can use the output of the classification to only follow certain links that lie on informative content parts.
Does this make sense to you guys? -- Ted Dunning, CTO DeepDyve
Re: [GSoC] SimRank algorithms on Mahout
Answering some of your email out of order, On Mon, Mar 23, 2009 at 10:00 PM, Xuan Yang sailingw...@gmail.com wrote: These days I am doing some research work on SimRank, which is a model measuring the similarity of objects. Great. I think it would be great to solve these problems and implement a map-reduce version of the SimRank algorithm. I intend to implement this as my Summer of Code project. Would you be interested in this? This sounds like a fine project. And can I get some advice from you? I am sure you can get lots of advice from this group, both on the algorithm and suggestions on how to code it into a program. Back to your detailed suggestion. Here are some of my first thoughts: 1, the directed graph could be saved in the form of an edge list in HBase. And the result Sn(a,b) could also be saved in HBase as a matrix. HBase or flat files would be a fine way to store this, and an edge list is an excellent way to store the data. The output matrix should probably be stored as triples containing row, column and value. 2, We can distribute all the n^2 pairs into the map nodes to calculate the SimRank value of the next iteration. Hopefully you can keep this sparse. If you cannot, then the algorithm may not be suitable for use on large data no matter how you parallelize it. Skipping item 3 because I don't have time right now to analyze it in detail... 4, besides, there are other optimization methods, such as thresholds, that could be used in map nodes and reduce nodes. Thresholding is likely to be a critical step in order to preserve sparsity. 1, It is true that mapreduce could make the computation of each node easier. Yet if the volume of data is very large, the transport latency of the data will become more and more serious. I think that you will find, with map-reduce in general and with Hadoop more specifically, that as the problem gets larger, the discipline imposed by the map-reduce formulation on your data transport patterns actually allows better scaling than you would expect.
Of course, if your data size scales with n^2, you are in trouble no matter how you parallelize. A good example came a year or so ago from a machine translation group at a university in Maryland. They had a large program that attempted to do cooccurrence counting on text corpora using a single multi-core machine. They started to convert this to Hadoop using the simplest possible representation for the cooccurrence matrix (index, value triples) and expected that the redundancy of this representation would lead to very bad results. Since they expected bad results, they also expected to do lots of optimization on the map-reduce version. Also, since the original program was largely memory based, they expected that the communication overhead of Hadoop would severely hurt performance. The actual result was that an 18-hour program, run on 70 machines, took 20 minutes. This is nearly perfect speedup over the sequential version. The moral is that highly sequential transport of large blocks of information can be incredibly efficient. So, methods to reduce IO would be very helpful. My first recommendation on this is to wait. Get an implementation first, then optimize. The problems you have will not be the problems you expect. 2, SimRank computes the similarity between all the nodes. If we map a group of nodes {A, B, C} into one map node, and {D, E, F} into another map node. The computation inside set {A, B, C} will be easy, and so will set {D, E, F}. But when we want to compute SimRank between A and D, it will not be very convenient. Map nodes should never communicate with each other. That is the purpose of the reduce layer. I think that what you should do is organize your recursive step so that the sum happens in the reduce. Then each mapper would output records where the key is the index pair for the summation (a and b in the notation used on Wikipedia) and the reduce does this summation.
This implies that you change your input format slightly to be variable-length records containing a node index and the In set for that node. This transformation is a very simple, one-time map-reduce step. More specifically, you would have the original input, which initially has zero values for R: links: (Node from, Node to, double R) and a transform MR step that does this to produce an auxiliary file inputSets: (Node to, List<Node> inputs): map: (Node from, Node to) -> (to, from) reduce: (Node to, List<Node> inputs) -> (to, inputs) Now you need to join the original input to the auxiliary file on both the from and to indexes. This join would require two map-reduces, one to join on the from index and one to join on the to index. The reduce in the final step should emit the cross product of the input sets. Then you need to join that against the original data. That join would require a single map-reduce. Finally, you need to group on the to index and sum up all of the distances.
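[Editor's note: the sum-in-the-reduce idea described above can be sketched in memory. The shuffle is simulated with a dict; the names and the placement of the C normalization are illustrative, and this collapses the multi-step join pipeline into one map/reduce pair for clarity:]

```python
# Mappers emit partial SimRank terms keyed by the target index pair (a, b);
# the reduce sums per key and normalizes. No mapper ever needs to talk to
# another mapper -- aggregation is entirely the reduce layer's job.
from collections import defaultdict

def map_partials(in_sets, s_prev):
    # for every pair (a, b), emit one record per (i, j) in In(a) x In(b)
    for a, in_a in in_sets.items():
        for b, in_b in in_sets.items():
            for i in in_a:
                for j in in_b:
                    yield (a, b), s_prev.get((i, j), 0.0)

def reduce_sum(records, in_sets, C):
    # the "reduce": sum the partial terms per key, then normalize
    acc = defaultdict(float)
    for key, value in records:
        acc[key] += value
    return {(a, b): 1.0 if a == b
            else C * total / (len(in_sets[a]) * len(in_sets[b]))
            for (a, b), total in acc.items()}

in_sets = {"a": ["x", "y"], "b": ["x", "y"]}
s_prev = {("x", "x"): 1.0, ("y", "y"): 1.0}  # scores from the previous pass
scores_next = reduce_sum(map_partials(in_sets, s_prev), in_sets, C=0.8)
```

Pairs whose In sets are empty simply never reach the reducer, which keeps the output sparse in the way Ted recommends.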
Re: GSoC 2009-Discussion
[snip] a web crawler. By doing this, a crawler, for instance, can use the output of the classification to only follow certain links that lie on informative content parts. Does this make sense to you guys? Hi Samuel. This would be of great interest for the Nutch folks, I think. And obviously for Mahout, since it would be a practical application of an ML algorithm. Dawid
Re: GSoC 2009-Discussion
Mmmm :) This would definitely be very useful to anyone dealing with web page parsing and indexing. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Samuel Louvan samuel.lou...@gmail.com To: mahout-dev@lucene.apache.org Sent: Sunday, March 22, 2009 7:17:11 PM Subject: GSoC 2009-Discussion Hi, I just browsed through the idea list in GSoC 2009 and I'm interested in working on Apache Mahout. Currently, I'm doing my master project at my university, related to machine learning + information retrieval. More specifically, it's about how to discover informative content in a web page using a machine learning approach. Overall, there are two stages to this task, namely web page segmentation and locating the informative content. The web page segmentation process takes a DOM tree representation of an HTML document and then groups the DOM nodes at a certain granularity. Next, a binary classification task is performed on the DOM nodes: informative content or non-informative content. The features used for the classification are, for example, inner HTML length, inner text length, stop word ratio, offsetHeight, coordinates of the HTML element on the browser, etc. The dataset is generated by a labeling program that I made (for supervised learning). Basically, a user can select and annotate a particular segment of the web page and then mark the class label as informative content or not informative content. I did some small experiments with this last semester; I played with WEKA and tried some algorithms, namely Random Forests, Decision Trees, SVMs, and Neural Networks. In this experiment, random forests and decision trees yielded the most satisfying results. Currently, I'm working on my master project and will implement a machine learning algorithm, either a decision tree or a random forest, for the classifier. For this reason, I'm very interested in working on Apache Mahout in this year's GSoC to implement one of those algorithms.
My questions: - I just noticed in the mailing archive that another student is also pretty serious about implementing the random forest algorithm. Should I select decision trees instead? (for my future GSoC proposal) - Actually I thought it would be interesting if I could combine Apache Nutch and Mahout; the idea is to implement web page segmentation + a classifier inside a web crawler. By doing this, a crawler, for instance, can use the output of the classification to only follow certain links that lie on informative content parts. Does this make sense to you guys? Maybe for more details, you can download my presentation slides and master project description at http://rapidshare.com/files/212352116/Slide_Doc.zip A little bit of background about me: I'm a 2nd-year Master's student at TU Eindhoven, Netherlands. Last year I also participated in GSoC with OpenNMS (http://code.google.com/soc/2008/opennms/appinfo.html?csaid=EDA725BD4D34D481) Looking forward to your feedback and input. Regards, Samuel L.
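[Editor's note: the per-node features Samuel lists (inner text length, stop word ratio, etc.) can be illustrated with a tiny sketch. The names are hypothetical and the DOM node is faked as plain text; a real pipeline would walk the parsed HTML tree:]

```python
# Illustrative extraction of per-node features for the informative-content
# classifier: text length, word count, and stop-word ratio.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # toy list

def node_features(inner_text):
    words = inner_text.lower().split()
    stop_ratio = (sum(w in STOP_WORDS for w in words) / len(words)
                  if words else 0.0)
    return {"text_length": len(inner_text),
            "word_count": len(words),
            "stop_word_ratio": stop_ratio}

feats = node_features("The quick brown fox is in the garden")
```

Vectors like these, labeled informative/non-informative, are what the decision tree or random forest would be trained on.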
Re: GSOC Mentor
Hi guys, I'm actually interested in your project. I haven't started my proposal yet, because I'm still working on my finals now; I'll be writing it soon and will let you guys know of any updates. But I'm generally interested in this idea: http://wiki.apache.org/general/SummerOfCode2008#lucene I took a Machine Learning class but haven't had the chance to implement an algorithm. I used Lucene previously, and I have a strong interest in Machine Learning, so I thought it would be nice if I could spend my summer implementing a Machine Learning algorithm. Regards, Grady On Fri, Mar 20, 2009 at 4:27 AM, Grant Ingersoll gsing...@apache.org wrote: Hey Gang, The ASF has been accepted to participate in GSOC. If you want to be a mentor, you can now sign up to be one. Just choose to be a part of the ASF. http://socghop.appspot.com/program/home/google/gsoc2009 You should also subscribe to code-awa...@a.o for ASF-specific info. Note, you have to be a committer to be a mentor. -Grant -- Grady Laksmono gradyfau...@laksmono.com www.laksmono.com I know the plans I have for you, declares the Lord, plans to prosper you and not to harm you, plans to give you hope and a future. ~ Jeremiah 29:11 ~
Re: GSoC 09 project ideas...
Hi Z.S., I'll update LUCENE-1313 after LUCENE-1516 is committed. I can post the basic new patch I have for LUCENE-1313 (heavily simplified compared to the previous patches); however, it will assume LUCENE-1516. The other area that will need to be addressed is standard benchmarking for the different realtime search approaches, as we don't know what will be best yet. What areas in regard to realtime search are you working on? -J On Wed, Mar 18, 2009 at 9:04 AM, Zaid Md. Abdul Wahab Sheikh sheikh.z...@gmail.com wrote: Hi lucene, In this link http://wiki.apache.org/general/SummerOfCode2009 , there are no project ideas for Lucene proper (only ideas for Mahout are listed). Please put up some ideas for Lucene there, or please mention some popular open issues that might be suitable as a GSoC project. I would very much like to work on Lucene during Summer of Code 09. I am currently researching/doing a project on realtime search. It seems a contrib exists for realtime search in Lucene: http://issues.apache.org/jira/browse/LUCENE-1313. Can anyone give me an update on its status? Is that sufficient/complete, or should I start investigating possibilities of integrating 'realtime' search in Lucene? Please comment. Z.S.
Re: GSoC 09 project ideas...
I think creating a better Highlighter for Lucene, which is actively being discussed: https://issues.apache.org/jira/browse/LUCENE-1522 would make a good GSoC project, but I don't think I have time to mentor. Realtime search is currently in progress already, being tracked/iterated here: https://issues.apache.org/jira/browse/LUCENE-1516 The original Ocean (LUCENE-1313) that you found was a more ambitious approach, which after discussions here eventually led to the simpler approach in LUCENE-1516. Mike Abdul Wahab Sheikh wrote: Hi lucene, In this link http://wiki.apache.org/general/SummerOfCode2009 , there are no project ideas for Lucene proper (only ideas for Mahout are listed). Please put up some ideas for Lucene there, or please mention some popular open issues that might be suitable as a GSoC project. I would very much like to work on Lucene during Summer of Code 09. I am currently researching/doing a project on realtime search. It seems a contrib exists for realtime search in Lucene: http://issues.apache.org/jira/browse/LUCENE-1313 . Can anyone give me an update on its status? Is that sufficient/complete, or should I start investigating possibilities of integrating 'realtime' search in Lucene? Please comment. Z.S. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org