Mmmm.... :) This would definitely be very useful to anyone dealing with web page parsing and indexing.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Samuel Louvan <samuel.lou...@gmail.com>
> To: mahout-dev@lucene.apache.org
> Sent: Sunday, March 22, 2009 7:17:11 PM
> Subject: GSoC 2009-Discussion
>
> Hi,
> I just browsed through the idea list for GSoC 2009 and I'm interested
> in working on Apache Mahout.
> Currently, I'm doing my master's project at my university, related to
> machine learning + information retrieval. More specifically, it's
> about how to discover informative content in a web page using a
> machine learning approach.
>
> Overall, there are two stages in this task, namely web page
> segmentation and locating the informative content.
> The web page segmentation process takes a DOM tree representation of
> an HTML document and groups the DOM nodes at a certain granularity.
> Next, a binary classification task is performed on the DOM nodes to
> decide whether each one is informative content or non-informative
> content. The features used for the classification are, for example,
> inner HTML length, inner text length, stop word ratio, offsetHeight,
> the coordinates of the HTML element in the browser, etc.
>
> The dataset is generated by a labeling program that I made (for
> supervised learning). Basically, a user can select & annotate a
> particular segment of the web page and then mark the class label as
> informative content or non-informative content.
>
> I did some small experiments with this last semester. I played with
> WEKA and tried some algorithms, namely random forests, decision
> trees, SVM, and neural networks. In these experiments, random forests
> and decision trees yielded the most satisfying results.
>
> Currently, I'm working on my master's project and will implement a
> machine learning algorithm, either decision tree or random forest,
> for the classifier. For this reason, I'm very interested in working
> on Apache Mahout in this year's GSoC to implement one of those
> algorithms.
>
> My questions:
> - I just noticed in the mailing list archive that another student is
>   also pretty serious about implementing the random forest algorithm.
>   Should I select decision tree instead? (for my future GSoC proposal)
> - Actually, I think it would be interesting to combine Apache Nutch
>   and Mahout, so the idea is to implement web page segmentation +
>   classifier inside a web crawler. By doing this, a crawler, for
>   instance, can use the output of the classification to follow only
>   certain links that lie in the informative content parts.
>   Is this interesting, and does it make sense to you guys?
>
> For more details, you can download my presentation slides and
> master's project description at
> http://rapidshare.com/files/212352116/Slide_Doc.zip
>
> A little bit of background about me: I'm a 2nd year Master's student
> at TU Eindhoven, Netherlands.
> Last year I also participated in GSoC with OpenNMS
> (http://code.google.com/soc/2008/opennms/appinfo.html?csaid=EDA725BD4D34D481)
>
> Looking forward to your feedback and input.
>
> Regards,
> Samuel L.
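
For readers following the thread, here is a rough sketch of how three of the per-segment features named in the quoted message (inner HTML length, inner text length, stop word ratio) might be computed. It is not Samuel's implementation: the function name `segment_features`, the helper class, and the tiny stop word list are all assumptions made for illustration, and browser-dependent features such as offsetHeight or element coordinates are left out because they need a rendering engine.

```python
from html.parser import HTMLParser

# Tiny illustrative stop word list (an assumption; a real list would be larger).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "on", "for"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def segment_features(inner_html):
    """Return a feature dict for one DOM segment, given its inner HTML."""
    extractor = _TextExtractor()
    extractor.feed(inner_html)
    extractor.close()
    text = "".join(extractor.chunks)

    # Lowercase the words and strip surrounding punctuation before matching.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    stop_count = sum(1 for w in words if w in STOP_WORDS)

    return {
        "inner_html_length": len(inner_html),
        "inner_text_length": len(text),
        # Guard against empty segments (e.g. image-only blocks).
        "stop_word_ratio": stop_count / len(words) if words else 0.0,
    }
```

Each labeled segment would yield one such feature dict, which could then be fed to whatever classifier is chosen (decision tree, random forest, and so on).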