Hi Lewis, I'm Nisrina from Universitas Indonesia, and I'm interested in participating in GSoC 2013 with this community. I have an idea: implementing a content extraction module inside the Nutch web crawler. I think such a module would benefit Apache Nutch and also Lucene. Content extraction is about extracting the most informative part of a document. For instance, when we crawl a news web page, there is a lot of noisy information such as the page header, advertisements, links to other related news, etc. With content extraction, we would be able to extract just the main content/article of the web page.
I have found a technical paper that outlines a state-of-the-art content extraction technique. The technique uses DOM text density to discover the informative content: DOM Based Content Extraction via Text Density <http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf> Does this idea seem feasible to you?
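To make the idea a bit more concrete, here is a rough sketch of the basic text density ratio: for each DOM node, the number of text characters in its subtree divided by the number of tags in its subtree (treating a tag count of zero as one). This is only my simplified illustration, not the paper's full method, which as far as I understand also uses a composite density and further selection steps. It is written in Python with the standard-library `html.parser` for brevity; a Nutch module would be in Java. The names `Node`, `DensityParser`, and `text_densities` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible content.
SKIP_TAGS = {"script", "style"}


class Node:
    """One DOM element with running subtree counts (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.chars = 0  # text characters in this subtree
        self.tags = 0   # descendant tags in this subtree

    @property
    def density(self):
        # Basic ratio: characters per tag; a tag count of 0 is treated as 1.
        return self.chars / max(self.tags, 1)


class DensityParser(HTMLParser):
    """Builds a small tree and accumulates character and tag counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        # The new tag counts as a descendant of every ancestor.
        ancestor = self.current
        while ancestor is not None:
            ancestor.tags += 1
            ancestor = ancestor.parent
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in SKIP_TAGS:
            return
        count = len(data.strip())
        node = self.current
        while node is not None:
            node.chars += count
            node = node.parent


def text_densities(html):
    """Return every element node of the parsed document."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    nodes, stack = [], [parser.root]
    while stack:
        node = stack.pop()
        if node is not parser.root:
            nodes.append(node)
        stack.extend(node.children)
    return nodes


if __name__ == "__main__":
    sample = ('<div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>'
              '<div id="article"><p>A long paragraph of article text.</p></div>')
    for n in text_densities(sample):
        print(n.tag, n.attrs.get("id", ""), round(n.density, 2))
```

On a typical news page, a navigation block made of many short links should get a low density, while the article body with long paragraphs and few tags should get a high one, which is the signal the extraction would rely on.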

