Hi Lewis,
I'm Nisrina from Universitas Indonesia, and I'm interested in participating
in GSoC 2013 with this community.
I have an idea: implementing a content extraction module inside the Nutch
web crawler. I think such a module would benefit Apache Nutch and also
Lucene. Content extraction is about extracting the most informative part of
a document.
For instance, when we crawl a news web page there is a lot of noisy
information, such as the heading of the web page, advertisements, links to
other related news, etc. With content extraction we would be able to extract
just the main content/article of the web page.

I have found a technical paper which outlines a state-of-the-art content
extraction technique. The technique uses the text density of DOM nodes to
discover the informative content.
DOM Based Content Extraction via Text Density
<http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf>
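
To give a rough feel for the idea, here is a minimal sketch in Python (not
Nutch code). It assumes a simplified measure: the text density of a DOM node
is the number of characters in its subtree divided by the number of tags in
its subtree, with a tag count of zero treated as one. The names Node,
TreeBuilder, density and densest_node are made up for this illustration, and
the sketch leaves out the further refinements and the selection procedure
described in the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node used only for this illustration."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text found directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a small tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP:
            self.current.text.append(data.strip())


def stats(node):
    """Return (character count, tag count) for the subtree below node."""
    chars = sum(len(t) for t in node.text)
    tags = 0
    for child in node.children:
        child_chars, child_tags = stats(child)
        chars += child_chars
        tags += child_tags + 1
    return chars, tags


def density(node):
    """Simplified text density: characters per tag (assumption, see above)."""
    chars, tags = stats(node)
    return chars / max(tags, 1)


def walk(node):
    """Yield every element below node, depth first."""
    for child in node.children:
        yield child
        yield from walk(child)


def densest_node(html):
    """Return the element with the highest simplified text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(walk(builder.root), key=density)


if __name__ == "__main__":
    sample = ("<html><body>"
              "<div><a href='/a'>Home</a><a href='/b'>News</a></div>"
              "<div><h1>Headline</h1>"
              "<p>A long paragraph of article text that carries the story.</p>"
              "</div></body></html>")
    best = densest_node(sample)
    print(best.tag, " ".join(best.text))
```

Navigation blocks contain many tags and little text, so their density is
low, while article paragraphs contain a lot of text and few tags. A real
module would of course work on the DOM that Nutch already produces during
parsing rather than on this toy tree.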
  

Does this idea seem feasible to you?


