Hi Nisrina,

To my knowledge, the parser used in Nutch already has a similar capability. You can configure Tika to use the Boilerpipe algorithm and choose the Boilerpipe extraction type. Boilerpipe relies on shallow text features to detect boilerplate.
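
To give a rough idea of what "shallow text features" means, here is a minimal Python sketch. It is not Boilerpipe's actual code and not part of Nutch or Tika. It assumes that a content block can be judged by just two features, its word count and its link density (the share of words inside `<a>` tags). The names `BlockCollector` and `extract_main_text`, the list of block tags, and both thresholds are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}


class BlockCollector(HTMLParser):
    """Split HTML into text blocks and count the words inside <a> per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (total_words, link_words, text)
        self._words = []      # words of the block currently being read
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of open <a> tags

    def _flush(self):
        # Close the current block, if it holds any words, and start a new one.
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((len(self._words), self._link_words, text))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for total, link_words, text in collector.blocks:
        if total >= min_words and link_words / total <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

With these illustrative thresholds, a navigation bar made up only of links, or a short advertisement line, is dropped, while a full paragraph of article text is kept. A real extractor uses more features and tuned thresholds.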
If you want to incorporate your own algorithm, you might want to look into creating your own parse plugin.

Group, please correct me if any of this is incorrect.

Cheers,
Ye

On Fri, Apr 19, 2013 at 7:06 PM, nisrina <[email protected]> wrote:

> Hi Lewis,
> I'm Nisrina from Universitas Indonesia and I'm interested to participate in
> GSoC 2013 for this community.
> I have an idea to implement a content extraction module inside the Nutch web
> crawler. I think the content extraction module would benefit Apache Nutch
> and also Lucene. The idea of content extraction is about how to extract the
> most informative part of a document.
> For instance, if we crawl a news web page there a lot noisy information such
> as the heading of web page, advertisement, links to other related news, etc.
> By using content extraction we would be able to extract the main
> content/article of the web page.
>
> I have found a technical paper which outlines the state of the art content
> extraction technique. The technique is based on the DOM text density to
> discover the informative content.
> DOM Based Content Extraction via Text Density
> <http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf>
>
> Is this idea seems feasible for you?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/DISCUSS-Google-Summer-of-Code-tp4044606p4057249.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.

