Hi Ye, hi Nisrina,

you're right. Take a look at [1]; Markus has already done a lot of work on this.
--Roland

[1] https://issues.apache.org/jira/browse/NUTCH-961

On Fri, Apr 19, 2013 at 2:11 PM, Ye T Thet <[email protected]> wrote:

> Hi Nisrina,
>
> To my knowledge, the parser used in Nutch has a similar ability. You can
> configure Tika to use the boilerpipe algorithm and configure the
> boilerpipe extraction type. Boilerpipe uses a shallow text detection
> algorithm.
>
> If you want to incorporate your own algorithm, you might want to look into
> creating your own parse plugin.
>
> Group, please correct me if this information is incorrect.
>
> Cheers,
>
> Ye
>
>
> On Fri, Apr 19, 2013 at 7:06 PM, nisrina <[email protected]> wrote:
>
>> Hi Lewis,
>> I'm Nisrina from Universitas Indonesia and I'm interested in
>> participating in GSoC 2013 for this community.
>> I have an idea to implement a content extraction module inside the Nutch
>> web crawler. I think the content extraction module would benefit Apache
>> Nutch and also Lucene. The idea of content extraction is to extract the
>> most informative part of a document.
>> For instance, if we crawl a news web page, there is a lot of noisy
>> information, such as the heading of the web page, advertisements, links
>> to other related news, etc.
>> By using content extraction we would be able to extract the main
>> content/article of the web page.
>>
>> I have found a technical paper which outlines a state-of-the-art content
>> extraction technique. The technique uses DOM text density to discover
>> the informative content.
>> DOM Based Content Extraction via Text Density
>> <http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf>
>>
>> Does this idea seem feasible to you?
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/DISCUSS-Google-Summer-of-Code-tp4044606p4057249.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>

