Re: [DISCUSS] Google Summer of Code

Ye T Thet Fri, 19 Apr 2013 05:12:07 -0700

Hi Nisrina,

To my knowledge, the parser used in Nutch already has a similar ability.
You can configure Tika to use the boilerpipe algorithm and choose the
boilerpipe extraction type. Boilerpipe uses a shallow text detection
algorithm.


If you want to incorporate your own algorithm you might want to look into
creating your own parse plugin.
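
As a rough illustration of what the core of such a plugin might compute,
here is a minimal sketch of the text density idea from the paper quoted
below: the characters of text in a subtree divided by the number of tags in
it. It is not the Nutch parse plugin API and not the paper's full algorithm.
The class and method names are hypothetical, the input is assumed to be
well-formed XHTML, and picking the single densest child is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only; not part of Nutch or Tika.
class TextDensitySketch {

    // Assumption: the input is well-formed XHTML. Real web pages would
    // need a lenient HTML parser in front of this step.
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Text density of a subtree: characters of text divided by the number
    // of descendant tags (treated as 1 when there are none).
    static double density(Element element) {
        String text = element.getTextContent().replaceAll("\\s+", " ").trim();
        int tags = element.getElementsByTagName("*").getLength();
        return (double) text.length() / Math.max(1, tags);
    }

    // Simplification: return the direct child element with the highest
    // density, or null if the parent has no child elements.
    static Element densestChild(Element parent) {
        Element best = null;
        double bestDensity = -1.0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                double d = density((Element) child);
                if (d > bestDensity) {
                    best = (Element) child;
                    bestDensity = d;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Element body = parse(
                "<body>"
                + "<div id=\"nav\"><a href=\"/\">Home</a> <a href=\"/world\">World</a></div>"
                + "<div id=\"article\"><p>The main article text is long and contains"
                + " very few tags compared to the navigation block.</p></div>"
                + "</body>");
        // The link-heavy navigation block scores low, the article block high.
        System.out.println(densestChild(body).getAttribute("id")); // prints "article"
    }
}
```

Link-heavy blocks such as navigation bars get a low score because each tag
carries only a few characters, while article text gets a high one.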

Group, please correct me if the information is incorrect.

Cheers,

Ye


On Fri, Apr 19, 2013 at 7:06 PM, nisrina <[email protected]>wrote:

> Hi Lewis,
> I'm Nisrina from Universitas Indonesia and I'm interested in participating
> in GSoC 2013 for this community.
> I have an idea to implement a content extraction module inside the Nutch
> web crawler. I think the content extraction module would benefit Apache
> Nutch and also Lucene. The idea of content extraction is to extract the
> most informative part of a document.
> For instance, if we crawl a news web page, there is a lot of noisy
> information such as the heading of the web page, advertisements, links to
> other related news, etc.
> By using content extraction we would be able to extract the main
> content/article of the web page.
>
> I have found a technical paper which outlines a state-of-the-art content
> extraction technique. The technique uses DOM text density to discover the
> informative content.
> DOM Based Content Extraction via Text Density
> <http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf>
>
> Does this idea seem feasible to you?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/DISCUSS-Google-Summer-of-Code-tp4044606p4057249.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
