Hi Ye, hi Nisrina,

You're right.
Take a look at [1]; Markus has already done a lot of work on this.

--Roland


[1] https://issues.apache.org/jira/browse/NUTCH-961



On Fri, Apr 19, 2013 at 2:11 PM, Ye T Thet <[email protected]> wrote:

> Hi Nisrina,
>
> To my knowledge, the parser used in Nutch has a similar ability. You can
> configure Tika to use the boilerpipe algorithm and choose the boilerpipe
> extraction type. Boilerpipe uses a detection algorithm based on shallow
> text features.
>
> If you want to incorporate your own algorithm you might want to look into
> creating your own parse plugin.
>
> Group, please correct me if the information is incorrect.
>
> Cheers,
>
> Ye
>
>
> On Fri, Apr 19, 2013 at 7:06 PM, nisrina <[email protected]> wrote:
>
>> Hi Lewis,
>> I'm Nisrina from Universitas Indonesia and I'm interested in
>> participating in GSoC 2013 for this community.
>> I have an idea to implement a content extraction module inside the Nutch
>> web crawler. I think the content extraction module would benefit Apache
>> Nutch and also Lucene. The idea of content extraction is to extract the
>> most informative part of a document.
>> For instance, if we crawl a news web page, there is a lot of noisy
>> information, such as the page header, advertisements, links to other
>> related news, etc.
>> By using content extraction we would be able to extract the main
>> content/article of the web page.
>>
>> I have found a technical paper which outlines a state-of-the-art content
>> extraction technique. The technique uses DOM text density to discover
>> the informative content.
>> DOM Based Content Extraction via Text Density
>> <http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf>
>>
>> Does this idea seem feasible to you?
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/DISCUSS-Google-Summer-of-Code-tp4044606p4057249.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
>
>
