[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017724#comment-13017724
 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Please correct me if I'm wrong.
In my limited understanding, Nutch uses a plugin system; one kind of plugin is 
for parsing HTML pages (the HTMLParseFilter class), and Nutch selects the 
appropriate plugin based on the configuration and runs it.

Inside parse-html, the main things it extracts are: Content, Title, and 
Outlinks.

The problem I'm trying to solve is adding custom fields, as in: 
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
for various types of content and various sites, and adding them to the index 
fields. Instead of creating a new plugin for each site, a Nutch user could simply 
create an XPath configuration file, put it in the configuration folder, and the 
parsing of custom fields would be done automatically without writing or compiling 
any code.
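For illustration, such a per-site XPath configuration file might look like the 
sketch below. The file name, element names, and XPath expressions are purely 
hypothetical, not an existing Nutch format:

```xml
<!-- guardian.xml: hypothetical per-site extraction rules -->
<xpath-config>
  <!-- regex deciding which pages this configuration applies to -->
  <url-pattern>^https?://www\.guardian\.co\.uk/.*</url-pattern>
  <!-- override the default title with the article headline -->
  <field name="title"
         xpath="//div[@id='main-article-info']/h1/text()"/>
  <!-- a custom index field, as in the blog post linked above -->
  <field name="publish-date"
         xpath="//a[@class='dateline']/text()"/>
</xpath-config>
```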

In addition, the user could also override Content, Title, and Outlinks with 
different results. For example, to set the title of a page from a news site 
(example: 
http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted), 

instead of the value of <head><title> :
=Ivory Coast horror recounted by victims and perpetrators | World news | The 
Guardian

get only the headline, by using an XPath of: 
/html/body/div[@id='wrapper']/div[@id='box']/div[@id='article-header']/div[@id='main-article-info']/h1/text(),
 and get:
=Ivory Coast horror recounted by victims and perpetrators
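A minimal sketch of this kind of extraction using the JDK's built-in 
javax.xml.xpath API, assuming the page has already been cleaned up into 
well-formed markup (as parse-html does via its HTML parsers). The class name, 
markup, and XPath below are simplified stand-ins for the Guardian example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathTitleExtractor {

    // Evaluates a configured XPath against well-formed markup and
    // returns the selected text.
    public static String extract(String xml, String xpathExpr) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpathExpr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for the article markup described above.
        String html =
            "<html><body><div id='article-header'>"
          + "<h1>Ivory Coast horror recounted by victims and perpetrators</h1>"
          + "</div></body></html>";
        // Prints the headline only, not the full <head><title> value.
        System.out.println(extract(html, "//div[@id='article-header']/h1/text()"));
    }
}
```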

Or follow only the outlinks of related news and ignore the rest:

8 Apr 2011
Ouattara calls for Ivory Coast sanctions to be lifted 
7 Apr 2011
Ivory Coast crisis: Q&A 
5 Apr 2011
After Gbagbo, what next for Ivory Coast? 
5 Apr 2011
Ivory Coast: The final battle 

as in the screenshot here: 
https://issues.apache.org/jira/secure/attachment/12475860/app_guardian_ivory_coast_news_exmpl.png

Since the default parser is parse-html, I add the handler there as a kind of 
if-else bypass: if the parsed page's URL matches one of those configurations, 
the page is parsed by that configuration; if no configuration matches the URL, 
the default parser mechanism is used. 
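The if-else bypass could be sketched as below. ConfigSelector, the pattern map, 
and the configuration name "guardian.xml" are hypothetical; real code would 
load the URL patterns from the files in the configuration folder:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ConfigSelector {

    // Hypothetical mapping from URL regex to XPath configuration name;
    // in the proposed plugin this would be built by scanning the
    // configuration folder.
    private static final Map<Pattern, String> CONFIGS = new LinkedHashMap<>();
    static {
        CONFIGS.put(Pattern.compile("^https?://www\\.guardian\\.co\\.uk/.*"),
                    "guardian.xml");
    }

    // Returns the name of the first matching configuration,
    // or null to fall back to the default parse-html mechanism.
    public static String select(String url) {
        for (Map.Entry<Pattern, String> e : CONFIGS.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Matches the Guardian pattern, so its XPath config is used.
        System.out.println(select(
            "http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted"));
        // No pattern matches, so the default parser is used (null).
        System.out.println(select("http://example.com/other"));
    }
}
```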

I'm sorry for my English, and for not presenting my idea clearly enough.

> [GSoC 2011] A Plugin for extracting certain element of a web page on html 
> page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>
> Nutch uses the parse-html plugin to parse web pages: it processes the contents 
> of the web page by removing HTML tags and components like JavaScript and CSS, 
> leaving the extracted text to be stored in the index. Nutch by default 
> doesn't have the capability to select certain atomic elements of an HTML page, 
> like certain tags, certain content, some parts of the page, etc.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and 
> text as its nodes. These branches and nodes can be extracted using XPath, which 
> allows us to select a certain branch or node of an XML document and can 
> therefore be used to extract certain information and treat it differently based 
> on its content and the user's requirements. Furthermore, web domains like news 
> websites usually have the same HTML code structure for storing the information 
> on their web pages. This shared structure can be parsed using the same XPath 
> query to retrieve the same content elements. All of the XPath queries for 
> selecting various contents can be stored in an XPath configuration file.
> Nutch is intended for various web sources, and not all of the web pages 
> retrieved from those sources have the same HTML code structure, so they have 
> to be treated differently using the correct XPath configuration. The selection 
> of the correct XPath configuration can be done automatically by using a regex 
> to match the URL of the web page against the valid URL patterns for each XPath 
> configuration.
> This automatic mechanism allows the Nutch user to process various web pages 
> and keep only the information the user wants, therefore making the index 
> more accurate and its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a configuration editor application built using the 
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
