[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-978:
---------------------------------------
Fix Version/s: 1.10
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
> Reporter: Ammar Shadiq
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: gsoc2012, mentor
> Fix For: 2.4, 1.10
>
> Attachments:
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf,
> app_guardian_ivory_coast_news_exmpl.png,
> app_screenshoot_configuration_result.png,
> app_screenshoot_configuration_result_anchor.png,
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png,
> for_GSoc.zip, version_alpha2.zip
>
> Original Estimate: 1,680h
> Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the
> contents of a web page by removing HTML tags and components such as
> JavaScript and CSS, leaving the extracted text to be stored in the index.
> By default, Nutch cannot select specific elements of an HTML page, such as
> certain tags, certain content, or a particular part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches
> and text as its leaf nodes. These branches and nodes can be extracted using
> XPath. XPath allows us to select a specific branch or node of an XML
> document, and can therefore be used to extract certain information and treat
> it differently based on its content and the user's requirements. Furthermore,
> a web domain such as a news website usually uses the same HTML structure
> across its web pages. Pages sharing that structure can be parsed with the
> same XPath queries to retrieve the same content elements. All of the XPath
> queries for selecting various content can be stored in an XPath
> Configuration File.
> Nutch is intended for crawling a variety of web sources, and not all pages
> retrieved from those sources share the same HTML structure, so they have to
> be treated differently, each using the correct XPath configuration. The
> correct XPath configuration can be selected automatically with a regex, by
> matching the URL of the web page against the valid URL pattern for that
> XPath configuration.
> This automatic mechanism allows Nutch users to process various web pages
> and extract only the information they want, making the index more accurate
> and its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting
> certain elements on various news websites for the purpose of document
> clustering. This includes a Configuration Editor Application built using the
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
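
The mechanism described above (choose an XPath configuration by matching the
page URL against a regex, then apply that XPath query to the page) could be
sketched roughly as follows. This is only an illustration using the standard
JDK XML and regex APIs, not the plugin's actual code: the class name, the URL
patterns, the XPath queries and the sample page are all hypothetical, and the
input is assumed to be well-formed XHTML (real-world HTML would first need a
lenient parser to produce a DOM).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathConfigSketch {

    // Hypothetical stand-in for the XPath Configuration File:
    // each entry maps a URL regex to the XPath query used for matching pages.
    static final Map<Pattern, String> CONFIG = new LinkedHashMap<>();
    static {
        CONFIG.put(Pattern.compile("^https?://news\\.example\\.com/.*"),
                   "//div[@class='headline']/h1");
        CONFIG.put(Pattern.compile("^https?://blog\\.example\\.org/.*"),
                   "//div[@id='post']/p");
    }

    // Returns the XPath query of the first configuration whose URL pattern
    // matches the whole URL, or null if no configuration applies.
    static String selectXPath(String url) {
        for (Map.Entry<Pattern, String> entry : CONFIG.entrySet()) {
            if (entry.getKey().matcher(url).matches()) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Parses well-formed XHTML into a DOM and returns the trimmed string
    // value of the first node selected by the XPath query.
    static String extract(String xhtml, String xpathQuery) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return XPathFactory.newInstance().newXPath()
                .evaluate(xpathQuery, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class='headline'><h1>Title</h1></div>"
                + "<p>Other text</p>"
                + "</body></html>";
        String query = selectXPath("http://news.example.com/world/story-1");
        System.out.println(extract(page, query)); // prints "Title"
    }
}
```

Keeping the URL patterns in an ordered map means the first matching
configuration wins, which is one simple way to resolve overlapping patterns;
the proposal itself does not specify how such conflicts are handled.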
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)