[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548826#comment-13548826
]
Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------
It sounds like this is nearly ready for review. I would suggest that you
incorporate your suggestions into a fresh patch against the head version you
are working on. Is it the 2.x branch or 1.x trunk?
As you said, a plugin of this nature always seems to be sought after, so a
reasonably documented and easily configurable implementation would be very
welcome.
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
> Reporter: Ammar Shadiq
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: gsoc2012, mentor
> Fix For: 2.2
>
> Attachments: app_guardian_ivory_coast_news_exmpl.png,
> app_screenshoot_configuration_result_anchor.png,
> app_screenshoot_configuration_result.png, app_screenshoot_source_view.png,
> app_screenshoot_url_regex_filter.png, for_GSoc.zip,
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf,
> version_alpha2.zip
>
> Original Estimate: 1,680h
> Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the
> contents of a web page by removing HTML tags and components such as
> JavaScript and CSS, leaving the extracted text to be stored in the index. By
> default, Nutch cannot select specific elements of an HTML page, such as
> certain tags, certain content, or a particular part of the page.
> An HTML page has a tree-like, XML-style structure, with HTML tags as its
> branches and text as its leaf nodes. These branches and nodes can be
> extracted using XPath. XPath allows us to select a particular branch or node
> of an XML document, and can therefore be used to extract specific
> information and treat it differently based on its content and the user's
> requirements. Furthermore, a web domain such as a news website usually uses
> the same HTML structure for storing information across its web pages. Pages
> sharing that structure can be parsed with the same XPath queries to retrieve
> the same content elements. All of the XPath queries for selecting the
> various content elements could be stored in an XPath Configuration File.
> Nutch is meant to crawl a variety of web sources, and not all of the pages
> retrieved from those sources share the same HTML structure, so they have to
> be treated differently, each with the correct XPath configuration. The
> correct XPath configuration could be selected automatically by using a regex
> to match the URL of the web page against the valid URL pattern for that
> XPath configuration.
> This automatic mechanism allows Nutch users to process a variety of web
> pages and keep only the information they want, making the index more
> accurate and its content more flexible.
> The components for this idea have been tested on Nutch 1.2, selecting
> certain elements on various news websites for the purpose of document
> clustering. They include a Configuration Editor Application built using the
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
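For readers following the description above, the mechanism it outlines (pick an
XPath configuration by matching the page URL against a regex, then run that
configuration's XPath queries over the page) could be sketched roughly as
follows. This is only an illustration using the JDK's javax.xml.xpath API, not
the code from the attached patch: the class name, the field names, the example
URL pattern and the XPath queries are all assumptions, and the input is assumed
to be well-formed XHTML (real pages would first need an HTML-tolerant parser).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch; none of these names come from Nutch or the patch.
class XPathConfigSketch {

    // Assumed in-memory stand-in for the "XPath Configuration File":
    // URL regex -> (field name -> XPath query).
    static final Map<Pattern, Map<String, String>> CONFIGS = new LinkedHashMap<>();

    static {
        Map<String, String> news = new LinkedHashMap<>();
        news.put("title", "/html/body/div[@id='article']/h1");
        news.put("body", "/html/body/div[@id='article']/p");
        CONFIGS.put(Pattern.compile("^https?://news\\.example\\.org/.*"), news);
    }

    // Returns the first configuration whose regex matches the whole URL,
    // or null when no configuration applies.
    static Map<String, String> selectConfig(String url) {
        for (Map.Entry<Pattern, Map<String, String>> e : CONFIGS.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return null;
    }

    // Evaluates every XPath query of the matching configuration against the
    // page. Each query yields the text of the first node it selects.
    static Map<String, String> extract(String url, String xhtml) {
        Map<String, String> fields = new LinkedHashMap<>();
        Map<String, String> config = selectConfig(url);
        if (config == null) {
            return fields; // no configuration for this URL
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> q : config.entrySet()) {
            try {
                String value = xpath.evaluate(
                        q.getValue(), new InputSource(new StringReader(xhtml)));
                fields.put(q.getKey(), value.trim());
            } catch (XPathExpressionException ex) {
                throw new IllegalArgumentException(
                        "Bad XPath or unparsable page: " + q.getValue(), ex);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"article\"><h1>Headline</h1>"
                + "<p>First paragraph.</p></div></body></html>";
        System.out.println(extract("http://news.example.org/story/1", page));
    }
}
```

The page is re-parsed for every query here to keep the sketch short; a real
plugin would more likely parse once and evaluate all queries against the
resulting DOM.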
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira