[jira] [Created] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Ammar Shadiq (JIRA) Wed, 06 Apr 2011 08:11:45 -0700

[GSoC 2011] A Plugin for extracting certain element of a web page on html page 
parsing.
---------------------------------------------------------------------------------------


                 Key: NUTCH-978
                 URL: https://issues.apache.org/jira/browse/NUTCH-978
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.2
         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
            Reporter: Ammar Shadiq
             Fix For: 2.0


Nutch uses the parse-html plugin to parse web pages. The plugin processes the 
contents of a web page by removing HTML tags and components such as JavaScript 
and CSS, leaving the extracted text to be stored in the index. By default, 
Nutch has no way to select specific elements of an HTML page, such as 
particular tags, particular content, or a particular part of the page.
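
To make the current behaviour concrete, here is a minimal sketch of that kind 
of tag stripping. It is written in Python with the standard library only and 
is not the parse-html implementation; the names TextExtractor and extract_text 
are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text with tags, scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Everything on the page ends up in one flat string, which is why a single 
element such as the headline cannot be picked out afterwards.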

An HTML page has a tree-like XML structure, with HTML tags as its branches and 
text as its leaf nodes. These branches and nodes can be extracted using XPath. 
XPath lets us select a particular branch or node of an XML document, so it can 
be used to extract specific information and treat it differently depending on 
its content and the user's requirements. Furthermore, a web domain such as a 
news website usually uses the same HTML structure for the information on all 
of its pages. Pages that share a structure can be parsed with the same XPath 
queries to retrieve the same content elements. All of the XPath queries for 
selecting the various content elements could be stored in an XPath 
configuration file.
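
As a rough illustration of the idea, the sketch below maps field names to 
XPath queries and applies them to one page. It is Python rather than plugin 
code, it assumes the page is well-formed XHTML (real pages would first need a 
tolerant HTML parser), and it relies on the limited XPath subset supported by 
xml.etree.ElementTree. The field names, queries and sample page are all 
hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical "XPath configuration" for one news site: field name -> query.
NEWS_XPATHS = {
    "title": ".//h1[@class='headline']",
    "author": ".//span[@class='byline']",
    "body": ".//div[@id='article']/p",
}

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<h1 class="headline">Nutch gains XPath plugin</h1>
<span class="byline">A. Writer</span>
<div id="article"><p>First paragraph.</p><p>Second <b>paragraph</b>.</p></div>
</body></html>"""


def extract_fields(xhtml, xpaths):
    """Return {field: [text of each matching element]} for one page."""
    root = ET.fromstring(xhtml)
    fields = {}
    for name, query in xpaths.items():
        # itertext() also collects text nested in child tags such as <b>.
        texts = ["".join(node.itertext()).strip() for node in root.findall(query)]
        fields[name] = [t for t in texts if t]
    return fields
```

Calling extract_fields(PAGE, NEWS_XPATHS) returns the headline, byline and 
article paragraphs as separate fields and leaves out the navigation link, 
which no query selects.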

Nutch is meant to crawl a variety of web sources, and not all of the pages 
retrieved from those sources have the same HTML structure, so each page has to 
be treated with the correct XPath configuration. The correct XPath 
configuration could be selected automatically by matching the URL of the web 
page against a regex that describes the valid URL pattern for that 
configuration.
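
The selection step could look roughly like the following Python sketch. The 
URL patterns and configuration names are invented for illustration, and the 
first-match-wins rule is an assumption, not something the proposal specifies.

```python
import re

# Hypothetical table: URL pattern -> name of the XPath configuration to use.
CONFIGS = [
    (re.compile(r"^https?://news\.example\.com/articles/"), "news-article"),
    (re.compile(r"^https?://blog\.example\.org/\d{4}/"), "blog-post"),
]


def select_config(url, configs=CONFIGS):
    """Return the name of the first configuration whose pattern matches url."""
    for pattern, name in configs:
        if pattern.search(url):
            return name
    return None  # no configuration applies to this URL
```

A page whose URL matches no pattern gets None here; a plugin could then fall 
back to the ordinary full-text parsing.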

This automatic mechanism allows Nutch users to process various web pages and 
keep only the information they want, making the index more accurate and its 
content more flexible.

The components for this idea have been tested on Nutch 1.2, selecting certain 
elements from various news websites for the purpose of document clustering. 
This includes a Configuration Editor application built using the NetBeans 6.9 
Application Framework, though it still needs some debugging.

http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
