[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Lewis John McGibbney (Commented) (JIRA) Tue, 21 Feb 2012 05:36:59 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584
 ]


Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Generally speaking the plugin sounds really useful. The only problem I see is 
that it is very specific. For it to be integrated into the code base, we 
usually need to make it specific enough to address some given task fully and 
in a well defined and well justified manner, but also general enough to be 
used in many different contexts. This increases usability and user feedback as 
well as engagement.

4. With regards to the biggest technical challenge being the processing of web 
pages, how far did you get with this? Were you able to process them with 
enough precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but 
instead with XQuery. Do you maybe mean that this enabled you to navigate the 
document and address various parts of it using XPath?

6. OK, I understand why it has crumbled slightly, but I think if the code is 
there it would be a huge waste if we didn't try to revive it, possibly getting 
it integrated into the code base, or maybe getting it added as a contrib 
component without shipping it within the core codebase if the former was not a 
viable option.

I've had a look at NUTCH-185, but I think we can discard this as it was 
addressed a very long time ago; it's also already integrated into the codebase. 
I was referring more to Jira issues which are currently open, which we could 
maybe merge or combine to give this a more general and possibly more justified 
argument for inclusion in the codebase... what do you think? Does NUTCH-585 
fit this?
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html 
> page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
> for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the 
> contents of a web page by removing HTML tags and components such as 
> JavaScript and CSS, leaving the extracted text to be stored in the index. By 
> default, Nutch cannot select specific elements of an HTML page, such as 
> certain tags, certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches 
> and text as its nodes. These branches and nodes can be extracted using XPath. 
> XPath allows us to select a certain branch or node of an XML document, and 
> can therefore be used to extract certain information and treat it differently 
> based on its content and the user's requirements. Furthermore, a web domain 
> such as a news website usually uses the same HTML structure for storing 
> information across its web pages. Pages sharing that structure can be parsed 
> with the same XPath query to retrieve the same content elements. All of the 
> XPath queries for selecting various content can be stored in an XPath 
> Configuration File.
> Nutch is meant to handle various web sources, and not all web pages 
> retrieved from those sources have the same HTML structure, so they have to be 
> treated differently using the correct XPath configuration. The correct XPath 
> configuration can be selected automatically by matching the URL of the web 
> page against a regex of valid URL patterns for that XPath configuration.
> This automatic mechanism allows Nutch users to process various web pages and 
> get only the information they want, making the index more accurate and its 
> content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor Application built using the 
> NetBeans 6.9 Application Framework, though it needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
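
As a rough illustration of the mechanism described in the issue above (match
a page's URL against a regex to pick an XPath configuration, then evaluate
each XPath expression against the parsed page), a minimal sketch might look
like the following. This is not the plugin's code: the class name, the
in-memory configuration map, the example URL pattern and the XPath
expressions are all hypothetical, and it assumes the page has already been
converted to well-formed XML. It uses only the standard javax.xml.xpath and
java.util.regex APIs.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathConfigSketch {

    // Hypothetical stand-in for the "XPath Configuration File":
    // URL pattern -> (field name -> XPath expression).
    static final Map<Pattern, Map<String, String>> CONFIGS = new LinkedHashMap<>();
    static {
        Map<String, String> news = new LinkedHashMap<>();
        news.put("title", "//h1[@class='headline']");
        news.put("body", "//div[@id='article']/p");
        CONFIGS.put(Pattern.compile("^https?://news\\.example\\.com/.*"), news);
    }

    // Return the first configuration whose URL pattern matches, or null if none does.
    static Map<String, String> selectConfig(String url) {
        for (Map.Entry<Pattern, Map<String, String>> e : CONFIGS.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return null;
    }

    // Evaluate every XPath expression of the matching configuration against the page.
    // Assumes the input is well-formed XML without namespaces.
    static Map<String, String> extract(String url, String xhtml) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        Map<String, String> config = selectConfig(url);
        if (config == null) {
            return fields; // no configuration applies to this URL
        }
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> e : config.entrySet()) {
            // evaluate(String, Object) returns the string value of the first matching node.
            fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1 class=\"headline\">Election results</h1>"
                + "<div id=\"article\"><p>Votes were counted.</p></div></body></html>";
        System.out.println(extract("http://news.example.com/story/1", page));
    }
}
```

Keeping the URL patterns in insertion order means the first matching
configuration wins, which is one simple way to resolve overlapping patterns;
a real plugin would also need to run over the DOM produced by Nutch's HTML
parser rather than over a hand-written XML string.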

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
