[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227963#comment-13227963
 ] 

Lewis John McGibbney commented on NUTCH-882:
--------------------------------------------

Mathijs, my opinion is that you have a clean sheet of paper for certain 
aspects of this one (simply because you've stepped up to take it on). 
You obviously have your own idea of how you would like the new host table 
design to look, and also a justification for the eventual implementation 
(and API break/redesign) of NutchContext. I think it's wise NOT to break 
the plugin API at this stage, and an incremental approach to this issue is 
a suitable strategy. Feel free to open another issue for NutchContext, as 
quite rightly this appears to have now morphed into its own subdomain of 
the umbrella issue. 
                
> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for:
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagating them to the webpages
> * keeping a copy of the robots.txt and possibly using that later to filter the 
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 


        
