CrawlDatum status and CrawlDbReducer refactoring
------------------------------------------------

                 Key: NUTCH-416
                 URL: http://issues.apache.org/jira/browse/NUTCH-416
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.9.0


CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, 
the current status code values are consecutive, which leaves no room to insert 
new codes in their proper places. This is also tied to the logic in 
CrawlDbReducer, which makes decisions based on the arithmetic ordering of 
status code values.

I propose to change the codes so that they are grouped into related values, 
with sizable gaps between groups so that new codes can be added without 
major reordering. I also propose to change the logic in CrawlDbReducer so 
that its operation depends less on the actual code values.
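
As a rough illustration of the grouping idea, here is a minimal sketch. The
class name, constant names and numeric values are assumptions made up for this
example, not the codes actually chosen for Nutch. Each group gets its own
range, and callers ask which group a code belongs to instead of comparing raw
values.

```java
// Hypothetical sketch: names and numeric values are illustrative only.
final class StatusGroups {
    // "db" group: states stored in the crawl db. Values up to DB_MAX are reserved.
    static final byte DB_UNFETCHED = 0x01;
    static final byte DB_FETCHED   = 0x02;
    static final byte DB_GONE      = 0x03;
    static final byte DB_MAX       = 0x1f;

    // "fetch" group: states reported by the fetcher. Values up to FETCH_MAX are reserved.
    static final byte FETCH_SUCCESS = 0x21;
    static final byte FETCH_RETRY   = 0x22;
    static final byte FETCH_GONE    = 0x23;
    static final byte FETCH_MAX     = 0x3f;

    private StatusGroups() {}

    // True if the code falls in the range reserved for the "db" group.
    static boolean isDbStatus(byte status) {
        return status > 0 && status <= DB_MAX;
    }

    // True if the code falls in the range reserved for the "fetch" group.
    static boolean isFetchStatus(byte status) {
        return status > DB_MAX && status <= FETCH_MAX;
    }
}
```

With this layout a new code such as a hypothetical 0x04 for redirects lands in
the "db" range without renumbering anything, and reducer-style logic can
branch on isDbStatus() / isFetchStatus() rather than on arithmetic ordering.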

A mapping should also be added between old and new codes to facilitate 
backward-compatibility of existing data. This mapping should be applied on the 
fly, without requiring explicit data conversion.
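
One way the on-the-fly mapping might look is sketched below. It assumes that
each stored record carries a format version, so old records can be recognized
when they are read. The class name, version numbers and code values are
hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: version numbers and code values are assumptions.
final class StatusCompat {
    static final byte OLD_VERSION = 4; // assumed version of records using old codes
    static final byte CUR_VERSION = 5; // assumed version of records using new codes

    // Lookup table from old consecutive codes to new grouped codes.
    private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
    static {
        OLD_TO_NEW.put((byte) 1, (byte) 0x01); // db unfetched
        OLD_TO_NEW.put((byte) 2, (byte) 0x02); // db fetched
        OLD_TO_NEW.put((byte) 3, (byte) 0x03); // db gone
        OLD_TO_NEW.put((byte) 5, (byte) 0x21); // fetch success
        OLD_TO_NEW.put((byte) 6, (byte) 0x22); // fetch retry
        OLD_TO_NEW.put((byte) 7, (byte) 0x23); // fetch gone
    }

    private StatusCompat() {}

    // Called while deserializing a record: current records pass through
    // unchanged, old records have their status translated in memory.
    static byte upgrade(byte recordVersion, byte status) {
        if (recordVersion >= CUR_VERSION) {
            return status;
        }
        Byte mapped = OLD_TO_NEW.get(status);
        if (mapped == null) {
            throw new IllegalArgumentException("Unknown old status code: " + status);
        }
        return mapped;
    }
}
```

Because the translation happens at read time, existing data stays untouched on
disk and is rewritten with new codes only when it is next written out.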

