[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603747#comment-13603747
 ] 

Lewis John McGibbney edited comment on NUTCH-1533 at 3/15/13 7:39 PM:
----------------------------------------------------------------------

Hi lufeng, great work. I have uploaded a new patch for this; my comments are below:

Added
-----
* Correct mappings for other Gora datastores.
* License headers for WebPage classes generated by GoraCompiler.

Issues
-----
* I have a concern about the following cases in DbUpdateReducer#reduce():
{code}
      case CrawlStatus.STATUS_RETRY:
        schedule.setPageRetrySchedule(url, page, 0L, 0L, page.getFetchTime());
        if (page.getRetriesSinceFetch() < retryMax) {
          page.setStatus(CrawlStatus.STATUS_UNFETCHED);
        } else {
          page.setStatus(CrawlStatus.STATUS_GONE);
        }
        break;
      case CrawlStatus.STATUS_GONE:
        schedule.setPageGoneSchedule(url, page, 0L, 0L, page.getFetchTime());
        break;
{code}

Both calls still pass a hard-coded 0L to represent prevModifiedTime, and that 
value is fed into the respective FetchSchedule.

* Is the Host table affected by batchId at all? If so, do we want to add a 
batchId field to the Host table metadata?
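
To illustrate the first issue, here is a minimal sketch of passing the stored value instead of the literal 0L. It is only an illustration under stated assumptions: Page, GoneSchedule and RecordingSchedule are hypothetical stand-ins rather than the real Nutch classes, getPrevModifiedTime() is the accessor this issue proposes, and the parameter order (prevFetchTime, prevModifiedTime, fetchTime) is assumed.

```java
// Stand-in types only: the real Nutch WebPage and FetchSchedule are not used here.
final class PrevModifiedTimeSketch {

  /** Hypothetical, minimal stand-in for o.a.n.storage.WebPage. */
  static final class Page {
    private long fetchTime;
    private long prevModifiedTime;

    long getFetchTime() { return fetchTime; }
    void setFetchTime(long t) { fetchTime = t; }

    // Accessors proposed by this issue (assumed shape).
    long getPrevModifiedTime() { return prevModifiedTime; }
    void setPrevModifiedTime(long t) { prevModifiedTime = t; }
  }

  /** Hypothetical stand-in for the FetchSchedule call; parameter order is assumed. */
  interface GoneSchedule {
    void setPageGoneSchedule(String url, Page page,
        long prevFetchTime, long prevModifiedTime, long fetchTime);
  }

  /** Records the prevModifiedTime it receives so the call can be inspected. */
  static final class RecordingSchedule implements GoneSchedule {
    long seenPrevModifiedTime = -1L;

    @Override
    public void setPageGoneSchedule(String url, Page page,
        long prevFetchTime, long prevModifiedTime, long fetchTime) {
      seenPrevModifiedTime = prevModifiedTime;
    }
  }

  /** Passes the value stored on the page instead of the literal 0L. */
  static void updateGone(GoneSchedule schedule, String url, Page page) {
    schedule.setPageGoneSchedule(
        url, page, 0L, page.getPrevModifiedTime(), page.getFetchTime());
  }

  public static void main(String[] args) {
    Page page = new Page();
    page.setFetchTime(2000L);
    page.setPrevModifiedTime(1000L);

    RecordingSchedule schedule = new RecordingSchedule();
    updateGone(schedule, "http://example.com/", page);
    System.out.println(schedule.seenPrevModifiedTime);
  }
}
```

The STATUS_RETRY case could be changed the same way. The first 0L (assumed here to be prevFetchTime) is left untouched in the sketch because the concern above is only about prevModifiedTime.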
                
                  
> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1533
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1533
>             Project: Nutch
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime values 
> but incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.
