[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604424#comment-13604424
 ] 

Lewis John McGibbney edited comment on NUTCH-1533 at 3/16/13 9:29 PM:
----------------------------------------------------------------------

Hi Feng,

{quote}
I see that prevFetchTime is not fed into schedule#setPageRetrySchedule, 
so I also did not feed prevModifiedTime into it. What do you think about it?
{quote}
I don't quite understand you here. I did not mention prevFetchTime; we are 
talking solely about *long prevModifiedTime*. Can you please expand on 
your comment?
 
* My point is as follows: so far this patch (correctly) accounts for the 
CrawlStatus.STATUS_NOTMODIFIED case. However, it does not account for 
CrawlStatus.STATUS_RETRY and CrawlStatus.STATUS_GONE, which call 
setPageRetrySchedule(String url, WebPage page, long prevFetchTime, *long 
prevModifiedTime*, long fetchTime) and setPageGoneSchedule(String url, WebPage 
page, long prevFetchTime, *long prevModifiedTime*, long fetchTime) respectively.

As you can see above, both method calls take a long prevModifiedTime 
argument, and it is currently set to 0L in both, which IMHO is incorrect.
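
To illustrate the change I have in mind, here is a minimal, self-contained sketch. The WebPage and RecordingSchedule classes, the Status enum and the updateSchedule helper are simplified stand-ins invented for this example, not the real Nutch classes, and the real call site may look different. It only shows the value read from the page being forwarded for the retry and gone cases rather than a hard-coded 0L.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Nutch classes.
public class PrevModifiedTimeSketch {

    /** Simplified stand-in for a page record holding the two timestamps. */
    public static class WebPage {
        private long prevFetchTime;
        private long prevModifiedTime;

        public long getPrevFetchTime() { return prevFetchTime; }
        public void setPrevFetchTime(long t) { prevFetchTime = t; }
        public long getPrevModifiedTime() { return prevModifiedTime; }
        public void setPrevModifiedTime(long t) { prevModifiedTime = t; }
    }

    /** Stand-in schedule that records the prevModifiedTime it is handed. */
    public static class RecordingSchedule {
        public final List<Long> seenPrevModifiedTimes = new ArrayList<>();

        public void setPageRetrySchedule(String url, WebPage page,
                long prevFetchTime, long prevModifiedTime, long fetchTime) {
            seenPrevModifiedTimes.add(prevModifiedTime);
        }

        public void setPageGoneSchedule(String url, WebPage page,
                long prevFetchTime, long prevModifiedTime, long fetchTime) {
            seenPrevModifiedTimes.add(prevModifiedTime);
        }
    }

    /** Hypothetical status values mirroring the two cases discussed above. */
    public enum Status { RETRY, GONE }

    /** Forwards the page's stored timestamps instead of passing 0L. */
    public static void updateSchedule(RecordingSchedule schedule, Status status,
            String url, WebPage page, long fetchTime) {
        // Assumption: both previous timestamps are read from the page.
        long prevFetchTime = page.getPrevFetchTime();
        long prevModifiedTime = page.getPrevModifiedTime();
        switch (status) {
            case RETRY:
                schedule.setPageRetrySchedule(url, page, prevFetchTime,
                        prevModifiedTime, fetchTime);
                break;
            case GONE:
                schedule.setPageGoneSchedule(url, page, prevFetchTime,
                        prevModifiedTime, fetchTime);
                break;
        }
    }

    public static void main(String[] args) {
        WebPage page = new WebPage();
        page.setPrevFetchTime(500L);
        page.setPrevModifiedTime(1000L);

        RecordingSchedule schedule = new RecordingSchedule();
        updateSchedule(schedule, Status.RETRY, "http://example.org/", page, 2000L);
        updateSchedule(schedule, Status.GONE, "http://example.org/", page, 2000L);

        // Both calls should have received the stored value, not 0L.
        System.out.println(schedule.seenPrevModifiedTimes);
    }
}
```

Whether prevFetchTime should be forwarded in the same way is a separate question; the sketch does so only to keep the two arguments symmetric.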

Do you have a comment on this? 

With regard to point two, I agree with you. We should address this in a 
different issue if and when one wishes to do so. Thanks for the insight. 
                
> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1533
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1533
>             Project: Nutch
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime values, 
> yet we incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
