[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604424#comment-13604424
 ] 

Lewis John McGibbney edited comment on NUTCH-1533 at 3/16/13 9:29 PM:
----------------------------------------------------------------------

Hi Feng,

> i see that prevFetchTime is not fed into the schedule#setPageRetrySchedule, 
> so i also not fed prevModifiedTime into it. How do your think about it?

I don't quite understand you here. I did not mention prevFetchTime; we are 
talking solely about *long prevModifiedTime* here. Could you please expand on 
your comment?
 
* My point is as follows: so far this patch (correctly) accounts for the 
CrawlStatus.STATUS_NOTMODIFIED case. However, it does not account for 
CrawlStatus.STATUS_RETRY and CrawlStatus.STATUS_GONE, whose handling calls 
setPageRetrySchedule(String url, WebPage page, long prevFetchTime, *long 
prevModifiedTime*, long fetchTime) and setPageGoneSchedule(String url, WebPage 
page, long prevFetchTime, *long prevModifiedTime*, long fetchTime) respectively.

As you can see above, both methods take a long prevModifiedTime parameter, but 
the argument currently passed for it in both calls is 0L, which IMHO is 
incorrect.
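
To make the difference concrete, here is a minimal, self-contained sketch. 
Everything in it is a hypothetical stand-in: the WebPage and RecordingSchedule 
classes below are not the real o.a.n.storage.WebPage or 
o.a.n.crawl.FetchSchedule, and the two helper methods are only my illustration 
of "pass 0L" versus "pass the stored value". The prevFetchTime argument is 
left at 0L in both helpers, based on the remark quoted above that it is not 
fed into the call.

```java
// Hypothetical stand-ins, NOT the real Nutch classes. They carry just enough
// structure to show which value reaches the schedule as prevModifiedTime.
class RetryScheduleSketch {

    /** Minimal page holding the timestamp this issue wants to store. */
    static class WebPage {
        private long prevModifiedTime;

        long getPrevModifiedTime() { return prevModifiedTime; }
        void setPrevModifiedTime(long t) { prevModifiedTime = t; }
    }

    /** Records the prevModifiedTime it receives so the caller's choice is visible. */
    static class RecordingSchedule {
        long lastPrevModifiedTime = -1L;

        void setPageRetrySchedule(String url, WebPage page, long prevFetchTime,
                                  long prevModifiedTime, long fetchTime) {
            lastPrevModifiedTime = prevModifiedTime;
        }

        void setPageGoneSchedule(String url, WebPage page, long prevFetchTime,
                                 long prevModifiedTime, long fetchTime) {
            lastPrevModifiedTime = prevModifiedTime;
        }
    }

    /** The behaviour I am questioning: prevModifiedTime is hard-coded to 0L. */
    static void retryWithZero(RecordingSchedule schedule, String url,
                              WebPage page, long fetchTime) {
        schedule.setPageRetrySchedule(url, page, 0L, 0L, fetchTime);
    }

    /** The alternative: pass the value stored on the page via the new accessor. */
    static void retryWithStoredValue(RecordingSchedule schedule, String url,
                                     WebPage page, long fetchTime) {
        schedule.setPageRetrySchedule(url, page, 0L,
                                      page.getPrevModifiedTime(), fetchTime);
    }

    public static void main(String[] args) {
        WebPage page = new WebPage();
        page.setPrevModifiedTime(1363467000000L); // arbitrary example timestamp

        RecordingSchedule schedule = new RecordingSchedule();

        retryWithZero(schedule, "http://example.org/", page, 1363470000000L);
        System.out.println("hard-coded: " + schedule.lastPrevModifiedTime);

        retryWithStoredValue(schedule, "http://example.org/", page, 1363470000000L);
        System.out.println("from page:  " + schedule.lastPrevModifiedTime);
    }
}
```

The same pattern would apply to the STATUS_GONE case with setPageGoneSchedule.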

Do you have a comment on this? 

With regard to point two, I agree with you. We should address this in a 
different issue if and when someone wishes to do so. Thanks for the insight. 
                
> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1533
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1533
>             Project: Nutch
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime's but 
> incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
