[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

Koen Smets (JIRA) Tue, 25 Feb 2014 04:07:19 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911504#comment-13911504
 ]


Koen Smets commented on NUTCH-1679:
-----------------------------------

Added a patch that skips updating pages that already exist in the data 
store, and added a TODO for merging inlinks with inlinkedScoreData.

For a more permanent solution, I need some advice on how to handle 
re-scoring links when merging the inlinks already present on those pages with 
the ones in inlinkedScoreData. The current code base assumes that all 
page.getInlinks() are present in inlinkedScoreData, which is true when using 
the `-all` parameter, but definitely is not the case when using the batchId 
parameter. The code in question currently clears the existing inlinks:

if (page.getInlinks() != null) {
        page.getInlinks().clear();
}
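
A minimal sketch of what merging instead of clearing could look like, under 
stated assumptions: inlinks are modelled as a plain map from source URL to 
anchor text, String stands in for whatever key and value types the page 
actually uses, and InlinkMergeSketch and mergeInlinks are hypothetical names 
that are not part of Nutch. It deliberately leaves the re-scoring question 
open and only shows that stored inlinks survive a batch that does not 
mention them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: keeps inlinks already stored on a
// page and overlays the inlinks seen in the current batch, instead of
// clearing the stored ones first.
final class InlinkMergeSketch {
    private InlinkMergeSketch() {}

    /**
     * Returns a new map containing every stored inlink plus every inlink
     * from the current batch. When both maps contain the same source URL,
     * the anchor text from the batch wins.
     */
    static Map<String, String> mergeInlinks(Map<String, String> stored,
                                            Map<String, String> fromBatch) {
        Map<String, String> merged = new HashMap<>();
        if (stored != null) {
            merged.putAll(stored);
        }
        if (fromBatch != null) {
            merged.putAll(fromBatch);
        }
        return merged;
    }
}
```

With `-all`, every stored inlink would also appear in the batch, so the 
result would match the current clear-and-refill behaviour; with batchId, the 
stored inlinks that are missing from the batch would be kept.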

Any ideas?

> UpdateDb using batchId, link may override crawled page.
> -------------------------------------------------------
>
>                 Key: NUTCH-1679
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1679
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Tien Nguyen Manh
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: NUTCH-1679.patch
>
>
> The problem is in the HBase store; I am not sure about other stores.
> Suppose in the first crawl cycle we crawl link A and get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store and add A as a new 
> link, because updatedb does not know that A already exists in the store, so it 
> overrides A.
> UpdateDb must be run without batchId, or we must set additionsAllowed=false.
> Here is the code for a new page:
>       page = new WebPage();
>       schedule.initializeSchedule(url, page);
>       page.setStatus(CrawlStatus.STATUS_UNFETCHED);
>       try {
>         scoringFilters.initialScore(url, page);
>       } catch (ScoringFilterException e) {
>         page.setScore(0.0f);
>       }
> The new page will override the old page's status, score, fetchTime, 
> fetchInterval, retries, and metadata[CASH_KEY].
> - I think we can change something here so that a new page only updates one 
> column, for example 'link', and if it really is a new page, we can initialize 
> all of the above fields in the generator
> - or we add a checkAndPut operator to the store, so that when adding a new 
> page we first check whether it already exists
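
The second option in the quoted description (check before writing) can be 
illustrated with a small in-memory stand-in for the page store. This is a 
sketch under assumptions, not the Nutch or Gora API: CheckAndPutSketch, Page, 
put, and putIfAbsent are hypothetical names, and a ConcurrentHashMap plays the 
role of the store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the page store; all names here are hypothetical.
final class CheckAndPutSketch {
    /** Minimal page record: only the status field matters for this example. */
    static final class Page {
        final String status;

        Page(String status) {
            this.status = status;
        }
    }

    private final Map<String, Page> store = new ConcurrentHashMap<>();

    /** Unconditional write: replaces whatever is stored under the URL. */
    void put(String url, Page page) {
        store.put(url, page);
    }

    /**
     * Conditional write: stores the page only if the URL has no row yet.
     * Returns true if the page was written, false if a row already existed.
     */
    boolean putIfAbsent(String url, Page page) {
        return store.putIfAbsent(url, page) == null;
    }

    Page get(String url) {
        return store.get(url);
    }
}
```

In the scenario above, page A is already stored as fetched when the second 
updatedb sees it as an outlink of B. A conditional write of a fresh unfetched 
page for A would return false and leave the stored row untouched, whereas an 
unconditional put would replace it.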



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
