[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
 ] 

Tien Nguyen Manh commented on NUTCH-1686:
-----------------------------------------

In this patch I also fixed a bug with fetchTime. Currently, each time we run 
updatedb over the whole table, fetchTime is increased again for all URLs.

> Optimize UpdateDb to load less field from Store
> -----------------------------------------------
>
>                 Key: NUTCH-1686
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1686
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Tien Nguyen Manh
>             Fix For: 2.3
>
>         Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly, 
> especially the Map task, which loads data from the store.
> We can't filter by batchId to load fewer URLs because of the bug in 
> NUTCH-1679, so we must always update the whole table.
> After checking the fields loaded in UpdateDbJob I found that it loads many 
> fields from the store (at least 15 of 25 fields), which makes updatedb slow.
> I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, and 
> METADATA, which are used to compute link score and distance. That, I think, 
> is the main purpose of this job.
> The other fields are used to compute the URL schedule. We can move that code 
> to the Parser or Fetcher without loading many new fields, because many of 
> those fields are generated by the parser. We can also use a Gora filter for 
> the Fetcher or Parser, so loading new fields is not a problem.
> I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE, 
> which are currently stored in METADATA. CASH is used by OPICScoring, which is 
> used only in UpdateDb, and DISTANCE is used only in the Generator and 
> Updater, so moving both to a new metadata field avoids reading METADATA in 
> the Generator and Updater. METADATA contains a lot of data that is used only 
> by the Parser and Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, and MARKERS; 
> we don't need to load the big Fetch family and INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).
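
The per-job field sets described above could be sketched roughly as follows. This is only an illustration, not code from the attached patch: the `Field` enum is a hypothetical stand-in for the WebPage field enum, and the method names are assumptions made for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only; not taken from NUTCH-1686.patch.
class UpdateDbFieldsSketch {

    // Hypothetical stand-in for the WebPage field enum. Only a few of the
    // field names mentioned in the issue are listed here.
    enum Field { SCORE, SCOREMETA, OUTLINKS, MARKERS, METADATA, INLINKS, FETCH_TIME }

    // Fields UpdateDb would request from the store under the proposal.
    static Set<Field> updateDbFields() {
        return EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS);
    }

    // Fields the Generator would request: SCOREMETA instead of the larger METADATA.
    static Set<Field> generatorFields() {
        return EnumSet.of(Field.SCOREMETA);
    }

    public static void main(String[] args) {
        System.out.println("UpdateDb loads:  " + updateDbFields());
        System.out.println("Generator loads: " + generatorFields());
    }
}
```

The point of keeping each job's set explicit is that anything left out of it (here METADATA and INLINKS for UpdateDb) is never read from the store by that job.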



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
