[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142 ]
Tien Nguyen Manh commented on NUTCH-1686:
-----------------------------------------

In this patch I also fixed a bug with fetchTime. Currently, each time we run updatedb over the whole table, fetchTime is increased again for all URLs.

> Optimize UpdateDb to load less field from Store
> -----------------------------------------------
>
>                 Key: NUTCH-1686
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1686
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Tien Nguyen Manh
>             Fix For: 2.3
>
>         Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly,
> especially the Map task, which loads data from the store.
> We can't filter by batchId to load fewer URLs because of the bug in
> NUTCH-1679, so we must always update the whole table.
> After checking the fields loaded in UpdateDbJob, I found that it loads many
> fields from the store (at least 15 of 25 fields), which makes updatedb slow.
> I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, and
> METADATA, which are used to compute the link score and distance. I think
> that is the main purpose of this job.
> The other fields are used to compute the URL schedule for the parser and
> fetcher. We can move that code to the Parser or Fetcher without loading many
> new fields, because many of those fields are generated by the parser. We can
> also use a Gora filter for the Fetcher or Parser, so loading new fields is
> not a problem.
> I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE,
> which are currently stored in METADATA. CASH is used in OPICScoring, which
> is used only in UpdateDb, and DISTANCE is used only in the Generator and
> Updater, so moving both into a new metadata field avoids reading METADATA in
> the Generator and Updater. METADATA contains a lot of data that is used only
> by the Parser and Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, MARKERS; we
> don't need to load the big Fetch family and INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).
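
For illustration only, the field split described in the issue could be written down roughly as follows. This is a self-contained sketch and not code from NUTCH-1686.patch: the Field enum is a hypothetical stand-in for the WebPage field list and contains only names mentioned in this issue, and the class and method names are made up. The Generator set lists only the field the issue names; the real job may well need others.

```java
import java.util.EnumSet;
import java.util.Set;

public class FieldSelectionSketch {

    // Hypothetical stand-in for the WebPage fields; only names that appear
    // in the issue text are listed here.
    enum Field { SCORE, SCOREMETA, OUTLINKS, MARKERS, METADATA, INLINKS }

    // Fields UpdateDb would load after the change, per the issue text.
    static Set<Field> updateDbFields() {
        return EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS);
    }

    // Field the issue names for the Generator (assumption: others omitted).
    static Set<Field> generatorFields() {
        return EnumSet.of(Field.SCOREMETA);
    }

    // Convert a field set to the plain names a store query would be given
    // as its projection.
    static String[] toFieldNames(Set<Field> fields) {
        return fields.stream().map(Field::name).toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println("UpdateDb loads:  " + String.join(", ", toFieldNames(updateDbFields())));
        System.out.println("Generator loads: " + String.join(", ", toFieldNames(generatorFields())));
    }
}
```

The point of keeping the sets explicit is that the large METADATA and INLINKS columns never appear in either projection, which is where the expected saving in the Map task comes from.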