Nguyen Manh Tien created NUTCH-1686:
---------------------------------------

             Summary: Optimize UpdateDb to load less field from Store
                 Key: NUTCH-1686
                 URL: https://issues.apache.org/jira/browse/NUTCH-1686
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 2.3
            Reporter: Nguyen Manh Tien
         Attachments: NUTCH-1686.patch

While running a large crawl I found that updatedb runs very slowly, especially 
the Map task, which loads data from the store.
We can't filter by batchId to load fewer URLs because of the bug in NUTCH-1679, 
so we must always update the whole table.

After checking the fields loaded in UpdateDbJob, I found that it loads many 
fields from the store (at least 15 of 25), which makes updatedb slow.

I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS and 
METADATA, which are used to compute the link score and distance. I think that 
is the main purpose of this job.
The other fields are used to compute the URL schedule. We can move that code to 
the Parser or Fetcher without loading many new fields, because many of those 
fields are generated by the parser. We can also use a Gora filter for the 
Fetcher or Parser, so loading new fields there is not a problem.

I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE, 
which are currently stored in METADATA. CASH is used by OPICScoring, which is 
used only in UpdateDb, and DISTANCE is used only in the Generator and Updater, 
so moving both to the new metadata field avoids reading METADATA in the 
Generator and Updater. METADATA contains a lot of data that is used only by the 
Parser and Indexer.
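
To illustrate the SCOREMETA idea, here is a minimal sketch, not the code from 
the attached patch. It assumes metadata is a plain Map<String, String>, and the 
key names "cash" and "distance" are placeholders chosen for the example. It 
only shows the two scoring entries being moved out of a large metadata map into 
a separate, small map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoreMetaSplitSketch {

  // Placeholder key names, assumed for this example only.
  static final String CASH_KEY = "cash";
  static final String DISTANCE_KEY = "distance";
  static final List<String> SCORE_KEYS = Arrays.asList(CASH_KEY, DISTANCE_KEY);

  /**
   * Moves the scoring entries out of the general metadata map and returns
   * them in a new, small map (the role SCOREMETA would play).
   * The metadata map passed in is modified in place.
   */
  static Map<String, String> extractScoreMeta(Map<String, String> metadata) {
    Map<String, String> scoreMeta = new HashMap<>();
    for (String key : SCORE_KEYS) {
      String value = metadata.remove(key);
      if (value != null) {
        scoreMeta.put(key, value);
      }
    }
    return scoreMeta;
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put(CASH_KEY, "0.25");
    metadata.put(DISTANCE_KEY, "3");
    metadata.put("parser-only-entry", "some large value");

    Map<String, String> scoreMeta = extractScoreMeta(metadata);
    System.out.println("scoreMeta entries: " + scoreMeta.size());
    System.out.println("metadata entries left: " + metadata.size());
  }
}
```

With such a split, a job that only needs CASH or DISTANCE can read the small 
map and skip the rest of METADATA.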

So with the new change:
UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS and MARKERS; we 
don't need to load the big Fetch family or INLINKS.
Generator only loads SCOREMETA (which is smaller than the current METADATA).
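
The reduced UpdateDb field set listed above could be declared roughly as 
follows. This is a sketch under assumptions, not the patch itself: the Field 
enum is an abbreviated, hypothetical stand-in for WebPage.Field, and SCOREMETA 
is the field proposed in this issue. Turning the set into an array of names is 
shown because a store query is typically restricted by field name, but no Gora 
call is made here.

```java
import java.util.EnumSet;
import java.util.Set;

public class UpdateDbFieldSketch {

  /** Abbreviated, hypothetical stand-in for Nutch's WebPage.Field. */
  enum Field {
    STATUS, FETCH_TIME, FETCH_INTERVAL, SIGNATURE, SCORE,
    OUTLINKS, INLINKS, MARKERS, METADATA,
    SCOREMETA // proposed in this issue; would hold CASH and DISTANCE
  }

  /** The fields UpdateDb would load under the proposal. */
  static Set<Field> updateDbFields() {
    return EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS);
  }

  /** Converts a field set to field names, in enum declaration order. */
  static String[] toFieldNames(Set<Field> fields) {
    String[] names = new String[fields.size()];
    int i = 0;
    for (Field f : fields) {
      names[i++] = f.name();
    }
    return names;
  }

  public static void main(String[] args) {
    for (String name : toFieldNames(updateDbFields())) {
      System.out.println(name);
    }
  }
}
```

INLINKS, METADATA and the fetch-related fields are absent from the set, which 
is where the expected savings in the Map task would come from.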




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
