Nguyen Manh Tien created NUTCH-1686:
---------------------------------------
Summary: Optimize UpdateDb to load less field from Store
Key: NUTCH-1686
URL: https://issues.apache.org/jira/browse/NUTCH-1686
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.3
Reporter: Nguyen Manh Tien
Attachments: NUTCH-1686.patch
While running a large crawl I found that updatedb runs very slowly, especially
the map tasks, which load data from the store.
We can't filter by batchId to load fewer URLs because of the bug in NUTCH-1679,
so we must always update the whole table.
After checking the fields loaded in UpdateDbJob, I found that it loads many
fields from the store (at least 15 of 25), which makes updatedb slow.
I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS and
METADATA, which are used to compute the link score and distance. I think that
is the main purpose of this job.
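
As a rough illustration of the reduced field list, here is a minimal sketch. The `Field` enum is a hypothetical stand-in that contains only field names mentioned in this issue, and `toQueryFields` is an assumed helper; neither is Nutch's real WebPage API nor the code in the attached patch.

```java
import java.util.EnumSet;
import java.util.Set;

public class UpdateDbFieldsSketch {
    // Hypothetical stand-in for the WebPage field list; only names that
    // appear in this issue are included.
    enum Field { SCORE, OUTLINKS, INLINKS, METADATA, MARKERS }

    // The reduced set proposed for UpdateDbJob: only what is needed to
    // compute link score and distance.
    static Set<Field> updateDbFields() {
        return EnumSet.of(Field.SCORE, Field.OUTLINKS, Field.METADATA);
    }

    // Assumed helper: turns the set into the plain field-name array that a
    // store query would typically be given.
    static String[] toQueryFields(Set<Field> fields) {
        return fields.stream().map(Enum::name).toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", toQueryFields(updateDbFields())));
    }
}
```

Declaring the set in one place keeps the projection explicit, so any field added later has to be justified against the load cost.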
The other fields are used to compute the URL schedule for the parser and
fetcher. We can move that code to the Parser or Fetcher without loading many
new fields, because many of those fields are generated by the parser. We can
also use a Gora filter for the Fetcher or Parser, so loading the new fields is
not a problem.
I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE,
which are currently stored in METADATA. CASH is used in OPICScoring, which is
used only in UpdateDb, and DISTANCE is used only in the Generator and Updater.
Moving both to the new metadata field avoids reading METADATA in the Generator
and Updater; METADATA contains a lot of data that is used only by the Parser
and Indexer.
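
The idea of splitting the scoring entries out of METADATA can be sketched as follows. This is only an illustration under assumptions: the key names "CASH" and "DISTANCE" are placeholders (the real keys in Nutch may differ), values are shown as strings for simplicity, and `extractScoreMeta` is a hypothetical helper, not code from the attached patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ScoreMetaSplitSketch {
    // Placeholder key names; the real keys used by Nutch may differ.
    static final Set<String> SCORE_KEYS = Set.of("CASH", "DISTANCE");

    // Moves the scoring entries out of the given metadata map into a new,
    // smaller map and returns it. The input map is modified in place.
    static Map<String, String> extractScoreMeta(Map<String, String> metadata) {
        Map<String, String> scoreMeta = new HashMap<>();
        for (String key : SCORE_KEYS) {
            String value = metadata.remove(key);
            if (value != null) {
                scoreMeta.put(key, value);
            }
        }
        return scoreMeta;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("CASH", "0.25");
        metadata.put("DISTANCE", "2");
        metadata.put("Content-Type", "text/html");

        Map<String, String> scoreMeta = extractScoreMeta(metadata);
        System.out.println("scoreMeta=" + scoreMeta.size()
            + " metadata=" + metadata.size());
    }
}
```

After the split, a job that only needs scoring data can read the small map and leave the larger METADATA map untouched.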
So with the new change:
UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS and MARKERS; we
don't need to load the big Fetch family or INLINKS.
Generator only loads SCOREMETA (which is smaller than the current METADATA).