[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855370#comment-13855370 ]
Otis Gospodnetic commented on NUTCH-1686:
-----------------------------------------
{code}
- private final static Utf8 CASH_KEY = new Utf8("_csh_");
-
+ public static final Utf8 CASH_KEY = new Utf8("c");
{code}
Is this going to cause any backwards compatibility issues by any chance?
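
To make the compatibility concern concrete, here is a minimal sketch of a read path that tolerates both key names. It is not taken from the patch: the class `CashKeyCompat`, the plain `Map<String, ByteBuffer>` standing in for the stored metadata, and the 4-byte float encoding are all assumptions for illustration. Only the two key strings come from the diff above.

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reads the cash value under the new
// key and falls back to the old key for rows written before the rename.
final class CashKeyCompat {
    static final String OLD_CASH_KEY = "_csh_"; // key removed in the diff
    static final String NEW_CASH_KEY = "c";     // key added in the diff

    private CashKeyCompat() {}

    // Returns the stored cash, preferring the new key; 0.0f if neither exists.
    // Assumes the value is stored as a 4-byte big-endian float.
    static float readCash(Map<String, ByteBuffer> metadata) {
        ByteBuffer raw = metadata.get(NEW_CASH_KEY);
        if (raw == null) {
            raw = metadata.get(OLD_CASH_KEY);
        }
        // duplicate() leaves the stored buffer's position untouched.
        return raw == null ? 0.0f : raw.duplicate().getFloat();
    }

    // Writes under the new key and drops any stale value under the old key.
    static void writeCash(Map<String, ByteBuffer> metadata, float cash) {
        ByteBuffer buf = ByteBuffer.allocate(Float.BYTES);
        buf.putFloat(cash);
        buf.flip();
        metadata.put(NEW_CASH_KEY, buf);
        metadata.remove(OLD_CASH_KEY);
    }
}
```

With a fallback of this kind, rows written by an older version would still be readable and would migrate to the new key the next time they are written. Without one, cash stored under `_csh_` would presumably be ignored after the upgrade.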
> Optimize UpdateDb to load less field from Store
> -----------------------------------------------
>
> Key: NUTCH-1686
> URL: https://issues.apache.org/jira/browse/NUTCH-1686
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.3
> Reporter: Nguyen Manh Tien
> Fix For: 2.3
>
> Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly,
> especially the map task, which loads data from the store.
> We can't filter by batchId to load fewer URLs because of the bug in
> NUTCH-1679, so we must always update the whole table.
> After checking the fields loaded in UpdateDbJob, I found that it loads many
> fields from the store (at least 15 of 25), which makes updatedb slow.
> I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, and
> METADATA, which are used to compute the link score and distance; I think
> that is the main purpose of this job.
> The other fields are used to compute the URL schedule for the parser and
> fetcher. We can move that code to the Parser or Fetcher without loading many
> new fields, because many of the fields are generated by the parser. We can
> also use a Gora filter in the Fetcher or Parser, so loading new fields is
> not a problem.
> I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE,
> which are currently stored in METADATA. CASH is used in OPICScoring, which
> is used only in UpdateDb, and DISTANCE is used only in the Generator and
> Updater, so moving both into a new metadata field avoids reading METADATA in
> the Generator and Updater. METADATA contains a lot of data that is used only
> by the Parser and Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, and
> MARKERS; we don't need to load the big Fetch family or INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).
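
The per-job field selection described in the issue can be sketched as follows. This is an illustration under assumptions, not code from the attached patch: the class `JobFieldSets` and its `Field` enum are hypothetical, the enum lists only field names mentioned in the issue text (not the full WebPage schema), and the Generator set shows only the field the issue names, not everything that job reads.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration: each job declares the fields it needs, so the
// store query can skip the rest.
final class JobFieldSets {
    // Field names taken from the issue text only; not the full schema.
    enum Field { SCORE, SCOREMETA, OUTLINKS, INLINKS, MARKERS, METADATA }

    // Fields the issue proposes UpdateDb should load after the change.
    static final Set<Field> UPDATE_DB = Collections.unmodifiableSet(
        EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS));

    // The issue only says Generator reads SCOREMETA in place of METADATA;
    // any other fields that job loads are deliberately left out here.
    static final Set<Field> GENERATOR_SCORE_FIELDS = Collections.unmodifiableSet(
        EnumSet.of(Field.SCOREMETA));

    private JobFieldSets() {}

    // Converts a field set to the array of names a store query would be
    // restricted to, in enum declaration order.
    static String[] toFieldNames(Set<Field> fields) {
        return fields.stream().map(Field::name).toArray(String[]::new);
    }
}
```

Keeping the sets next to each other makes it easy to check that neither job pulls in METADATA or INLINKS, which is the saving the issue is after.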