Michael Coffey created NUTCH-2379:
-------------------------------------
Summary: crawl script dedup's crawldb update is slow
Key: NUTCH-2379
URL: https://issues.apache.org/jira/browse/NUTCH-2379
Project: Nutch
Issue Type: Bug
Components: bin
Affects Versions: 1.11
Environment: shell
Reporter: Michael Coffey
Priority: Minor

In the standard crawl script, there is a _bin_nutch updatedb command and, soon
after that, a _bin_nutch dedup command. Both of them launch Hadoop jobs with
"crawldb /path/to/crawl/db" in their names; dedup launches its one in addition
to the actual deduplication job.

In my situation, the "crawldb" job launched by dedup takes twice as long as the
one launched by updatedb.

I notice that the script passes $commonOptions to updatedb but not to dedup. I
suspect that the crawldb update launched by dedup may not be compressing its
output.
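
A minimal sketch of the difference, and of one possible change, is below. It
is not the actual crawl script: run_nutch is a hypothetical stand-in that only
prints the command line it would run, the value given to $commonOptions is an
assumption for illustration, and the segment path is a placeholder. Whether
passing these options to dedup actually removes the slowdown has not been
verified here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the crawl script's nutch wrapper. It prints the
# command line instead of running it, so the sketch needs no Nutch or Hadoop.
run_nutch() {
  echo "bin/nutch $*"
}

# Assumed value for illustration only; the real script defines its own
# $commonOptions, which may contain different or additional -D settings.
commonOptions="-D mapred.compress.map.output=true"
CRAWLDB="/path/to/crawl/db"
SEGMENT="/path/to/crawl/segments/SEGMENT"   # placeholder segment path

# As described in this report: updatedb receives $commonOptions, dedup does not.
# $commonOptions is left unquoted on purpose so it splits into separate words.
updatedb_cmd=$(run_nutch updatedb $commonOptions "$CRAWLDB" "$SEGMENT")
dedup_cmd=$(run_nutch dedup "$CRAWLDB")

# Possible change: pass the same options to dedup as well.
dedup_cmd_with_options=$(run_nutch dedup $commonOptions "$CRAWLDB")

echo "updatedb:          $updatedb_cmd"
echo "dedup (current):   $dedup_cmd"
echo "dedup (suggested): $dedup_cmd_with_options"
```

If the suspicion is right, the only change needed in the script would be adding
$commonOptions to the dedup invocation, so that the crawldb job it launches runs
with the same job settings as the one launched by updatedb.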