[jira] [Created] (NUTCH-2379) crawl script dedup's crawldb update is slow

Michael Coffey (JIRA) Mon, 01 May 2017 15:54:00 -0700

Michael Coffey created NUTCH-2379:
-------------------------------------

             Summary: crawl script dedup's crawldb update is slow 
                 Key: NUTCH-2379
                 URL: https://issues.apache.org/jira/browse/NUTCH-2379
             Project: Nutch
          Issue Type: Bug
          Components: bin
    Affects Versions: 1.11
         Environment: shell
            Reporter: Michael Coffey
            Priority: Minor



In the standard crawl script, there is a __bin_nutch updatedb command and, soon 
after that, a __bin_nutch dedup command. Each of them launches a Hadoop job with 
"crawldb /path/to/crawl/db" in its name (dedup launches this job in addition to 
the actual deduplication job).

In my situation, the "crawldb" job launched by dedup takes twice as long as the 
one launched by updatedb.

I notice that the script passes $commonOptions to updatedb but not to dedup. I 
suspect that the crawldb update launched by dedup may not be compressing its 
output.
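
A minimal dry-run sketch of the change this suggests, assuming dedup accepts 
the same generic Hadoop -D options as updatedb (not verified here). The wrapper 
name nutch_dry_run and the contents of commonOptions are hypothetical stand-ins; 
the real values are defined in the crawl script. The sketch only prints the 
commands and does not launch any Hadoop job.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of launching Hadoop jobs.

CRAWL_DB="/path/to/crawl/db"   # placeholder path, as in the report

# Assumption: illustrative contents only; the real $commonOptions is
# defined in the crawl script and may differ.
commonOptions="-D mapreduce.map.output.compress=true"

# Hypothetical stand-in for the script's nutch wrapper.
nutch_dry_run() {
  echo "bin/nutch $*"
}

# As reported: dedup is invoked without the shared options.
dedup_current() {
  nutch_dry_run dedup "$CRAWL_DB"
}

# Proposed: pass the same options that updatedb receives.
# $commonOptions is left unquoted on purpose so it splits into arguments.
dedup_proposed() {
  nutch_dry_run dedup $commonOptions "$CRAWL_DB"
}

dedup_current
dedup_proposed
```

Comparing the duration of the "crawldb" job before and after such a change 
would show whether the missing options explain the slowdown.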



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
