RE: Nutch Crawl a Specific List Of URLs (150K)
Hi,

You ran one crawl cycle. Depending on the generator and fetcher settings, you are not guaranteed to fetch 200,000 URLs with only topN specified. Check the logs: the generator will tell you if there are too many URLs for a host or domain. Also check all fetcher logs; they will tell you how much was crawled and why the fetcher likely stopped when it did.

Cheers

-----Original message-----
From: Bin Wang <binwang...@gmail.com>
Sent: Friday 27th December 2013 19:50
To: dev@nutch.apache.org
Subject: Nutch Crawl a Specific List Of URLs (150K)

Hi,

I have a very specific list of URLs, about 140K in total. I switched off `db.update.additions.allowed` so that the crawldb will not be updated, and I assumed I could feed all the URLs to Nutch and that, after one round of fetching, it would finish and leave all the raw HTML files in the segment folder.

However, after I ran this command:

    nohup bin/nutch crawl urls -dir result -depth 1 -topN 20

it ended up with only a small number of URLs:

    TOTAL urls: 872
    retry 0: 872
    min score: 1.0
    avg score: 1.0
    max score: 1.0

I double-checked the log to make sure that every URL passes filtering and normalization. Here is the log:

    2013-12-27 17:55:25,068 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
    2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
    2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: Merging injected urls into crawl db.

I don't know how 140K URLs ended up being 872 in the end...

/usr/bin
--
AWS ubuntu instance
Nutch 1.7
java version 1.6.0_27
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
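To illustrate the point made in the reply (a single cycle fetches only what the generator selects), here is a toy model of a selection step that applies a total limit and an optional per-host cap. It is a sketch for intuition only and is not Nutch's Generator: `GenerateSketch`, `select`, and `maxPerHost` are hypothetical names, and the example URLs are made up. In Nutch 1.x the corresponding knobs are, as far as I know, `-topN` and the `generate.max.count` / `generate.count.mode` properties; check `nutch-default.xml` for your version.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a generate step; hypothetical, not Nutch code. */
class GenerateSketch {
    /**
     * Picks at most topN URLs in list order, allowing at most maxPerHost
     * URLs per host (a negative maxPerHost means no per-host cap).
     */
    static List<String> select(List<String> candidates, int topN, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            if (selected.size() >= topN) {
                break; // total limit reached, the rest waits for a later cycle
            }
            String host = URI.create(url).getHost();
            int seen = perHost.getOrDefault(host, 0);
            if (maxPerHost >= 0 && seen >= maxPerHost) {
                continue; // this host already has its share in this cycle
            }
            perHost.put(host, seen + 1);
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/1", "http://a.example/2",
                "http://a.example/3", "http://b.example/1");
        // Total limit only: prints [http://a.example/1, http://a.example/2]
        System.out.println(select(urls, 2, -1));
        // Per-host cap of 1: prints [http://a.example/1, http://b.example/1]
        System.out.println(select(urls, 10, 1));
    }
}
```

In this model, either limit alone is enough to leave injected URLs unselected after one cycle, which is why the generator log is the first place to look.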
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859275#comment-13859275 ]

Tejas Patil commented on NUTCH-1687:
------------------------------------

This is one good point by [~tiennm]. Although this might not give a significant performance improvement, it would distribute requests fairly across all fetch queues.

Some comments on the patch:
1. Do you really need to make the methods of the CircularLinkedList class thread-safe? The methods in FetchItemQueues that interact with the CircularLinkedList (i.e. getFetchItemQueue and getFetchItem) are all synchronized, so it is ensured that only one thread accesses the list at a time.
2. Why is 'id' needed in FetchItemQueue?

Pick queue in Round Robin
-------------------------

Key: NUTCH-1687
URL: https://issues.apache.org/jira/browse/NUTCH-1687
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
Fix For: 2.3, 1.8
Attachments: NUTCH-1687.patch

Currently we choose the queue to pick a URL from by starting at the head of the queue list, so queues at the start of the list have a higher chance of being picked first. That can cause a long-tail problem, where only a few queues, each holding many URLs, remain available at the end.

    public synchronized FetchItem getFetchItem() {
      final Iterator<Map.Entry<String, FetchItemQueue>> it =
          queues.entrySet().iterator(); // <== always resets to search from the start
      while (it.hasNext()) {

I think it is better to pick queues in round-robin order. That reduces the time needed to find an available queue, gives every queue its turn, and, if we use topN in the generator, leaves no long-tail queues at the end.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
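For readers who want to see the idea in isolation, the following is a minimal sketch of round-robin queue selection, assuming a fixed list of in-memory queues. It is not the attached patch and not Nutch's FetchItemQueues: `RoundRobinQueues`, `add`, `next`, and `cursor` are hypothetical names, and the sketch ignores the per-queue politeness delays that a real fetcher has to respect. It uses a single synchronized entry point, in line with the comment above that the surrounding synchronized methods already serialize access to the list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a set of fetch queues; not Nutch code. */
class RoundRobinQueues {
    private final List<Deque<String>> queues = new ArrayList<>();
    private int cursor = 0; // index of the queue to try first on the next call

    synchronized void add(List<String> urls) {
        queues.add(new ArrayDeque<>(urls));
    }

    /** Returns the next URL in round-robin order, or null when all queues are empty. */
    synchronized String next() {
        int n = queues.size();
        for (int i = 0; i < n; i++) {
            int idx = (cursor + i) % n;
            String url = queues.get(idx).poll();
            if (url != null) {
                cursor = (idx + 1) % n; // resume after the queue just served
                return url;
            }
        }
        return null;
    }
}

class RoundRobinDemo {
    public static void main(String[] args) {
        RoundRobinQueues q = new RoundRobinQueues();
        q.add(Arrays.asList("a1", "a2", "a3"));
        q.add(Arrays.asList("b1"));
        q.add(Arrays.asList("c1", "c2"));

        StringBuilder order = new StringBuilder();
        String url;
        while ((url = q.next()) != null) {
            order.append(url).append(' ');
        }
        // prints "a1 b1 c1 a2 c2 a3"
        System.out.println(order.toString().trim());
    }
}
```

Restarting the scan at index 0 on every call would instead drain the first queue before touching the others (a1 a2 a3 b1 c1 c2 in this toy setup, since nothing here delays a queue), which is the head-of-list bias the issue describes.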
Build failed in Jenkins: Nutch-trunk #2469
See https://builds.apache.org/job/Nutch-trunk/2469/

------------------------------------------
[...truncated 3407 lines...]

init:
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlnormalizer-host
    [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/urlnormalizer-host.jar

deps-test:

deploy:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

copy-generated-lib:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init:
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlnormalizer-pass
    [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar

deps-test:

deploy:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

copy-generated-lib:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init:
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/test
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlnormalizer-querystring
    [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/urlnormalizer-querystring.jar

deps-test:

deploy:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring

copy-generated-lib:
     [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data
     [copy] Copying 4 files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data

init:
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
    [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-regex

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlnormalizer-regex
    [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/urlnormalizer-regex.jar

deps-test:

init:
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859358#comment-13859358 ]

Tejas Patil commented on NUTCH-1687:
------------------------------------

Created a review request: https://reviews.apache.org/r/16535/

Pick queue in Round Robin
-------------------------

Key: NUTCH-1687
URL: https://issues.apache.org/jira/browse/NUTCH-1687
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
Fix For: 2.3, 1.8
Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364 ]

Tien Nguyen Manh commented on NUTCH-1687:
-----------------------------------------

It is nice!

Pick queue in Round Robin
-------------------------

Key: NUTCH-1687
URL: https://issues.apache.org/jira/browse/NUTCH-1687
Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch