RE: Nutch Crawl a Specific List Of URLs (150K)

2013-12-30 Thread Markus Jelsma
Hi, 

You ran one crawl cycle. Depending on the generator and fetcher settings, you 
are not guaranteed to fetch 200,000 URLs with only topN specified. Check the 
logs: the generator will tell you if there are too many URLs for a host or 
domain. Also check all fetcher logs; they will tell you how many URLs were 
fetched and why the fetcher likely stopped when it did.
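
To illustrate why a single cycle can select far fewer URLs than were injected, here is a minimal sketch in Java. It is a toy model, not Nutch's Generator: the class GenerateSketch, the method select and the example host names are all made up for illustration. It assumes the generator takes at most topN URLs per cycle and, when a per-host limit is configured (in Nutch 1.x this is, as far as I know, the generate.max.count property), skips URLs beyond that limit for each host.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a generate step. Hypothetical code, NOT taken from Nutch. */
class GenerateSketch {

    /**
     * Selects at most topN URLs, taking at most maxPerHost from any one host.
     * A negative maxPerHost means "no per-host limit".
     */
    static List<String> select(List<String> urls, int topN, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            if (selected.size() >= topN) {
                break; // the overall budget for this cycle is used up
            }
            String host = URI.create(url).getHost();
            int seen = perHost.getOrDefault(host, 0);
            if (maxPerHost >= 0 && seen >= maxPerHost) {
                continue; // this host already has its share, skip the URL
            }
            perHost.put(host, seen + 1);
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        // 1000 URLs spread evenly over 10 hypothetical hosts
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            urls.add("http://host" + (i % 10) + ".example.com/page" + i);
        }
        System.out.println(select(urls, 20, -1).size());  // prints 20: topN is the limit
        System.out.println(select(urls, 1000, 5).size()); // prints 50: 10 hosts x 5 URLs
    }
}
```

Either limit alone is enough to leave most of the injected URLs unselected after one cycle, which is why the generator and fetcher logs are the place to look.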

Cheers

-Original message-
From: Bin Wang <binwang...@gmail.com>
Sent: Friday 27th December 2013 19:50
To: dev@nutch.apache.org
Subject: Nutch Crawl a Specific List Of URLs (150K)

Hi,

I have a very specific list of about 140K URLs.

I switched off `db.update.additions.allowed` so that no newly discovered URLs 
are added to the crawldb, and I was assuming I could feed all the URLs to 
Nutch and, after one round of fetching, it would finish and leave all the raw 
HTML files in the segment folder.

However, after I ran this command:

nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 

it ended up with only a small number of URLs:

TOTAL urls: 872
retry 0: 872
min score: 1.0
avg score: 1.0
max score: 1.0

I double-checked the log to make sure that every URL passes filtering and 
normalization. Here is the log:

2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.

I don't know how 140K URLs ended up being 872 in the end...

/usr/bin

--

AWS ubuntu instance

Nutch 1.7

java version 1.6.0_27

OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)

OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)




[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859275#comment-13859275
 ] 

Tejas Patil commented on NUTCH-1687:


This is a good point by [~tiennm]. Although this might not give a significant 
performance improvement, it would distribute requests fairly across all fetch 
queues.

Some comments on the patch:
1. Do you really need to make the methods of the CircularLinkedList class 
thread-safe? The methods in FetchItemQueues that interact with the 
CircularLinkedList (i.e. getFetchItemQueue and getFetchItem) are all 
synchronized, so it is ensured that only one thread accesses the list at a time.
2. Why is 'id' needed in FetchItemQueue?

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1687.patch


 Currently we choose the queue to pick a URL from by scanning from the start of 
 the queues list, so queues at the start of the list have more chance of being 
 picked first. That can cause a long-tail problem, in which only a few queues 
 holding many URLs remain available at the end.
 public synchronized FetchItem getFetchItem() {
   final Iterator<Map.Entry<String, FetchItemQueue>> it =
     queues.entrySet().iterator(); // <== always resets to search for a queue from the start
   while (it.hasNext()) {
 
 I think it is better to pick queues in round robin. That can reduce the time 
 needed to find an available queue, gives every queue a turn, and, if we use 
 topN during generation, avoids a long tail of queues at the end.
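
 The round-robin idea can be sketched as below. This is a simplified, hypothetical illustration, not the code from NUTCH-1687.patch: the class RoundRobinQueues, the field next and the method addQueue are made-up names, URLs are plain strings, and the real fetcher's per-queue availability checks (such as crawl delays) are left out. The only state added over the "always start from the first queue" approach is the index of the queue to try first on the next call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy round-robin picker. Hypothetical names, NOT taken from the patch. */
class RoundRobinQueues {
    private final List<Deque<String>> queues = new ArrayList<>();
    private int next = 0; // index of the queue to try first on the next call

    synchronized void addQueue(List<String> urls) {
        queues.add(new ArrayDeque<>(urls));
    }

    /**
     * Returns the next URL, resuming the scan just after the queue that was
     * served last time, or null when every queue is empty. The method is
     * synchronized, so the plain collections inside need no locking of their own.
     */
    synchronized String getFetchItem() {
        for (int i = 0; i < queues.size(); i++) {
            int idx = (next + i) % queues.size();
            String url = queues.get(idx).poll();
            if (url != null) {
                next = (idx + 1) % queues.size(); // the following queue goes first next time
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RoundRobinQueues q = new RoundRobinQueues();
        q.addQueue(Arrays.asList("a1", "a2", "a3"));
        q.addQueue(Arrays.asList("b1"));
        q.addQueue(Arrays.asList("c1", "c2"));
        StringBuilder order = new StringBuilder();
        for (String url = q.getFetchItem(); url != null; url = q.getFetchItem()) {
            order.append(url).append(' ');
        }
        System.out.println(order.toString().trim()); // prints: a1 b1 c1 a2 c2 a3
    }
}
```

 With the scan restarted from the first queue on every call, the same input would be served as a1 a2 a3 b1 c1 c2, so the first queue is drained before the others get a turn.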



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Build failed in Jenkins: Nutch-trunk #2469

2013-12-30 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2469/

--
[...truncated 3407 lines...]
init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/urlnormalizer-host.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/test
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/urlnormalizer-querystring.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data
 [copy] Copying 4 files to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-regex

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/urlnormalizer-regex.jar

deps-test:

init:


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859358#comment-13859358
 ] 

Tejas Patil commented on NUTCH-1687:


Created a review request: https://reviews.apache.org/r/16535/

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch




[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364
 ] 

Tien Nguyen Manh commented on NUTCH-1687:
-

It is nice!
