Hi,

I ran into a similar situation when I used a start URL that answers with a
302 redirect.

Have you checked what actually comes across the wire? For example:

telnet www.targetserver.com 80
#it returns some host info
Trying 127.0.1
Connected to w1.targetserver.com.
Escape character is '^]'.
# you type
HEAD url HTTP/1.1
host: w1.targetserver.com
#it should return something.
HTTP/1.1 200 OK
Date: Tue, 20 Jun 2006 19:51:51 GMT
Server: Apache/2.0.55 (Unix) mod_ssl/2.0.55 OpenSSL/0.9.7g DAV/2 PHP/4.4.1
X-Powered-By: PHP/4.4.1
Set-Cookie: random=5; expires=Tuesday, 20-Jun-06 20:52:13 GMT
Content-Type: text/html

Connection closed by foreign host.
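
The same check can be scripted. Below is a minimal Python sketch, assuming
only the standard-library http.client module. The host and path are
placeholders, and head() and format_response() are names made up for this
example. http.client does not follow redirects, so a 302 and its headers are
returned as received rather than being followed.

```python
import http.client


def head(host, path="/", port=80, timeout=10):
    """Send one HEAD request and return (status, reason, headers).

    http.client never follows redirects, so a 302 is returned as-is.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        # getheaders() returns (name, value) pairs with the server's spelling.
        return resp.status, resp.reason, resp.getheaders()
    finally:
        conn.close()


def format_response(status, reason, headers):
    """Render the status line and headers roughly as telnet would show them."""
    lines = ["HTTP %d %s" % (status, reason)]
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    return "\n".join(lines)


# Usage (placeholder host, needs network access):
#   print(format_response(*head("www.targetserver.com", "/")))
```

If the status is 302, check how the redirect header is spelled in the output.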

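In our case, the HTTP protocol handler did not handle the 302 redirect
correctly.

Our target server sent the redirect header in lowercase:
location: some-url
But Nutch was doing a case-sensitive lookup for:
Location: url

HTTP header names are case-insensitive, so both spellings are valid and the
lookup should accept either.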
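
To illustrate the difference, here is a small Python sketch. It is not
Nutch's actual code, and get_header is a hypothetical helper. An exact-match
lookup misses the lowercase header, while a case-insensitive one finds it.

```python
def get_header(headers, name):
    """Return the first value whose header name matches, ignoring case."""
    wanted = name.lower()
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return None


# Headers as our server sent them (the value is a placeholder).
headers = [("location", "some-url"), ("Content-Type", "text/html")]

exact = dict(headers).get("Location")      # case-sensitive: finds nothing
relaxed = get_header(headers, "Location")  # case-insensitive: finds it

print(exact)    # None
print(relaxed)  # some-url
```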


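If this is not the case, look at your URL filter file. That is normally
conf/regex-urlfilter.txt, but your log shows the crawl tool loading
conf/crawl-urlfilter.txt.

The regex URL filter plugin is enabled in your setup, and "Added 0 pages"
suggests the patterns may not be set up correctly and are rejecting your
start URL.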
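
One way to sanity-check the patterns outside Nutch is a rough Python sketch
like the one below. It only approximates the filter: it assumes the first
matching rule decides and that a URL matching no rule is rejected, and
Python's re module is not identical to the regex engine Nutch uses. The
rules and the example.com URLs are made up for illustration.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" format of the filter file.
SAMPLE_RULES = r"""
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept hosts in example.com
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found anywhere in the URL decides."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is dropped


rules = parse_rules(SAMPLE_RULES)
for url in ("http://www.example.com/", "http://www.example.com",
            "http://other.org/"):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

With these sample rules, http://www.example.com (no trailing slash) is
rejected because the accept pattern ends in "/". A start URL written that way
would be one possible reason for a crawl that adds 0 pages.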

Yuzo


--On Tuesday, June 20, 2006 8:35 AM -0700 nasm <[EMAIL PROTECTED]> wrote:

>
> Maybe you are right, because when I run the command
>
> bin/nutch crawl urls -dir crawl.globalforte.com -depth >& forte.log
>
> it gives the following messages
>
> 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/nutch-default.xml
> 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/crawl-tool.xml
> 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/nutch-site.xml
> 060620 182944 No FS indicated, using default:local
> 060620 182944 crawl started in: crawl.globalforte
> 060620 182944 rootUrlFile = urls
> 060620 182944 threads = 10
> 060620 182944 depth = 10
> 060620 182945 Created webdb at
> LocalFS,/home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182945 Starting URL processing
> 060620 182945 Plugins: looking in: /home/nasm/nutch-0.7.2/plugins
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/query-more
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/query-site/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/parse-html/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.html.HtmlParser
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/parse-text/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.text.TextParser
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-ext
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-pdf
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-rss
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/query-basic/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/index-more
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-js
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/urlfilter-regex/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.net.URLFilter
> class=org.apache.nutch.net.RegexURLFilter
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/protocol-ftp
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-msword
> 060620 182945 not including:
> /home/nasm/nutch-0.7.2/plugins/creativecommons
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/ontology
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/nutch-extensionpoints/plugin.xml
> 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/protocol-file
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/protocol-http/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.http.Http
> 060620 182945 not including:
> /home/nasm/nutch-0.7.2/plugins/clustering-carrot2
> 060620 182945 not including:
> /home/nasm/nutch-0.7.2/plugins/language-identifier
> 060620 182945 not including:
> /home/nasm/nutch-0.7.2/plugins/urlfilter-prefix
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/query-url/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060620 182945 parsing:
> /home/nasm/nutch-0.7.2/plugins/index-basic/plugin.xml
> 060620 182945 impl: point=org.apache.nutch.indexer.IndexingFilter
> class=org.apache.nutch.indexer.basic.BasicIndexingFilter
> 060620 182945 not including:
> /home/nasm/nutch-0.7.2/plugins/protocol-httpclient
> 060620 182945 found resource crawl-urlfilter.txt at
> file:/home/nasm/nutch-0.7.2/conf/crawl-urlfilter.txt
> .060620 182945 Added 0 pages
> 060620 182945 FetchListTool started
> 060620 182945 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182945 Overall processing: Sorted NaN entries/second
> 060620 182945 FetchListTool completed
> 060620 182945 logging at INFO
> 060620 182946 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182946 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945
> 060620 182946 Finishing update
> 060620 182946 Update finished
> 060620 182946 FetchListTool started
> 060620 182946 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182946 Overall processing: Sorted NaN entries/second
> 060620 182946 FetchListTool completed
> 060620 182946 logging at INFO
> 060620 182947 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182947 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946
> 060620 182947 Finishing update
> 060620 182947 Update finished
> 060620 182947 FetchListTool started
> 060620 182948 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182948 Overall processing: Sorted NaN entries/second
> 060620 182948 FetchListTool completed
> 060620 182948 logging at INFO
> 060620 182949 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182949 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947
> 060620 182949 Finishing update
> 060620 182949 Update finished
> 060620 182949 FetchListTool started
> 060620 182949 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182949 Overall processing: Sorted NaN entries/second
> 060620 182949 FetchListTool completed
> 060620 182949 logging at INFO
> 060620 182950 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182950 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949
> 060620 182950 Finishing update
> 060620 182950 Update finished
> 060620 182950 FetchListTool started
> 060620 182950 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182950 Overall processing: Sorted NaN entries/second
> 060620 182950 FetchListTool completed
> 060620 182950 logging at INFO
> 060620 182951 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182951 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950
> 060620 182951 Finishing update
> 060620 182951 Update finished
> 060620 182951 FetchListTool started
> 060620 182951 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182951 Overall processing: Sorted NaN entries/second
> 060620 182951 FetchListTool completed
> 060620 182951 logging at INFO
> 060620 182952 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182952 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951
> 060620 182952 Finishing update
> 060620 182952 Update finished
> 060620 182952 FetchListTool started
> 060620 182953 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182953 Overall processing: Sorted NaN entries/second
> 060620 182953 FetchListTool completed
> 060620 182953 logging at INFO
> 060620 182954 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182954 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952
> 060620 182954 Finishing update
> 060620 182954 Update finished
> 060620 182954 FetchListTool started
> 060620 182954 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182954 Overall processing: Sorted NaN entries/second
> 060620 182954 FetchListTool completed
> 060620 182954 logging at INFO
> 060620 182955 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182955 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954
> 060620 182955 Finishing update
> 060620 182955 Update finished
> 060620 182955 FetchListTool started
> 060620 182955 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182955 Overall processing: Sorted NaN entries/second
> 060620 182955 FetchListTool completed
> 060620 182955 logging at INFO
> 060620 182956 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182956 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955
> 060620 182956 Finishing update
> 060620 182956 Update finished
> 060620 182956 FetchListTool started
> 060620 182956 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060620 182956 Overall processing: Sorted NaN entries/second
> 060620 182956 FetchListTool completed
> 060620 182956 logging at INFO
> 060620 182957 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182957 Updating for
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956
> 060620 182957 Finishing update
> 060620 182957 Update finished
> 060620 182957 Updating
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments from
> /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955
> 060620 182957  reading
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956
> 060620 182957 Sorting pages by url...
> 060620 182957 Getting updated scores and anchors from db...
> 060620 182957 Sorting updates by segment...
> 060620 182957 Updating segments...
> 060620 182957 Done updating
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments from
> /home/nasm/nutch-0.7.2/crawl.globalforte/db
> 060620 182957 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945
> 060620 182958 * Opening segment 20060620182945
> 060620 182958 * Indexing segment 20060620182945
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182945: total 0 records in
> 0.076 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946
> 060620 182958 * Opening segment 20060620182946
> 060620 182958 * Indexing segment 20060620182946
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182946: total 0 records in
> 0.0030 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947
> 060620 182958 * Opening segment 20060620182947
> 060620 182958 * Indexing segment 20060620182947
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182947: total 0 records in
> 0.0060 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949
> 060620 182958 * Opening segment 20060620182949
> 060620 182958 * Indexing segment 20060620182949
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182949: total 0 records in
> 0.0040 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950
> 060620 182958 * Opening segment 20060620182950
> 060620 182958 * Indexing segment 20060620182950
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182950: total 0 records in
> 0.0060 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951
> 060620 182958 * Opening segment 20060620182951
> 060620 182958 * Indexing segment 20060620182951
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182951: total 0 records in
> 0.0060 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952
> 060620 182958 * Opening segment 20060620182952
> 060620 182958 * Indexing segment 20060620182952
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182952: total 0 records in
> 0.0050 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954
> 060620 182958 * Opening segment 20060620182954
> 060620 182958 * Indexing segment 20060620182954
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182954: total 0 records in
> 0.035 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955
> 060620 182958 * Opening segment 20060620182955
> 060620 182958 * Indexing segment 20060620182955
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182955: total 0 records in
> 0.0030 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 indexing segment:
> /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956
> 060620 182958 * Opening segment 20060620182956
> 060620 182958 * Indexing segment 20060620182956
> 060620 182958 * Optimizing index...
> 060620 182958 * Moving index to NFS if needed...
> 060620 182958 DONE indexing segment 20060620182956: total 0 records in
> 0.0040 s (NaN rec/s).
> 060620 182958 done indexing
> 060620 182958 Reading url hashes...
> 060620 182958 Sorting url hashes...
> 060620 182958 Deleting url duplicates...
> 060620 182958 Deleted 0 url duplicates.
> 060620 182958 Reading content hashes...
> 060620 182958 Sorting content hashes...
> 060620 182958 Deleting content duplicates...
> 060620 182958 Deleted 0 content duplicates.
> 060620 182958 Duplicate deletion complete locally.  Now returning to
> NFS...
> 060620 182958 DeleteDuplicates complete
> 060620 182958 Merging segment indexes...
> 060620 182958 crawl finished: crawl.globalforte
>
>
> --
> View this message in context:
> http://www.nabble.com/nutch-0.7.2-does-not-work-t1817625.html#a4957542
> Sent from the Nutch - User forum at Nabble.com.






_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general