Hi,

I had a similar situation when I tried to use a start URL that issues a 302 redirect.

Have you checked what comes across the wire, a la:

    telnet www.targetserver.com 80
    # it returns some host info
    Trying 127.0.1
    Connected to w1.targetserver.com.
    Escape character is '^]'.
    # you type
    HEAD url HTTP/1.1
    host: w1.targetserver.com

    # it should return something
    HTTP/1.1 200 OK
    Date: Tue, 20 Jun 2006 19:51:51 GMT
    Server: Apache/2.0.55 (Unix) mod_ssl/2.0.55 OpenSSL/0.9.7g DAV/2 PHP/4.4.1
    X-Powered-By: PHP/4.4.1
    Set-Cookie: random=5; expires=Tuesday, 20-Jun-06 20:52:13 GMT
    Content-Type: text/html

    Connection closed by foreign host.

For us, the HTTP protocol handler was not handling the 302 redirect correctly. Our target server reported in the response headers:

    location: some-url

but Nutch was doing a case-sensitive match for:

    Location: url

If this is not the case, look at conf/regex-urlfilter.txt. You are using a regex URL filter, and you may not have set up the patterns correctly. (Your log shows the crawl tool loading conf/crawl-urlfilter.txt, so check that file as well.)

Yuzo

--On Tuesday, June 20, 2006 8:35 AM -0700 nasm <[EMAIL PROTECTED]> wrote:

>
> maybe you are right, because when run the commad
>
> bin/nutch crawl urls -dir crawl.globalforte.com -depth >& forte.log
>
> it gives the following messages
>
> 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/nutch-default.xml > 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/crawl-tool.xml > 060620 182944 parsing file:/home/nasm/nutch-0.7.2/conf/nutch-site.xml > 060620 182944 No FS indicated, using default:local > 060620 182944 crawl started in: crawl.globalforte > 060620 182944 rootUrlFile = urls > 060620 182944 threads = 10 > 060620 182944 depth = 10 > 060620 182945 Created webdb at > LocalFS,/home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182945 Starting URL processing > 060620 182945 Plugins: looking in: /home/nasm/nutch-0.7.2/plugins > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/query-more > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/query-site/plugin.xml 060620 182945 impl: > point=org.apache.nutch.searcher.QueryFilter > class=org.apache.nutch.searcher.site.SiteQueryFilter > 
060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/parse-html/plugin.xml 060620 182945 impl: > point=org.apache.nutch.parse.Parser > class=org.apache.nutch.parse.html.HtmlParser > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/parse-text/plugin.xml 060620 182945 impl: > point=org.apache.nutch.parse.Parser > class=org.apache.nutch.parse.text.TextParser > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-ext > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-pdf > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-rss > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/query-basic/plugin.xml 060620 182945 impl: > point=org.apache.nutch.searcher.QueryFilter > class=org.apache.nutch.searcher.basic.BasicQueryFilter > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/index-more > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-js > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/urlfilter-regex/plugin.xml > 060620 182945 impl: point=org.apache.nutch.net.URLFilter > class=org.apache.nutch.net.RegexURLFilter > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/protocol-ftp > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/parse-msword > 060620 182945 not including: > /home/nasm/nutch-0.7.2/plugins/creativecommons 060620 182945 not > including: /home/nasm/nutch-0.7.2/plugins/ontology 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/nutch-extensionpoints/plugin.xml > 060620 182945 not including: /home/nasm/nutch-0.7.2/plugins/protocol-file > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/protocol-http/plugin.xml > 060620 182945 impl: point=org.apache.nutch.protocol.Protocol > class=org.apache.nutch.protocol.http.Http > 060620 182945 not including: > /home/nasm/nutch-0.7.2/plugins/clustering-carrot2 > 060620 182945 not including: > /home/nasm/nutch-0.7.2/plugins/language-identifier > 060620 182945 not including: > 
/home/nasm/nutch-0.7.2/plugins/urlfilter-prefix 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/query-url/plugin.xml 060620 182945 impl: > point=org.apache.nutch.searcher.QueryFilter > class=org.apache.nutch.searcher.url.URLQueryFilter > 060620 182945 parsing: > /home/nasm/nutch-0.7.2/plugins/index-basic/plugin.xml 060620 182945 impl: > point=org.apache.nutch.indexer.IndexingFilter > class=org.apache.nutch.indexer.basic.BasicIndexingFilter > 060620 182945 not including: > /home/nasm/nutch-0.7.2/plugins/protocol-httpclient > 060620 182945 found resource crawl-urlfilter.txt at > file:/home/nasm/nutch-0.7.2/conf/crawl-urlfilter.txt > .060620 182945 Added 0 pages > 060620 182945 FetchListTool started > 060620 182945 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182945 Overall processing: Sorted NaN entries/second > 060620 182945 FetchListTool completed > 060620 182945 logging at INFO > 060620 182946 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182946 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945 > 060620 182946 Finishing update > 060620 182946 Update finished > 060620 182946 FetchListTool started > 060620 182946 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182946 Overall processing: Sorted NaN entries/second > 060620 182946 FetchListTool completed > 060620 182946 logging at INFO > 060620 182947 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182947 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946 > 060620 182947 Finishing update > 060620 182947 Update finished > 060620 182947 FetchListTool started > 060620 182948 Overall processing: Sorted 0 entries in 0.0 seconds. 
> 060620 182948 Overall processing: Sorted NaN entries/second > 060620 182948 FetchListTool completed > 060620 182948 logging at INFO > 060620 182949 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182949 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947 > 060620 182949 Finishing update > 060620 182949 Update finished > 060620 182949 FetchListTool started > 060620 182949 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182949 Overall processing: Sorted NaN entries/second > 060620 182949 FetchListTool completed > 060620 182949 logging at INFO > 060620 182950 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182950 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949 > 060620 182950 Finishing update > 060620 182950 Update finished > 060620 182950 FetchListTool started > 060620 182950 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182950 Overall processing: Sorted NaN entries/second > 060620 182950 FetchListTool completed > 060620 182950 logging at INFO > 060620 182951 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182951 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950 > 060620 182951 Finishing update > 060620 182951 Update finished > 060620 182951 FetchListTool started > 060620 182951 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182951 Overall processing: Sorted NaN entries/second > 060620 182951 FetchListTool completed > 060620 182951 logging at INFO > 060620 182952 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182952 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951 > 060620 182952 Finishing update > 060620 182952 Update finished > 060620 182952 FetchListTool started > 060620 182953 Overall processing: Sorted 0 entries in 0.0 seconds. 
> 060620 182953 Overall processing: Sorted NaN entries/second > 060620 182953 FetchListTool completed > 060620 182953 logging at INFO > 060620 182954 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182954 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952 > 060620 182954 Finishing update > 060620 182954 Update finished > 060620 182954 FetchListTool started > 060620 182954 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182954 Overall processing: Sorted NaN entries/second > 060620 182954 FetchListTool completed > 060620 182954 logging at INFO > 060620 182955 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182955 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954 > 060620 182955 Finishing update > 060620 182955 Update finished > 060620 182955 FetchListTool started > 060620 182955 Overall processing: Sorted 0 entries in 0.0 seconds. > 060620 182955 Overall processing: Sorted NaN entries/second > 060620 182955 FetchListTool completed > 060620 182955 logging at INFO > 060620 182956 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182956 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955 > 060620 182956 Finishing update > 060620 182956 Update finished > 060620 182956 FetchListTool started > 060620 182956 Overall processing: Sorted 0 entries in 0.0 seconds. 
> 060620 182956 Overall processing: Sorted NaN entries/second > 060620 182956 FetchListTool completed > 060620 182956 logging at INFO > 060620 182957 Updating /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182957 Updating for > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956 > 060620 182957 Finishing update > 060620 182957 Update finished > 060620 182957 Updating > /home/nasm/nutch-0.7.2/crawl.globalforte/segments from > /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955 > 060620 182957 reading > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956 > 060620 182957 Sorting pages by url... > 060620 182957 Getting updated scores and anchors from db... > 060620 182957 Sorting updates by segment... > 060620 182957 Updating segments... > 060620 182957 Done updating > /home/nasm/nutch-0.7.2/crawl.globalforte/segments from > /home/nasm/nutch-0.7.2/crawl.globalforte/db > 060620 182957 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182945 > 060620 182958 * Opening segment 20060620182945 > 060620 182958 * Indexing segment 20060620182945 > 060620 182958 * Optimizing index... 
> 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182945: total 0 records in > 0.076 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182946 > 060620 182958 * Opening segment 20060620182946 > 060620 182958 * Indexing segment 20060620182946 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182946: total 0 records in > 0.0030 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182947 > 060620 182958 * Opening segment 20060620182947 > 060620 182958 * Indexing segment 20060620182947 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182947: total 0 records in > 0.0060 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182949 > 060620 182958 * Opening segment 20060620182949 > 060620 182958 * Indexing segment 20060620182949 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182949: total 0 records in > 0.0040 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182950 > 060620 182958 * Opening segment 20060620182950 > 060620 182958 * Indexing segment 20060620182950 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182950: total 0 records in > 0.0060 s (NaN rec/s). 
> 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182951 > 060620 182958 * Opening segment 20060620182951 > 060620 182958 * Indexing segment 20060620182951 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182951: total 0 records in > 0.0060 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182952 > 060620 182958 * Opening segment 20060620182952 > 060620 182958 * Indexing segment 20060620182952 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182952: total 0 records in > 0.0050 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182954 > 060620 182958 * Opening segment 20060620182954 > 060620 182958 * Indexing segment 20060620182954 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182954: total 0 records in > 0.035 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182955 > 060620 182958 * Opening segment 20060620182955 > 060620 182958 * Indexing segment 20060620182955 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... > 060620 182958 DONE indexing segment 20060620182955: total 0 records in > 0.0030 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 indexing segment: > /home/nasm/nutch-0.7.2/crawl.globalforte/segments/20060620182956 > 060620 182958 * Opening segment 20060620182956 > 060620 182958 * Indexing segment 20060620182956 > 060620 182958 * Optimizing index... > 060620 182958 * Moving index to NFS if needed... 
> 060620 182958 DONE indexing segment 20060620182956: total 0 records in > 0.0040 s (NaN rec/s). > 060620 182958 done indexing > 060620 182958 Reading url hashes... > 060620 182958 Sorting url hashes... > 060620 182958 Deleting url duplicates... > 060620 182958 Deleted 0 url duplicates. > 060620 182958 Reading content hashes... > 060620 182958 Sorting content hashes... > 060620 182958 Deleting content duplicates... > 060620 182958 Deleted 0 content duplicates. > 060620 182958 Duplicate deletion complete locally. Now returning to > NFS... 060620 182958 DeleteDuplicates complete > 060620 182958 Merging segment indexes... > 060620 182958 crawl finished: crawl.globalforte > > > -- > View this message in context: > http://www.nabble.com/nutch-0.7.2-does-not-work-t1817625.html#a4957542 > Sent from the Nutch - User forum at Nabble.com.
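
P.S. In case it helps, here is a rough Python sketch of the two checks from my reply at the top: looking up a redirect header without caring about case, and seeing whether a start URL survives a list of +/- regex rules. This is only an illustration, not Nutch's code. The function names, the sample response, and the sample rules are all made up for the example, and I am assuming the filter takes the first matching rule and drops URLs that match nothing.

```python
import re


def find_header(raw_response, name):
    """Return the value of the first header named `name`, ignoring case."""
    # Skip the status line; stop at the blank line that ends the headers.
    for line in raw_response.splitlines()[1:]:
        if not line.strip():
            break
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == name.lower():
            return value.strip()
    return None


def url_allowed(rules, url):
    """Apply +/- regex rules in order; the first rule that matches decides."""
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = rule[0], rule[1:]
        if sign in "+-" and re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped


# Hypothetical response, shaped like the lowercase header we ran into.
response = (
    "HTTP/1.1 302 Found\r\n"
    "location: http://w1.targetserver.com/some-url\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)
print(find_header(response, "Location"))

# Hypothetical rules: skip images, accept anything under targetserver.com.
rules = [
    r"-\.(gif|jpg|png)$",
    r"+^http://([a-z0-9]*\.)*targetserver\.com/",
]
print(url_allowed(rules, "http://www.targetserver.com/index.html"))
print(url_allowed(rules, "http://www.example.org/"))
```

If your start URL comes back rejected by a check like this against your own patterns, that would explain the "Added 0 pages" line in your log.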
