[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
     
Doug Cutting resolved NUTCH-177:
--------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

The problem is that your seed URL does not end in a slash, but your URL filter 
requires one.  In 0.8-dev (aka trunk) this is fixed: URLs are normalized 
before filtering, and normalization adds a slash after the hostname.
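
To illustrate the mismatch, here is a minimal sketch in Python (not Nutch's 
own code).  It assumes the filter rule follows the stock crawl-urlfilter.txt 
template with apache.org filled in, and it uses a hypothetical seed URL, since 
the attached files are not reproduced in this message.  The normalize() 
helper is a simplified stand-in that only adds the missing slash.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed accept rule, modeled on the stock crawl-urlfilter.txt template.
# The reporter's actual attachment is not shown, so this is a guess.
ACCEPT_RULE = re.compile(r"^http://([a-z0-9]*\.)*apache\.org/")


def normalize(url: str) -> str:
    """Simplified stand-in for URL normalization: add '/' if the path is empty."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def accepted(url: str) -> bool:
    """Return True if the URL matches the accept rule."""
    return ACCEPT_RULE.search(url) is not None


seed = "http://www.apache.org"  # hypothetical seed without a trailing slash

print(accepted(seed))             # False: the rule requires a slash after the host
print(accepted(normalize(seed)))  # True: normalization appended the slash
```

Under these assumptions, filtering the raw seed rejects it, which would 
account for the "Added 0 pages" line in the log below.  On 0.7.1 the likely 
workaround is to write the seed with a trailing slash in urllist.txt.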

> Default installation seems to produce working entity of nutch
> -------------------------------------------------------------
>
>          Key: NUTCH-177
>          URL: http://issues.apache.org/jira/browse/NUTCH-177
>      Project: Nutch
>         Type: Bug
>     Versions: 0.7.1
>  Environment: Linux SUSE 9.3
>     Reporter: Matthias Günter
>     Priority: Minor
>      Fix For: 0.8-dev
>  Attachments: crawl-urlfilter.txt, urllist.txt
>
> I downloaded 0.7.1 and installed it.
> Then I changed crawl-urlfilter.txt for apache.org.
> Then I added a urllist.txt and tried scanning.
> Apparently the URL was ignored, even though it matched the rule in 
> crawl-urlfilter.txt.
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl 
> ../../urllist.txt
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
> 060115 141534 No FS indicated, using default:local
> 060115 141534 crawl started in: crawl-20060115141534
> 060115 141534 rootUrlFile = ../../urllist.txt
> 060115 141534 threads = 10
> 060115 141534 depth = 5
> 060115 141535 Created webdb at 
> LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141535 Starting URL processing
> 060115 141535 Plugins: looking in: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser 
> class=org.apache.nutch.parse.html.HtmlParser
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser 
> class=org.apache.nutch.parse.text.TextParser
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.net.URLFilter 
> class=org.apache.nutch.net.RegexURLFilter
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.protocol.Protocol 
> class=org.apache.nutch.protocol.http.Http
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter 
> class=org.apache.nutch.indexer.basic.BasicIndexingFilter
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
> 060115 141535 found resource crawl-urlfilter.txt at 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
> ..060115 141535 Added 0 pages
> 060115 141535 FetchListTool started
> 060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141535 Overall processing: Sorted NaN entries/second
> 060115 141535 FetchListTool completed
> 060115 141536 logging at INFO
> 060115 141537 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141537 Updating for 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141537 Finishing update
> 060115 141537 Update finished
> 060115 141537 FetchListTool started
> 060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141537 Overall processing: Sorted NaN entries/second
> 060115 141537 FetchListTool completed
> 060115 141537 logging at INFO
> 060115 141538 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141538 Updating for 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141538 Finishing update
> 060115 141538 Update finished
> 060115 141538 FetchListTool started
> 060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141538 Overall processing: Sorted NaN entries/second
> 060115 141538 FetchListTool completed
> 060115 141538 logging at INFO
> 060115 141539 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141539 Updating for 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141539 Finishing update
> 060115 141539 Update finished
> 060115 141539 FetchListTool started
> 060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141540 Overall processing: Sorted NaN entries/second
> 060115 141540 FetchListTool completed
> 060115 141540 logging at INFO
> 060115 141541 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141541 Updating for 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141541 Finishing update
> 060115 141541 Update finished
> 060115 141541 FetchListTool started
> 060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141541 Overall processing: Sorted NaN entries/second
> 060115 141541 FetchListTool completed
> 060115 141541 logging at INFO
> 060115 141542 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 Updating for 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Finishing update
> 060115 141542 Update finished
> 060115 141542 Updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments 
> from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542  reading 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542  reading 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141542  reading 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141542  reading 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141542  reading 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Sorting pages by url...
> 060115 141542 Getting updated scores and anchors from db...
> 060115 141542 Sorting updates by segment...
> 060115 141542 Updating segments...
> 060115 141542 Done updating 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments 
> from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 indexing segment: 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542 * Opening segment 20060115141535
> 060115 141542 * Indexing segment 20060115141535
> 060115 141542 * Optimizing index...
> 060115 141542 * Moving index to NFS if needed...
> 060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 
> s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141543 * Opening segment 20060115141537
> 060115 141543 * Indexing segment 20060115141537
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 
> s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141543 * Opening segment 20060115141538
> 060115 141543 * Indexing segment 20060115141538
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 
> s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141543 * Opening segment 20060115141539
> 060115 141543 * Indexing segment 20060115141539
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 
> s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: 
> /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141543 * Opening segment 20060115141541
> 060115 141543 * Indexing segment 20060115141541
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s 
> (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 Reading url hashes...
> 060115 141543 Sorting url hashes...
> 060115 141543 Deleting url duplicates...
> 060115 141543 Deleted 0 url duplicates.
> 060115 141543 Reading content hashes...
> 060115 141543 Sorting content hashes...
> 060115 141543 Deleting content duplicates...
> 060115 141543 Deleted 0 content duplicates.
> 060115 141543 Duplicate deletion complete locally.  Now returning to NFS...
> 060115 141543 DeleteDuplicates complete
> 060115 141543 Merging segment indexes...
> 060115 141543 crawl finished: crawl-20060115141534
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin>   

