[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
Sami Siren closed NUTCH-177.
----------------------------

> Default installation seems to produce working entity of nutch
> -------------------------------------------------------------
>
> Key: NUTCH-177
> URL: http://issues.apache.org/jira/browse/NUTCH-177
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.7.1
> Environment: Linux SUSE 9.3
> Reporter: Matthias Günter
> Priority: Minor
> Fix For: 0.8
>
> Attachments: crawl-urlfilter.txt, urllist.txt
>
>
> I downloaded 0.7.1 and installed it.
> Then I changed crawl-urlfilter.txt for apache.org.
> Then I added a urllist.txt and tried scanning.
> Apparently the URL was ignored, even though it matched the rule in crawl-urlfilter.txt.
>
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
> 060115 141534 No FS indicated, using default:local
> 060115 141534 crawl started in: crawl-20060115141534
> 060115 141534 rootUrlFile = ../../urllist.txt
> 060115 141534 threads = 10
> 060115 141534 depth = 5
> 060115 141535 Created webdb at LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141535 Starting URL processing
> 060115 141535 Plugins: looking in: /home/guenter/workspace/lucene/nutch-0.7.1/plugins
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
> 060115 141535 found resource crawl-urlfilter.txt at file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
> ..060115 141535 Added 0 pages
> 060115 141535 FetchListTool started
> 060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141535 Overall processing: Sorted NaN entries/second
> 060115 141535 FetchListTool completed
> 060115 141536 logging at INFO
> 060115 141537 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141537 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141537 Finishing update
> 060115 141537 Update finished
> 060115 141537 FetchListTool started
> 060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141537 Overall processing: Sorted NaN entries/second
> 060115 141537 FetchListTool completed
> 060115 141537 logging at INFO
> 060115 141538 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141538 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141538 Finishing update
> 060115 141538 Update finished
> 060115 141538 FetchListTool started
> 060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141538 Overall processing: Sorted NaN entries/second
> 060115 141538 FetchListTool completed
> 060115 141538 logging at INFO
> 060115 141539 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141539 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141539 Finishing update
> 060115 141539 Update finished
> 060115 141539 FetchListTool started
> 060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141540 Overall processing: Sorted NaN entries/second
> 060115 141540 FetchListTool completed
> 060115 141540 logging at INFO
> 060115 141541 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141541 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141541 Finishing update
> 060115 141541 Update finished
> 060115 141541 FetchListTool started
> 060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141541 Overall processing: Sorted NaN entries/second
> 060115 141541 FetchListTool completed
> 060115 141541 logging at INFO
> 060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Finishing update
> 060115 141542 Update finished
> 060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Sorting pages by url...
> 060115 141542 Getting updated scores and anchors from db...
> 060115 141542 Sorting updates by segment...
> 060115 141542 Updating segments...
> 060115 141542 Done updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542 * Opening segment 20060115141535
> 060115 141542 * Indexing segment 20060115141535
> 060115 141542 * Optimizing index...
> 060115 141542 * Moving index to NFS if needed...
> 060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141543 * Opening segment 20060115141537
> 060115 141543 * Indexing segment 20060115141537
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141543 * Opening segment 20060115141538
> 060115 141543 * Indexing segment 20060115141538
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141543 * Opening segment 20060115141539
> 060115 141543 * Indexing segment 20060115141539
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141543 * Opening segment 20060115141541
> 060115 141543 * Indexing segment 20060115141541
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 Reading url hashes...
> 060115 141543 Sorting url hashes...
> 060115 141543 Deleting url duplicates...
> 060115 141543 Deleted 0 url duplicates.
> 060115 141543 Reading content hashes...
> 060115 141543 Sorting content hashes...
> 060115 141543 Deleting content duplicates...
> 060115 141543 Deleted 0 content duplicates.
> 060115 141543 Duplicate deletion complete locally. Now returning to NFS...
> 060115 141543 DeleteDuplicates complete
> 060115 141543 Merging segment indexes...
> 060115 141543 crawl finished: crawl-20060115141534
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin>
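
For readers trying to reproduce an "Added 0 pages" result like the one in the log, it can help to check seed URLs against the filter rules by hand. The following is only a rough Python sketch of first-match-wins filtering in the "+regex" / "-regex" style used by crawl-urlfilter.txt. It is not Nutch's code and it does not diagnose this issue. The attached crawl-urlfilter.txt is not reproduced in this message, so the rules shown are assumptions, and the names parse_rules and accepts are hypothetical.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" style of crawl-urlfilter.txt.
# The real attachment is not shown in this message, so these are assumptions.
RULES = r"""
# skip URLs containing characters typical of queries
-[?*!@=]
# accept hosts under apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
"""

def parse_rules(text):
    """Turn rule text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none match."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://lucene.apache.org/nutch/",
                "http://apache.org",
                "http://www.example.com/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

With these assumed rules, a seed written without a trailing slash (http://apache.org) falls through to the final "-." rule and is rejected, which is one way a URL that looks like it matches can still be ignored. Whether that applies to the attached urllist.txt cannot be told from this message.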
