I'm having trouble figuring out why I keep getting "Added 0 pages" when running the crawl with Nutch. I've searched the site and can't find an answer as to what might be going wrong. I'm running this on Windows using Eclipse because I may have to change the code slightly. I've already made a few modifications so that the path of the config files is specified explicitly, but I don't think that is related to this issue. Any help is greatly appreciated!
crawl-root-urls.txt:

http://www.apache.com

crawl-urlfilter.txt:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.com/

# skip everything else
-.

Log:

060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\WEB-INF\conf\nutch-default.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\WEB-INF\conf\crawl-tool.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\WEB-INF\conf\nutch-site.xml
060713 150946 No FS indicated, using default:local
060713 150946 crawl started in: crawl-20060713150946
060713 150946 rootUrlFile = crawl-root-urls.txt
060713 150946 threads = 10
060713 150946 depth = 5
060713 150947 Created webdb at LocalFS,C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150947 Starting URL processing
060713 150947 Plugins: looking in: C:\Nutch\WEB-INF\plugins
060713 150947 not including: C:\Nutch\WEB-INF\plugins\clustering-carrot2
060713 150947 not including: C:\Nutch\WEB-INF\plugins\creativecommons
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\index-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.indexer.IndexingFilter class= org.apache.nutch.indexer.basic.BasicIndexingFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\index-more
060713 150947 not including: C:\Nutch\WEB-INF\plugins\language-identifier
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\nutch-extensionpoints\plugin.xml
060713 150947 not including: C:\Nutch\WEB-INF\plugins\ontology
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-ext
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-html\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class= org.apache.nutch.parse.html.HtmlParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-js
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-msword
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-pdf
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-rss
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-text\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class= org.apache.nutch.parse.text.TextParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-file
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-ftp
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\protocol-http\plugin.xml
060713 150947 impl: point=org.apache.nutch.protocol.Protocol class= org.apache.nutch.protocol.http.Http
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-httpclient
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.basic.BasicQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\query-more
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-site\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.site.SiteQueryFilter
.060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-url\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class= org.apache.nutch.searcher.url.URLQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\urlfilter-prefix
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\urlfilter-regex\plugin.xml
060713 150947 impl: point=org.apache.nutch.net.URLFilter class= org.apache.nutch.net.RegexURLFilter
060713 150947 found resource crawl-urlfilter.txt at file:/C:/Documents%20and%20Settings/jschorzman/My%20Documents/My%20Workspace/Nutch/WEB-INF/conf/crawl-urlfilter.txt
060713 150947 Added 0 pages
060713 150947 FetchListTool started
060713 150947 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150947 Overall processing: Sorted NaN entries/second
060713 150947 FetchListTool completed
060713 150947 logging at INFO
060713 150948 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150948 Updating for C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150948 Finishing update
060713 150948 Update finished
060713 150948 FetchListTool started
060713 150948 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150948 Overall processing: Sorted NaN entries/second
060713 150949 FetchListTool completed
060713 150949 logging at INFO
060713 150950 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150950 Updating for C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150950 Finishing update
060713 150950 Update finished
060713 150950 FetchListTool started
060713 150950 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150950 Overall processing: Sorted NaN entries/second
060713 150950 FetchListTool completed
060713 150950 logging at INFO
060713 150951 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150951 Updating for C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150951 Finishing update
060713 150951 Update finished
060713 150951 FetchListTool started
060713 150951 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150951 Overall processing: Sorted NaN entries/second
060713 150951 FetchListTool completed
060713 150951 logging at INFO
060713 150952 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150952 Updating for C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150952 Finishing update
060713 150952 Update finished
060713 150952 FetchListTool started
060713 150953 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150953 Overall processing: Sorted NaN entries/second
060713 150953 FetchListTool completed
060713 150953 logging at INFO
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954 Updating for C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Finishing update
060713 150954 Update finished
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments from C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954 reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954 reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954 reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954 reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954 reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Sorting pages by url...
060713 150954 Getting updated scores and anchors from db...
060713 150954 Sorting updates by segment...
060713 150954 Updating segments...
060713 150954 Done updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments from C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954 * Opening segment 20060713150947
060713 150954 * Indexing segment 20060713150947
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150947: total 0 records in 0.02s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954 * Opening segment 20060713150948
060713 150954 * Indexing segment 20060713150948
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150948: total 0 records in 0.021s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954 * Opening segment 20060713150950
060713 150954 * Indexing segment 20060713150950
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150950: total 0 records in 0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954 * Opening segment 20060713150951
060713 150954 * Indexing segment 20060713150951
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150951: total 0 records in 0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 * Opening segment 20060713150952
060713 150954 * Indexing segment 20060713150952
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150952: total 0 records in 0.06s (NaN rec/s).
060713 150954 done indexing
060713 150954 Reading url hashes...
060713 150954 Sorting url hashes...
060713 150954 Deleting url duplicates...
060713 150954 Deleted 0 url duplicates.
060713 150954 Reading content hashes...
060713 150954 Sorting content hashes...
060713 150954 Deleting content duplicates...
060713 150954 Deleted 0 content duplicates.
060713 150954 Duplicate deletion complete locally. Now returning to NFS...
060713 150954 DeleteDuplicates complete
060713 150954 Merging segment indexes...
060713 150954 crawl finished: crawl-20060713150946
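
One thing that may be worth checking is whether the root URL actually passes the filter. The comments in crawl-urlfilter.txt say the first matching pattern decides and that a URL matching nothing is ignored. The sketch below replays that rule with plain java.util.regex. It is only an approximation under stated assumptions: it does not use Nutch's own RegexURLFilter, the class and method names (UrlFilterSketch, parse, accepts) are made up for illustration, it assumes "contains" style matching via Matcher.find(), and it leaves out the rule the list archive redacted as [EMAIL PROTECTED]. Whether Nutch normalizes the seed URL before filtering is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch of a first-match-wins URL filter. */
public class UrlFilterSketch {

    /** One filter line: '+' means accept, '-' means reject. */
    public static final class Rule {
        public final boolean accept;
        public final Pattern pattern;

        public Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Rules copied from the posted file, minus the line the archive redacted. */
    public static final String[] POSTED_RULES = {
        "# skip file:, ftp:, & mailto: urls",
        "-^(file|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
        "+^http://([a-z0-9]*\\.)*apache.com/",
        "-."
    };

    /** Skips blank and comment lines; the first character is the sign. */
    public static List<Rule> parse(String[] lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty() || text.startsWith("#")) {
                continue;
            }
            char sign = text.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + text);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(text.substring(1))));
        }
        return rules;
    }

    /** The first rule whose pattern is found in the URL decides; no match means ignored. */
    public static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(POSTED_RULES);
        String[] candidates = {"http://www.apache.com", "http://www.apache.com/"};
        for (String url : candidates) {
            System.out.println(url + " -> " + (accepts(rules, url) ? "accepted" : "rejected"));
        }
    }
}
```

Under these assumptions the URL as written in crawl-root-urls.txt is rejected, because the accept pattern requires a "/" after apache.com and the URL then falls through to the final "-." rule, while the same URL with a trailing slash is accepted. If the real filter behaves the same way, that would be consistent with "Added 0 pages", so trying http://www.apache.com/ as the root URL could be a quick test.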
_______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general
