The urls file needed http://www.irs.gov/index.html; without the index.html it did not work. I fixed my server problem too! Now the IRS site has been nutched (to some extent, and the results can be seen here).

My objectives for using Nutch are to:
1. Hopefully learn something.
2. Create an index of tax-related web pages that are relevant to tax software developers.

I noticed something in the log file about not being able to parse PDF (is that true?). I can't wait to nutch some more...
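For anyone checking a seed URL against the crawl-urlfilter.txt rules quoted in the original message, here is a minimal Python sketch of first-match-wins rule evaluation. It only approximates the regex URL filter's behaviour and is not Nutch code. Assumptions: the rule the list archive redacted as "[EMAIL PROTECTED]" is left out, the space shown before irs.gov in the archived rule is treated as a line-wrap artifact, and the names RULES and accepts are made up for illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt in the original message.
# Assumption: the redacted "[EMAIL PROTECTED]" rule is omitted, and the
# host rule is written without the space seen in the archived text.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
    r"+^http://([a-z0-9]*\.)*irs.gov/",
    r"-.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    candidates = [
        "http://www.irs.gov/",
        "http://www.irs.gov/index.html",
        "http://www.irs.gov/logo.gif",
        "http://example.com/",
    ]
    for url in candidates:
        print(url, "->", "accept" if accepts(url) else "reject")
```

Under those assumptions both http://www.irs.gov/ and http://www.irs.gov/index.html pass the rules in this sketch, so it is only a quick way to rule the filter in or out before looking elsewhere.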
-----Original Message-----
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Saturday, February 25, 2006 11:50 PM
To: 'nutch-agent@lucene.apache.org'
Subject: Nutching IRS

Hi,

I am trying to set up Nutch on Tomcat 5.5/Windows2000/jdk1.5.0_04/latest CYGWIN. I think I am about 99% of the way there, but I finally hit a stumbling block. I followed the instructions to a T: set up the war in the root context, modified the config files, set env NUTCH_JAVA_HOME, etc.

I have 2 problems.

1. The crawl doesn't seem to be working. The crawled dir gets created, but see the log below: 0 records processed. My second problem is with the servlet (see 2. below). Thanks in advance for the help.

crawl-urlfilter.txt

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
[EMAIL PROTECTED]
+^http://([a-z0-9]*\.)* irs.gov/
-.

urls

http://www.irs.gov/

Log:

run java in C:\Program Files\Java\jdk1.5.0_04
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-default.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/crawl-tool.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-site.xml
060225 233931 No FS indicated, using default:local
060225 233931 crawl started in: crawled
060225 233931 rootUrlFile = urls
060225 233931 threads = 10
060225 233931 depth = 3
060225 233932 Created webdb at LocalFS,T:\nutch-0.7.1\crawled\db
060225 233932 Starting URL processing
060225 233932 Plugins: looking in: T:\nutch-0.7.1\plugins
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 not including: T:\nutch-0.7.1\plugins\creativecommons
060225 233932 parsing: T:\nutch-0.7.1\plugins\index-basic\plugin.xml
060225 233932 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060225 233932 not including: T:\nutch-0.7.1\plugins\index-more
060225 233932 not including: T:\nutch-0.7.1\plugins\language-identifier
060225 233932 parsing: T:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\ontology
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-ext
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-js
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-rss
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-file
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
060225 233932 parsing: T:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060225 233932 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-httpclient
060225 233932 parsing: T:\nutch-0.7.1\plugins\query-basic\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\query-more
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-site\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-url\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\urlfilter-prefix
060225 233933 parsing: T:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060225 233933 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060225 233933 found resource crawl-urlfilter.txt at file:/T:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060225 233933 Added 0 pages
060225 233933 FetchListTool started
060225 233933 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233933 Overall processing: Sorted NaN entries/second
060225 233933 FetchListTool completed
060225 233933 logging at INFO
060225 233934 Updating T:\nutch-0.7.1\crawled\db
060225 233934 Updating for T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233934 Finishing update
060225 233934 Update finished
060225 233934 FetchListTool started
060225 233935 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233935 Overall processing: Sorted NaN entries/second
060225 233935 FetchListTool completed
060225 233935 logging at INFO
060225 233936 Updating T:\nutch-0.7.1\crawled\db
060225 233936 Updating for T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233936 Finishing update
060225 233936 Update finished
060225 233936 FetchListTool started
060225 233936 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233936 Overall processing: Sorted NaN entries/second
060225 233936 FetchListTool completed
060225 233936 logging at INFO
060225 233937 Updating T:\nutch-0.7.1\crawled\db
060225 233938 Updating for T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Finishing update
060225 233938 Update finished
060225 233938 Updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938 reading T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938 reading T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938 reading T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Sorting pages by url...
060225 233938 Getting updated scores and anchors from db...
060225 233938 Sorting updates by segment...
060225 233938 Updating segments...
060225 233938 Done updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938 * Opening segment 20060225233933
060225 233938 * Indexing segment 20060225233933
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233933: total 0 records in 0.14 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938 * Opening segment 20060225233934
060225 233938 * Indexing segment 20060225233934
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233934: total 0 records in 0.031 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 * Opening segment 20060225233936
060225 233938 * Indexing segment 20060225233936
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233936: total 0 records in 0.032 s (NaN rec/s).
060225 233938 done indexing
060225 233938 Reading url hashes...
060225 233938 Sorting url hashes...
060225 233938 Deleting url duplicates...
060225 233938 Deleted 0 url duplicates.
060225 233938 Reading content hashes...
060225 233938 Sorting content hashes...
060225 233938 Deleting content duplicates...
060225 233938 Deleted 0 content duplicates.
060225 233938 Duplicate deletion complete locally. Now returning to NFS...
060225 233938 DeleteDuplicates complete
060225 233938 Merging segment indexes...
060225 233938 crawl finished: crawled

2.
Nutch seems to launch fine: http://24.75.221.234:8080/
When you search, you get the following error. Is this maybe because I haven't completed a good crawl yet?

org.apache.jasper.JasperException
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:370)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause

java.lang.NullPointerException
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:112)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software
coming soon: nutch.taxcodesoftware.org
Open directory of tax software development.